<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dhananjay Lakkawar</title>
    <description>The latest articles on Forem by Dhananjay Lakkawar (@dhananjay_lakkawar).</description>
    <link>https://forem.com/dhananjay_lakkawar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826432%2Fbdc9e69e-0a89-4399-9157-84d9089aaa30.png</url>
      <title>Forem: Dhananjay Lakkawar</title>
      <link>https://forem.com/dhananjay_lakkawar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dhananjay_lakkawar"/>
    <language>en</language>
    <item>
      <title>The Open-Source Alternative to Oracle 26ai: Why PostgreSQL is All You Need</title>
      <dc:creator>Dhananjay Lakkawar</dc:creator>
      <pubDate>Thu, 02 Apr 2026 20:03:14 +0000</pubDate>
      <link>https://forem.com/dhananjay_lakkawar/the-open-source-alternative-to-oracle-26ai-why-postgresql-is-all-you-need-3dcn</link>
      <guid>https://forem.com/dhananjay_lakkawar/the-open-source-alternative-to-oracle-26ai-why-postgresql-is-all-you-need-3dcn</guid>
      <description>&lt;p&gt;The database industry is currently undergoing a massive identity crisis. Driven by the Generative AI boom, legacy database vendors are rushing to reinvent themselves as the ultimate "all-in-one" AI platforms. &lt;/p&gt;

&lt;p&gt;The most recent, and perhaps most aggressive, example of this is &lt;strong&gt;Oracle AI Database 26ai&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;With the launch of 26ai, Oracle has made a very clear architectural statement: &lt;em&gt;The database should be the center of gravity for enterprise AI.&lt;/em&gt; They have embedded LLMs directly into the database engine, introduced native vector storage, and built the "Oracle Unified Memory Core" to provide persistent state for AI agents. They converge JSON, graph, vector, and relational data into a single, highly governed monolith.&lt;/p&gt;

&lt;p&gt;If you are a legacy enterprise with two decades of PL/SQL technical debt and heavy regulatory requirements, this makes a lot of sense. &lt;/p&gt;

&lt;p&gt;But if you are a startup founder, a scale-up CTO, or a cloud-native engineering team, adopting a monolithic, proprietary "AI Database" is a fast track to severe vendor lock-in and catastrophic licensing costs. &lt;/p&gt;

&lt;p&gt;As a cloud architect, I have a completely different philosophy. &lt;strong&gt;You do not need a proprietary AI database. You just need PostgreSQL, &lt;code&gt;pgvector&lt;/code&gt;, and scalable AWS cloud primitives.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Here is why PostgreSQL is the only AI database you actually need, and how to architect the open-source alternative to Oracle 26ai on AWS.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Myth of the "AI-Native" Monolith
&lt;/h2&gt;

&lt;p&gt;Oracle 26ai pushes the idea of running AI models and agentic workflows &lt;em&gt;directly inside the database container&lt;/em&gt; to eliminate data movement and avoid the "integration tax" of modern AI stacks. &lt;/p&gt;

&lt;p&gt;From an engineering perspective, this violates one of the core principles of modern system design: &lt;strong&gt;the separation of compute and storage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coupling unpredictable, highly intensive LLM inference compute with your mission-critical transactional database is an operational risk. If an AI agent hallucinates or gets stuck in a reasoning loop, you do not want it consuming the CPU cycles required to process your core user transactions.&lt;/p&gt;

&lt;p&gt;Instead, we can use &lt;strong&gt;Amazon Aurora PostgreSQL&lt;/strong&gt; paired with &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; to achieve the same "converged" AI capabilities, but with a decoupled, modular, and far more cost-effective architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Comparison: Monolithic vs. Composable
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9uy1mvj93ab79j5yyp.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9uy1mvj93ab79j5yyp.gif" alt="frist" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Deconstructing 26ai Features with PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Let’s break down the major selling points of proprietary AI databases and look at how the open-source ecosystem handles them natively today.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Vector Search &amp;amp; Similarity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Proprietary Claim:&lt;/strong&gt; You need a specialized engine or a massive vendor upgrade to handle vector search securely alongside relational data.&lt;br&gt;
&lt;strong&gt;The PostgreSQL Reality:&lt;/strong&gt; The open-source &lt;code&gt;pgvector&lt;/code&gt; extension has already won the vector database war. Running on Amazon Aurora, &lt;code&gt;pgvector&lt;/code&gt; utilizes Hierarchical Navigable Small World (HNSW) indexing to execute single-digit-millisecond similarity searches across millions of embeddings. You can join your vectors against standard relational tables in a single SQL query, with no expensive licensing required.&lt;/p&gt;
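&lt;p&gt;As a sketch of what that single-query join looks like (the &lt;code&gt;documents&lt;/code&gt;/&lt;code&gt;users&lt;/code&gt; tables and column names here are illustrative, not from any particular schema):&lt;/p&gt;

```python
# Sketch: pgvector HNSW index DDL plus a similarity search joined against
# relational data. Table and column names are hypothetical.

def hnsw_index_ddl(table: str = "documents", column: str = "embedding") -> str:
    """DDL for an HNSW index using cosine distance (pgvector >= 0.5)."""
    return f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);"

def similarity_search_sql(limit: int = 5) -> str:
    """Nearest-neighbour search and a plain relational join in one statement."""
    return (
        "SELECT d.id, d.title, u.email, "
        "       d.embedding <=> %(query_vec)s::vector AS distance "  # <=> is cosine distance
        "FROM documents d "
        "JOIN users u ON u.id = d.owner_id "
        "ORDER BY d.embedding <=> %(query_vec)s::vector "
        f"LIMIT {limit};"
    )
```

These strings can be executed with any PostgreSQL driver (e.g. psycopg), binding the query embedding to &lt;code&gt;query_vec&lt;/code&gt;.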

&lt;h3&gt;
  
  
  2. Multi-Model Data (JSON, Graph, Relational)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Proprietary Claim:&lt;/strong&gt; Modern apps need a single engine that syncs JSON documents, graphs, and relational tables.&lt;br&gt;
&lt;strong&gt;The PostgreSQL Reality:&lt;/strong&gt; PostgreSQL has been doing this for a decade. The &lt;code&gt;JSONB&lt;/code&gt; data type handles unstructured document data with indexing capabilities that rival dedicated NoSQL databases. If you need graph capabilities, Apache AGE brings graph queries directly into Postgres. It is the ultimate converged database.&lt;/p&gt;
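&lt;p&gt;A minimal sketch of the &lt;code&gt;JSONB&lt;/code&gt; pattern, with a hypothetical &lt;code&gt;events&lt;/code&gt; table and a GIN index to back containment queries:&lt;/p&gt;

```python
# Sketch: JSONB storage plus a GIN index, and a containment query the index
# can serve. The events table and JSON shape are hypothetical.

JSONB_DDL = """
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);
CREATE INDEX events_payload_gin ON events USING gin (payload);
"""

def containment_query(key: str, value: str) -> tuple[str, dict]:
    """Find rows whose JSONB payload contains {key: value} (@> uses the GIN index)."""
    import json
    sql = "SELECT id, payload FROM events WHERE payload @> %(doc)s::jsonb;"
    return sql, {"doc": json.dumps({key: value})}
```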

&lt;h3&gt;
  
  
  3. In-Database AI &amp;amp; Agent Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Proprietary Claim:&lt;/strong&gt; Running LLMs inside the database natively is faster and more secure.&lt;br&gt;
&lt;strong&gt;The PostgreSQL Reality:&lt;/strong&gt; If you &lt;em&gt;really&lt;/em&gt; want your database to invoke AI models without moving data, Amazon Aurora PostgreSQL provides the &lt;code&gt;aws_ml&lt;/code&gt; extension. This allows you to write standard SQL queries that securely invoke Amazon Bedrock directly from the database engine. &lt;/p&gt;
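&lt;p&gt;As a rough sketch, a SQL call through the extension looks something like the following; the &lt;code&gt;aws_bedrock.invoke_model&lt;/code&gt; signature shown is my best understanding and should be verified against the Aurora documentation for your engine version:&lt;/p&gt;

```python
# Sketch: building a SQL statement that invokes Bedrock from inside Aurora
# PostgreSQL via the aws_ml extension. The function signature is an
# assumption; check your Aurora version's docs before relying on it.
import json

def bedrock_from_sql(prompt: str,
                     model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> tuple[str, dict]:
    """Return (sql, params) for a driver to execute inside the database."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    })
    sql = (
        "SELECT aws_bedrock.invoke_model("
        "model_id := %(model_id)s, "
        "content_type := 'application/json', "
        "accept_type := 'application/json', "
        "model_input := %(body)s);"
    )
    return sql, {"model_id": model_id, "body": body}
```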

&lt;p&gt;However, in 90% of real-world use cases, &lt;strong&gt;you shouldn't do this.&lt;/strong&gt; It is architecturally safer to keep your agentic orchestration in a stateless compute layer (like AWS Lambda or Step Functions) and treat PostgreSQL strictly as your robust, highly available storage engine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Composable RAG Architecture on AWS
&lt;/h2&gt;

&lt;p&gt;When you decouple your AI from your database, your Retrieval-Augmented Generation (RAG) architecture becomes incredibly flexible. You aren't locked into Oracle's specific LLM partnerships or pricing models. You can swap out a Claude 3.5 model for a Llama 3 model in Amazon Bedrock with a single line of code, while your PostgreSQL database remains completely untouched.&lt;/p&gt;
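&lt;p&gt;Here is a minimal sketch of that swap using the Bedrock Converse API; the model IDs and prompt shape are illustrative, and &lt;code&gt;boto3&lt;/code&gt; is imported lazily so the request-building helper works without AWS credentials:&lt;/p&gt;

```python
# Sketch: with the Bedrock Converse API, switching foundation models is a
# one-parameter change; the PostgreSQL layer is untouched.

def build_converse_request(model_id: str, question: str, context: str) -> dict:
    """Assemble a provider-agnostic Converse request for a RAG answer."""
    return {
        # Swap e.g. "anthropic.claude-3-5-sonnet-..." for "meta.llama3-..." here.
        "modelId": model_id,
        "messages": [{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }

def answer(model_id: str, question: str, context: str) -> str:
    import boto3  # lazy import: keeps the helper above usable offline
    client = boto3.client("bedrock-runtime")
    resp = client.converse(**build_converse_request(model_id, question, context))
    return resp["output"]["message"]["content"][0]["text"]
```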

&lt;p&gt;Here is what the standard production RAG flow looks like on AWS:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qyky1f4wrlrt56byj90.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qyky1f4wrlrt56byj90.gif" alt="second" width="200" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The CTO Perspective: Build vs. Buy and the Economics of AI
&lt;/h2&gt;

&lt;p&gt;As a technology leader, choosing your database is one of the most consequential decisions you will make. It dictates your hiring, your hosting costs, and your long-term agility.&lt;/p&gt;

&lt;p&gt;Proprietary AI databases operate on the "convenience tax" model. They promise to reduce the complexity of wiring together different AI components, but the tradeoff is total vendor capture. &lt;/p&gt;

&lt;p&gt;Here is why building on open-source PostgreSQL is the only logical choice for cloud-native teams:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Talent Density
&lt;/h3&gt;

&lt;p&gt;Every competent backend engineer knows Postgres. You don't need to hire specialized, highly paid DBAs to manage proprietary AI syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. True Cloud Economics
&lt;/h3&gt;

&lt;p&gt;With Amazon Aurora Serverless v2, your database automatically scales up during high-traffic AI inference events and scales down to practically nothing at midnight. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. Future-Proofing
&lt;/h3&gt;

&lt;p&gt;The AI landscape changes every three weeks. By keeping your data in standard, open-source PostgreSQL and handling AI via Amazon Bedrock, you can rapidly adopt next month's breakthrough model without needing a database migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The "Lock-in" Economic Risk
&lt;/h3&gt;

&lt;p&gt;Architectural decisions are ultimately about &lt;strong&gt;leverage&lt;/strong&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Oracle Cost Risk:&lt;/strong&gt; If Oracle increases its "AI Option" license fee by 20% next year, you are trapped. Migrating a monolithic database containing your vectors, agents, and relational data is a multi-year, multi-million dollar project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The AWS Composable Risk:&lt;/strong&gt; If Amazon Bedrock becomes too expensive, you simply point your Lambda function to OpenAI, Anthropic, or a self-hosted Llama 3 model on an EC2 instance. Your database (Postgres) remains unchanged. &lt;em&gt;You retain price leverage over your AI providers.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary Table: Estimated Monthly Spend (Mid-Sized App)
&lt;/h3&gt;

&lt;p&gt;To put this in perspective, here is a rough look at the unit economics of a mid-sized production application running a monolithic proprietary stack vs. an open-source composable stack on AWS:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Oracle 26ai&lt;/th&gt;
&lt;th&gt;AWS Composable Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$2,000+ (Subscription)&lt;/td&gt;
&lt;td&gt;$0 (Open Source)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute/Instance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$800 (Fixed)&lt;/td&gt;
&lt;td&gt;$200 (Aurora Serverless avg)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Included in Compute&lt;/td&gt;
&lt;td&gt;$100 (Token-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-DB (Fixed)&lt;/td&gt;
&lt;td&gt;$10 (Lambda/Step Functions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Est. Monthly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,800/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$310/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Verdict: Beware the Gold-Plated Handcuffs
&lt;/h2&gt;

&lt;p&gt;The AWS architecture described in this post is approximately 80-90% more cost-effective for new builds, startups, and scale-ups. &lt;/p&gt;

&lt;p&gt;Oracle 26ai only becomes "cost-effective" when the cost of migrating away from an existing Oracle ecosystem exceeds the exorbitant licensing fees, a situation often referred to in enterprise IT as the "Gold-Plated Handcuffs."&lt;/p&gt;

&lt;p&gt;Oracle 26ai is an impressive piece of engineering designed to keep enterprise data exactly where it is. But for teams building the next generation of software, AI does not need to be a proprietary database feature. &lt;/p&gt;

&lt;p&gt;By combining the rock-solid reliability of PostgreSQL with the raw power of AWS cloud primitives, you can build massively scalable, AI-native applications without ever sacrificing your budget or your architectural freedom.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Are you running your vector workloads inside PostgreSQL, or did you adopt a dedicated vector database? Let's discuss the tradeoffs in the comments below!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>database</category>
      <category>opensource</category>
      <category>postgres</category>
    </item>
    <item>
      <title>The 15-Millisecond AI: Building "Pre-Cognitive" Edge Caching on AWS</title>
      <dc:creator>Dhananjay Lakkawar</dc:creator>
      <pubDate>Sun, 29 Mar 2026 19:17:07 +0000</pubDate>
      <link>https://forem.com/dhananjay_lakkawar/the-15-millisecond-ai-building-pre-cognitive-edge-caching-on-aws-ad7</link>
      <guid>https://forem.com/dhananjay_lakkawar/the-15-millisecond-ai-building-pre-cognitive-edge-caching-on-aws-ad7</guid>
      <description>&lt;p&gt;If you want to watch a product manager's soul leave their body, sit in on a live demo of a Generative AI feature where the model takes 12 seconds to generate a response. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Typing... typing... typing...&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;In the world of AI product development, &lt;strong&gt;latency is the ultimate UX killer.&lt;/strong&gt; You can have the smartest prompt and the most expensive foundational model in the world, but if your users have to stare at a spinning loading wheel for 10 seconds every time they click a button, they will abandon your app. &lt;/p&gt;

&lt;p&gt;Most engineering teams try to solve this by streaming tokens to the frontend or switching to smaller, less capable models. But as a cloud architect, I prefer a different approach. &lt;/p&gt;

&lt;p&gt;What if we stopped waiting for the user to ask the question? &lt;/p&gt;

&lt;p&gt;What if we used the user's application state to predict what they are going to ask, generated the answer in the background, and pushed it to a CDN edge location before their mouse even hovers over the button?&lt;/p&gt;

&lt;p&gt;When I sketch this out for engineering leaders, the reaction is almost always the same: &lt;em&gt;"Wait, we can pre-generate AI responses in the background and cache them at the CDN level to completely bypass inference latency?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yes. Here is how to build a "Pre-Cognitive" AI architecture using &lt;strong&gt;AWS Step Functions&lt;/strong&gt;, &lt;strong&gt;Amazon Bedrock&lt;/strong&gt;, and &lt;strong&gt;Amazon CloudFront with Lambda@Edge&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Concept: From Reactive AI to Proactive Caching
&lt;/h2&gt;

&lt;p&gt;Think about your favorite SaaS dashboard. When a user logs in on Monday morning, their "next best actions" are highly predictable. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They are going to ask for a summary of weekend alerts.&lt;/li&gt;
&lt;li&gt;They are going to ask for the status of their latest deployment.&lt;/li&gt;
&lt;li&gt;They are going to ask for a draft reply to their most urgent ticket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of waiting for the user to click "Summarize Alerts" and forcing them to wait 8 seconds for an LLM to read the data, we move the LLM inference out of the synchronous request path and into an asynchronous background job. &lt;/p&gt;

&lt;p&gt;We generate the responses, store them as key-value pairs, and push them to the network edge. When the user finally clicks the button, the response loads in &lt;strong&gt;15 milliseconds&lt;/strong&gt;. It feels like magic. &lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Phase 1 (Background Generation)
&lt;/h2&gt;

&lt;p&gt;To make this work without slowing down the initial user login, we decouple the generation using an event-driven flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpm56kw6ro2hyk6gn52py.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpm56kw6ro2hyk6gn52py.gif" alt="frist image" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Trigger:&lt;/strong&gt; When the user logs in (or enters a specific workflow), your backend fires an event to AWS EventBridge.&lt;br&gt;
&lt;strong&gt;2. The Orchestrator:&lt;/strong&gt; AWS Step Functions takes over. It acts as the background traffic cop, ensuring your API doesn't hang. &lt;br&gt;
&lt;strong&gt;3. The Inference:&lt;/strong&gt; A Lambda function analyzes the user's state, grabs the required context, and fires off 3 concurrent prompts to Amazon Bedrock (using a fast, cheap model like Claude 3 Haiku). &lt;br&gt;
&lt;strong&gt;4. The Edge Push:&lt;/strong&gt; Once Bedrock returns the generated text, Lambda pushes these pre-computed AI responses into &lt;strong&gt;Amazon CloudFront KeyValueStore&lt;/strong&gt; (a globally distributed datastore designed specifically for edge functions) keyed by &lt;code&gt;UserID_ActionID&lt;/code&gt;.&lt;/p&gt;
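&lt;p&gt;A sketch of steps 3 and 4 in Python; the store ARN, the three predicted actions, and everything beyond the &lt;code&gt;UserID_ActionID&lt;/code&gt; key scheme are assumptions, and the &lt;code&gt;cloudfront-keyvaluestore&lt;/code&gt; API requires the store's current ETag on every write:&lt;/p&gt;

```python
# Sketch of Phase 1's edge push: pre-computed AI responses are written to
# CloudFront KeyValueStore keyed by UserID_ActionID. boto3 is imported lazily
# so the pure helpers stay usable without AWS credentials.
import json

PREDICTED_ACTIONS = ["summarize_alerts", "deployment_status", "draft_reply"]

def cache_key(user_id: str, action_id: str) -> str:
    return f"{user_id}_{action_id}"

def push_to_edge(user_id: str, responses: dict, kvs_arn: str) -> None:
    import boto3
    kvs = boto3.client("cloudfront-keyvaluestore")
    etag = kvs.describe_key_value_store(KvsARN=kvs_arn)["ETag"]
    for action_id, text in responses.items():
        resp = kvs.put_key(
            KvsARN=kvs_arn,
            IfMatch=etag,  # optimistic concurrency control on the store
            Key=cache_key(user_id, action_id),
            Value=json.dumps({"text": text}),
        )
        etag = resp["ETag"]  # each write returns the ETag for the next one
```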




&lt;h2&gt;
  
  
  The Architecture: Phase 2 (The 15ms Delivery)
&lt;/h2&gt;

&lt;p&gt;Now, the user is looking at their dashboard. They see a button that says &lt;em&gt;"✨ Generate Morning Briefing."&lt;/em&gt; They click it.&lt;/p&gt;

&lt;p&gt;Because we are using CloudFront and Lambda@Edge (or CloudFront Functions), the request never even reaches your primary backend servers in &lt;code&gt;us-east-1&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9m7d6l3v5w97cm0srdw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9m7d6l3v5w97cm0srdw.gif" alt="second video" width="200" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Interception:&lt;/strong&gt; The user's HTTPS request hits the closest AWS Edge location (e.g., a server in London, Tokyo, or New York). Lambda@Edge intercepts the request.&lt;br&gt;
&lt;strong&gt;2. The Edge Lookup:&lt;/strong&gt; Lambda@Edge checks the attached CloudFront KeyValueStore for the user's pre-generated response. &lt;br&gt;
&lt;strong&gt;3. Instant Delivery:&lt;/strong&gt; If the response is there, it is returned instantly. The user experiences sub-20ms latency for a complex Generative AI task. &lt;br&gt;
&lt;strong&gt;4. The Fallback:&lt;/strong&gt; If the user asks a completely custom question that we didn't predict, Lambda@Edge simply forwards the request to your standard API Gateway/Bedrock backend to generate the response synchronously. &lt;/p&gt;
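&lt;p&gt;A sketch of the interception logic. One caveat: native KeyValueStore reads are a CloudFront Functions (JavaScript) feature, so this Python Lambda@Edge sketch hides the lookup behind a stubbed &lt;code&gt;lookup()&lt;/code&gt; helper and focuses on the hit/miss routing:&lt;/p&gt;

```python
# Sketch of the Phase 2 viewer-request handler: serve the pre-generated
# response on a hit, fall through to the origin on a miss. lookup() is a
# hypothetical placeholder for the edge key-value read.

def lookup(key: str):
    """Placeholder for the edge key-value lookup; returns None on a miss."""
    return None

def handler(event, context=None):
    request = event["Records"][0]["cf"]["request"]
    user_id = request["headers"].get("x-user-id", [{"value": "anon"}])[0]["value"]
    action_id = request["uri"].rstrip("/").split("/")[-1]
    cached = lookup(f"{user_id}_{action_id}")
    if cached is not None:
        # Cache hit: short-circuit with the pre-generated response;
        # the request never reaches the origin.
        return {
            "status": "200",
            "statusDescription": "OK",
            "headers": {"content-type": [{"key": "Content-Type",
                                          "value": "application/json"}]},
            "body": cached,
        }
    return request  # miss: forward to the origin (API Gateway + Bedrock)
```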




&lt;h2&gt;
  
  
  The CTO Perspective: Tradeoffs and Reality Checks
&lt;/h2&gt;

&lt;p&gt;As a technology strategist, I will be the first to tell you that "magic" always comes with an engineering invoice. You should only use this pattern if you understand the tradeoffs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Cost of Wasted Compute
&lt;/h3&gt;

&lt;p&gt;By predicting 3 things the user &lt;em&gt;might&lt;/em&gt; ask, you are generating tokens that might never be read. You are trading compute cost for user experience. &lt;br&gt;
&lt;strong&gt;The Mitigation:&lt;/strong&gt; Only use this pattern with ultra-cheap, highly efficient models like Claude 3 Haiku or Llama 3 8B. Do not use Claude 3 Opus or GPT-4o for speculative background generation, or you will torch your AWS bill.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. State Invalidation
&lt;/h3&gt;

&lt;p&gt;What happens if you pre-generate a "Deployment Summary" at 9:00 AM, but at 9:05 AM a deployment fails, and the user clicks the button at 9:06 AM? The cached AI response is now lying to them.&lt;br&gt;
&lt;strong&gt;The Mitigation:&lt;/strong&gt; Tie your cache invalidation to your application's critical state changes. If a critical DB row updates, fire an EventBridge rule that immediately deletes the stale key from the CloudFront KeyValueStore. &lt;/p&gt;
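&lt;p&gt;A sketch of that invalidation Lambda; the event detail fields and the store ARN are hypothetical:&lt;/p&gt;

```python
# Sketch: an EventBridge-triggered Lambda deletes the stale edge key the
# moment the underlying state changes. Detail fields and KVS_ARN are
# hypothetical; boto3 is imported lazily so stale_key is testable offline.

KVS_ARN = "arn:aws:cloudfront::123456789012:key-value-store/example"

def stale_key(detail: dict) -> str:
    return f"{detail['user_id']}_{detail['action_id']}"

def on_state_change(event, context=None):
    import boto3
    kvs = boto3.client("cloudfront-keyvaluestore")
    etag = kvs.describe_key_value_store(KvsARN=KVS_ARN)["ETag"]
    kvs.delete_key(KvsARN=KVS_ARN, IfMatch=etag, Key=stale_key(event["detail"]))
```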

&lt;h3&gt;
  
  
  3. Build Complexity vs. Product Value
&lt;/h3&gt;

&lt;p&gt;Don't build this for a general-purpose chatbox. Humans are too unpredictable. Build this for &lt;strong&gt;highly structured, high-value UX checkpoints&lt;/strong&gt;—like daily briefings, code review summaries, or personalized dashboard greetings. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;When we build AI applications, we often forget that the rules of distributed systems still apply. You don't have to accept the latency of a foundational model as a fixed constraint. &lt;/p&gt;

&lt;p&gt;By aggressively predicting user intent and leveraging AWS edge networking primitives like CloudFront and Lambda@Edge, you can completely mask LLM latency. &lt;/p&gt;

&lt;p&gt;It takes your application from feeling like a "cool AI wrapper" to feeling like a deeply integrated, hyper-responsive superpower. &lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you struggled with GenAI latency in your production applications? Are you using streaming, or have you started exploring asynchronous generation? Let me know your architecture in the comments below.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>ai</category>
      <category>serverless</category>
      <category>cloudfront</category>
    </item>
    <item>
      <title>The $50,000 Chat History Problem: Building Event-Driven AI Memory on AWS</title>
      <dc:creator>Dhananjay Lakkawar</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:46:37 +0000</pubDate>
      <link>https://forem.com/dhananjay_lakkawar/the-50000-chat-history-problem-building-event-driven-ai-memory-on-aws-48c5</link>
      <guid>https://forem.com/dhananjay_lakkawar/the-50000-chat-history-problem-building-event-driven-ai-memory-on-aws-48c5</guid>
      <description>&lt;p&gt;It was 11:00 PM on a Tuesday when my friend startup's CTO dropped a screenshot of their monthly cloud bill into the engineering Slack channel. &lt;/p&gt;

&lt;p&gt;The AWS infrastructure costs were flat. But their LLM inference API bill looked like a hockey stick pointing straight up. &lt;/p&gt;

&lt;p&gt;"Why are we burning thousands of dollars a day on Claude 3 Opus?" she asked.&lt;/p&gt;

&lt;p&gt;The lead engineer replied: "Because to make the AI assistant feel 'smart' and remember the user, we have to pass their entire conversation history into the context window for every single message. If they've been using the app for a month, we are passing 80,000 tokens just so the bot remembers their dog's name when they say 'hello'."&lt;/p&gt;

&lt;p&gt;They had fallen into the classic Generative AI trap: &lt;strong&gt;Treating the LLM's context window as a database.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a cloud architect, I love it when we can take "boring" cloud primitives and combine them with AI to create something that feels like magic but is actually just brilliant, highly scalable engineering. If you want to make a CTO stop in their tracks, rethink their architecture, and say, &lt;em&gt;"Wait, is this actually possible?"&lt;/em&gt; you need to move away from standard chatbots.&lt;/p&gt;

&lt;p&gt;Here is an architectural pivot that radically changes how an AI application scales, operates, and spends money: &lt;strong&gt;Event-Driven AI Memory using AWS EventBridge.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pivot: From Context Windows to a "Neural Memory Bus"
&lt;/h2&gt;

&lt;p&gt;The traditional approach to AI memory is brute force: stuff conversational history into giant, expensive LLM context windows, or build complex Retrieval-Augmented Generation (RAG) pipelines over raw chat logs. &lt;/p&gt;

&lt;p&gt;Both approaches are slow, expensive, and prone to losing important details in the noise.&lt;/p&gt;

&lt;p&gt;Instead of keeping a running transcript of everything the user has ever said, what if we decoupled "memory" from the "chat interface" entirely? What if we treated user actions as asynchronous events?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: Building the "Fact Store"
&lt;/h3&gt;

&lt;p&gt;We can achieve this by combining &lt;strong&gt;AWS EventBridge&lt;/strong&gt;, &lt;strong&gt;AWS Lambda&lt;/strong&gt;, &lt;strong&gt;Amazon DynamoDB&lt;/strong&gt;, and a hyper-fast, cheap LLM like &lt;strong&gt;Claude 3 Haiku via Amazon Bedrock&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Here is how the event-driven memory pipeline works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxle0hw68dg5ac264vhcu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxle0hw68dg5ac264vhcu.gif" alt="hello world" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: The Event Bus&lt;/strong&gt;&lt;br&gt;
Route &lt;em&gt;every&lt;/em&gt; user action in your app, not just chat messages but button clicks, page views, and settings changes, through AWS EventBridge as standard JSON events. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: The Memory Extractor (Async)&lt;/strong&gt;&lt;br&gt;
Have a lightweight AWS Lambda function subscribe to these events. When an event fires, the Lambda function passes the event payload to a fast, cheap model like Claude Haiku. &lt;/p&gt;

&lt;p&gt;The system prompt is simple: &lt;em&gt;"You are a background observer. Review this user event. Extract any permanent, highly relevant facts about this user. Output as a JSON array. If nothing is relevant, return an empty array."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: The Fact Store (DynamoDB)&lt;/strong&gt;&lt;br&gt;
If Haiku detects a fact (e.g., &lt;em&gt;User is building a SaaS&lt;/em&gt;, &lt;em&gt;User prefers Python&lt;/em&gt;, &lt;em&gt;User operates in the EU&lt;/em&gt;), the Lambda function upserts that key-value pair into an Amazon DynamoDB table keyed by the &lt;code&gt;UserID&lt;/code&gt;. This is your "Fact Store": a living, breathing profile of the user.&lt;/p&gt;
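&lt;p&gt;The extraction and upsert steps can be sketched as follows; the Bedrock call itself is left out, and the table name, attribute names, and event shape are assumptions:&lt;/p&gt;

```python
# Sketch of Steps 2 and 3: build the extractor prompt for a cheap model,
# then merge any returned facts into the user's DynamoDB Fact Store item.
import json

EXTRACTOR_PROMPT = (
    "You are a background observer. Review this user event. Extract any "
    "permanent, highly relevant facts about this user. Output as a JSON "
    "array. If nothing is relevant, return an empty array."
)

def build_extraction_messages(event_detail: dict) -> list:
    """Messages payload for a fast, cheap model like Claude 3 Haiku."""
    return [{"role": "user",
             "content": f"{EXTRACTOR_PROMPT}\n\nEvent:\n{json.dumps(event_detail)}"}]

def upsert_facts(user_id: str, facts: list, table) -> None:
    """Merge facts into the Fact Store (table is a boto3 DynamoDB Table resource)."""
    for fact in facts:
        table.update_item(
            Key={"UserID": user_id},
            UpdateExpression="ADD Facts :f",
            ExpressionAttributeValues={":f": {fact}},  # one-element string set
        )
```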




&lt;h2&gt;
  
  
  The "Aha!" Moment: Querying the AI
&lt;/h2&gt;

&lt;p&gt;Now, let's go back to that expensive chat interface. &lt;/p&gt;

&lt;p&gt;When the user asks a complex question, you &lt;strong&gt;do not&lt;/strong&gt; query a massive chat history. You don't pass 80,000 tokens of past transcripts. &lt;/p&gt;

&lt;p&gt;Instead, your backend does a single-digit-millisecond &lt;code&gt;GetItem&lt;/code&gt; lookup against DynamoDB for that user's Fact Profile. You take those concentrated facts and inject them into the system prompt of your heavy-lifting model (like Claude 3.5 Sonnet or Opus).&lt;/p&gt;
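&lt;p&gt;A sketch of that read path; the table and attribute names (&lt;code&gt;UserID&lt;/code&gt;, &lt;code&gt;Facts&lt;/code&gt;) are assumptions:&lt;/p&gt;

```python
# Sketch: a single GetItem replaces tens of thousands of tokens of chat
# transcript. The condensed facts become the system prompt of the
# heavy-lifting model.

def load_facts(user_id: str, table) -> list:
    """table is a boto3 DynamoDB Table resource; one fast GetItem call."""
    item = table.get_item(Key={"UserID": user_id}).get("Item", {})
    return sorted(item.get("Facts", []))

def fact_system_prompt(facts: list) -> str:
    lines = "\n".join(f"- {f}" for f in facts)
    return f"You are a helpful assistant. Known facts about this user:\n{lines}"
```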

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn5xd1ra47v2vcqslgf2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn5xd1ra47v2vcqslgf2.gif" alt="secoind image" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The CTO’s Reaction: Why This Pattern Wins
&lt;/h2&gt;

&lt;p&gt;When you explain this architecture to engineering leaders, the reaction is almost always the same: &lt;em&gt;"Wait, we can use EventBridge as a global 'neural memory bus' for our AI?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yes. And here is why this tradeoff makes sense for scaling startups:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Massive Cost Reduction
&lt;/h3&gt;

&lt;p&gt;You are swapping synchronous, high-token inference on your most expensive model for asynchronous, low-token inference on your cheapest model. A 1,000-token prompt to Claude Haiku costs fractions of a cent. Querying a DynamoDB table costs practically nothing. Token consumption for a typical request can drop by 90% or more.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Infinite Scale and Speed
&lt;/h3&gt;

&lt;p&gt;DynamoDB delivers single-digit millisecond performance at any scale. Because you are only injecting a condensed JSON object of "Facts" into your final chat prompt, your time-to-first-token (TTFT) drops drastically. The AI responds faster because it has less text to read.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Omnichannel Intelligence
&lt;/h3&gt;

&lt;p&gt;Because the memory is tied to EventBridge, not the chat window, the AI learns from the user's &lt;em&gt;actions&lt;/em&gt;, not just their words. If a user struggles with a dashboard and triggers three "Error 500" events, the Fact Store updates. When they finally open the support chatbot, the AI already knows they are frustrated and exactly which error they hit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;We need to stop treating Large Language Models as databases. They are reasoning engines. &lt;/p&gt;

&lt;p&gt;By leveraging standard, highly scalable cloud primitives like AWS EventBridge and DynamoDB, we can offload the burden of memory from the LLM context window into actual infrastructure. &lt;/p&gt;

&lt;p&gt;It feels like AI magic to the user, but under the hood? It’s just brilliant, boring, beautiful engineering.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you hit the "context window cost wall" in your generative AI applications yet? Let me know in the comments how your team is managing AI memory at scale.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>ai</category>
      <category>eventbridge</category>
      <category>serverless</category>
    </item>
    <item>
<title>Treating Prompts Like Code: Building CI/CD for LLM Workflows on AWS</title>
      <dc:creator>Dhananjay Lakkawar</dc:creator>
      <pubDate>Tue, 24 Mar 2026 14:31:00 +0000</pubDate>
      <link>https://forem.com/dhananjay_lakkawar/-treating-prompts-like-code-building-cicd-for-llm-workflows-on-aws-5gc4</link>
      <guid>https://forem.com/dhananjay_lakkawar/-treating-prompts-like-code-building-cicd-for-llm-workflows-on-aws-5gc4</guid>
      <description>&lt;p&gt;If you look at the codebase of an early-stage AI startup, you will almost always find a file named &lt;code&gt;utils.py&lt;/code&gt; or &lt;code&gt;constants.js&lt;/code&gt; containing massive blocks of hardcoded text. &lt;/p&gt;

&lt;p&gt;These are the LLM system prompts. &lt;/p&gt;

&lt;p&gt;When a model hallucination occurs in production, a developer goes into the code, tweaks a few sentences in the prompt, runs a quick manual test, and pushes the change to production. &lt;/p&gt;

&lt;p&gt;This works for prototypes, but for production systems, &lt;strong&gt;this is a massive operational risk.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;"Prompt drift" is real. A small change designed to fix an edge case can unintentionally break the formatting, tone, or logic for dozens of other use cases. If you want to build reliable AI systems, you have to stop treating prompts like magical incantations and start treating them like code.&lt;/p&gt;

&lt;p&gt;Here is how a modern engineering team architects an automated, version-controlled CI/CD pipeline for LLM prompts using &lt;strong&gt;GitHub Actions&lt;/strong&gt;, &lt;strong&gt;AWS CodePipeline&lt;/strong&gt;, and &lt;strong&gt;AWS Systems Manager (SSM) Parameter Store&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem: Tightly Coupled AI
&lt;/h2&gt;

&lt;p&gt;When you hardcode prompts into your application logic (e.g., inside an AWS Lambda function), you tightly couple your application release cycle with your AI tuning cycle. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  To fix a typo in a prompt, you have to redeploy the entire application.&lt;/li&gt;
&lt;li&gt;  You have no historical record of &lt;em&gt;why&lt;/em&gt; a prompt changed and how it affected output quality.&lt;/li&gt;
&lt;li&gt;  You have no automated gate preventing a "bad" prompt from reaching production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution is to decouple the prompt from the code, version it in Git, evaluate it automatically, and inject it at runtime.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Serverless Prompt Pipeline Architecture
&lt;/h2&gt;

&lt;p&gt;To bring engineering rigor to our AI workflows, we need three distinct layers: &lt;strong&gt;Storage&lt;/strong&gt;, &lt;strong&gt;Evaluation&lt;/strong&gt;, and &lt;strong&gt;Runtime Injection&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Git &amp;amp; Evaluation Flow
&lt;/h3&gt;

&lt;p&gt;Instead of hardcoding strings, developers maintain a &lt;code&gt;prompts.json&lt;/code&gt; or &lt;code&gt;prompts.yaml&lt;/code&gt; file in their repository. When a pull request is opened, it triggers an evaluation pipeline.&lt;/p&gt;
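
&lt;p&gt;A &lt;code&gt;prompts.json&lt;/code&gt; entry might look like this (the key names and model ID are illustrative, not a required schema):&lt;/p&gt;

```json
{
  "customer_service_prompt": {
    "version": 14,
    "model": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "template": "You are a support agent for ACME Corp. Be concise, stay polite, and always return valid JSON with the keys 'reply' and 'sentiment'."
  }
}
```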

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0fphgngyrhorrrvut5w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0fphgngyrhorrrvut5w.gif" alt="Git and evaluation flow" width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Runtime Injection (AWS SSM Parameter Store)
&lt;/h3&gt;

&lt;p&gt;Once the CI/CD pipeline validates that the new prompt doesn't break existing functionality, it uses the AWS CLI/SDK to push the updated prompt string into &lt;strong&gt;AWS SSM Parameter Store&lt;/strong&gt; (e.g., under the path &lt;code&gt;/prod/llm/customer_service_prompt&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;When your application (running on AWS Lambda, ECS, or EKS) is invoked, it dynamically fetches the prompt from SSM. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pn7ielfnvet3fm7s2hp.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pn7ielfnvet3fm7s2hp.gif" alt="Second flow" width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
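
&lt;p&gt;The runtime side can be sketched in a few lines (the parameter path and the handler's return shape are the assumptions here; add &lt;code&gt;WithDecryption=True&lt;/code&gt; to the call if you store the prompt as a SecureString):&lt;/p&gt;

```python
import os

PROMPT_PARAM = os.environ.get("PROMPT_PARAM", "/prod/llm/customer_service_prompt")

def fetch_prompt(ssm, name):
    """Read the current prompt version from SSM Parameter Store."""
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]

def handler(event, context):
    import boto3  # deferred import keeps fetch_prompt unit-testable with a stub client
    prompt = fetch_prompt(boto3.client("ssm"), PROMPT_PARAM)
    # ...assemble the messages with the fetched prompt and call Bedrock here...
    return {"statusCode": 200, "body": prompt}
```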




&lt;h2&gt;
  
  
  The CTO Perspective: Why Architect It This Way?
&lt;/h2&gt;

&lt;p&gt;Building this pipeline requires upfront engineering effort. Here is why it is worth it for scaling teams:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Zero-Downtime Prompt Updates
&lt;/h3&gt;

&lt;p&gt;Because the Lambda function fetches the prompt from SSM at runtime, your product managers or AI engineers can deploy prompt improvements instantly without requiring a full backend deployment or passing through a lengthy code build process. &lt;/p&gt;

&lt;h3&gt;
  
  
  2. Guarding Against Regression
&lt;/h3&gt;

&lt;p&gt;The "Automated Evaluation Gate" is the most critical piece of this architecture. You maintain a "Golden Dataset" of 50-100 real user inputs and expected outputs. &lt;br&gt;
During the CI phase, you run the proposed prompt against this dataset using an "LLM-as-a-judge" pattern. If the new prompt causes the model to start hallucinating or dropping required JSON keys, the pipeline fails the build automatically.&lt;/p&gt;
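
&lt;p&gt;The gate itself reduces to a loop over the golden dataset. In this sketch, &lt;code&gt;generate&lt;/code&gt; runs the proposed prompt through the model under test and &lt;code&gt;judge&lt;/code&gt; stands in for the LLM-as-a-judge call; both are injected, so your CI script decides how they hit Bedrock:&lt;/p&gt;

```python
def run_eval_gate(golden_cases, generate, judge):
    """Run the candidate prompt over every golden case; any FAIL blocks the merge."""
    failures = []
    for case in golden_cases:
        output = generate(case["input"])   # model under test, using the proposed prompt
        verdict = judge(case["input"], case["expected"], output)  # "PASS" or "FAIL"
        if verdict != "PASS":
            failures.append(case["input"])
    return failures
```

&lt;p&gt;In the workflow, a non-empty failure list exits non-zero, which fails the build automatically.&lt;/p&gt;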

&lt;h3&gt;
  
  
  3. Auditability and Rollbacks
&lt;/h3&gt;

&lt;p&gt;Because SSM Parameter Store supports versioning, you get an automatic audit trail. If Version 14 of your prompt causes issues in production, rolling back is simply a matter of reverting to Version 13 via the AWS Console or CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Engineering Tradeoffs &amp;amp; Best Practices
&lt;/h2&gt;

&lt;p&gt;If you implement this architecture tomorrow, keep these real-world constraints in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;SSM API Limits:&lt;/strong&gt; AWS SSM Parameter Store has API rate limits. If you have a high-traffic API (e.g., hundreds of requests per second), fetching the prompt from SSM on &lt;em&gt;every single invocation&lt;/em&gt; will result in &lt;code&gt;ThrottlingException&lt;/code&gt; errors. 

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;The Fix:&lt;/em&gt; Implement caching inside your Lambda execution environment (e.g., caching the prompt in memory outside the handler function for 5 minutes), or use &lt;strong&gt;AWS AppConfig&lt;/strong&gt;, which is explicitly designed for high-throughput dynamic configuration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Evaluation Costs:&lt;/strong&gt; Running 100 tests through Claude 3.5 Sonnet on every single Git commit will spike your Amazon Bedrock bill. 

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;The Fix:&lt;/em&gt; Run the full evaluation suite only on merges to the &lt;code&gt;main&lt;/code&gt; branch, or use a smaller, cheaper model (like Claude 3 Haiku) to run quick sanity checks on feature branches.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;String Limits:&lt;/strong&gt; Standard SSM parameters have a 4KB size limit. If you are using massive few-shot prompts with thousands of tokens, you will need to use the &lt;em&gt;Advanced Parameter&lt;/em&gt; tier (up to 8KB) or store the prompt in an S3 bucket and store the S3 URI in SSM.&lt;/li&gt;

&lt;/ul&gt;
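
&lt;p&gt;The in-memory caching fix from the first bullet is a few lines of module-level state. Because the cache lives outside the handler, it survives across warm invocations of the same Lambda execution environment (this sketch uses coarse 5-minute time buckets rather than exact TTLs):&lt;/p&gt;

```python
import time

_CACHE = {}
TTL_SECONDS = 300  # refetch from SSM at most once per 5-minute bucket

def _bucket(timestamp):
    # Timestamps in the same 300-second window share one cache entry.
    return int(timestamp // TTL_SECONDS)

def get_prompt(name, fetch, now=time.time):
    """Return the prompt from the in-memory cache, refetching once per TTL bucket."""
    entry = _CACHE.get(name)
    if entry and entry["bucket"] == _bucket(now()):
        return entry["value"]
    value = fetch(name)  # in production: the SSM get_parameter call
    _CACHE[name] = {"value": value, "bucket": _bucket(now())}
    return value
```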

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Generative AI is shifting from an experimental feature to a core architectural component of modern applications. If you wouldn't deploy database schema changes without testing and version control, you shouldn't deploy prompt changes without them either.&lt;/p&gt;

&lt;p&gt;By combining GitOps, AWS CodePipeline, and SSM Parameter Store, you bridge the gap between AI experimentation and reliable software engineering. &lt;/p&gt;




&lt;p&gt;&lt;em&gt;How does your team currently manage LLM prompts? Are they hardcoded, stored in a database, or managed via an external tool? Let's discuss in the comments.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>cicd</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Routing LLM Traffic on AWS: How to Build a Cost-Optimized Multi-Model API Router</title>
      <dc:creator>Dhananjay Lakkawar</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:05:06 +0000</pubDate>
      <link>https://forem.com/dhananjay_lakkawar/routing-llm-traffic-on-aws-how-to-build-a-cost-optimized-multi-model-api-router-1lmm</link>
      <guid>https://forem.com/dhananjay_lakkawar/routing-llm-traffic-on-aws-how-to-build-a-cost-optimized-multi-model-api-router-1lmm</guid>
      <description>&lt;p&gt;When engineering teams first integrate Generative AI into their products, they usually make a rational, but ultimately expensive, decision: &lt;strong&gt;they pick the smartest model available and send every single query to it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using Claude 3 Opus or GPT-4o for everything is the fastest way to get to market. But as your user base grows, your inference costs will scale linearly at best, and worse than linearly if your context windows are also expanding.&lt;/p&gt;

&lt;p&gt;The reality of production AI is this: &lt;strong&gt;You don't need a PhD-level reasoning engine to summarize a 3-paragraph email.&lt;/strong&gt; Claude 3 Haiku or Llama 3 can handle 80% of standard production workloads at a fraction of the cost and with much lower latency.&lt;/p&gt;

&lt;p&gt;To protect your startup's runway and optimize your cloud economics, you need to stop hardcoding a single LLM into your backend. Instead, you need to build a &lt;strong&gt;Multi-Model API Router&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how to architect a dynamic LLM router using Amazon API Gateway, AWS Lambda, and Amazon Bedrock to reduce your inference costs by up to 60%.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Concept: Dynamic Prompt Routing
&lt;/h2&gt;

&lt;p&gt;Think of an LLM router like an API load balancer, but instead of routing based on server capacity, it routes based on &lt;strong&gt;cognitive complexity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a prompt arrives, a lightweight heuristic evaluates the request. Simple tasks (summarization, formatting, basic entity extraction) slide down a "green pipe" to a fast, cheap model. Complex reasoning tasks (coding, deep analysis, complex multi-step logic) slide down a "purple pipe" to a high-end model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AWS Architecture
&lt;/h2&gt;

&lt;p&gt;We can build this entirely using primitives on AWS. Because &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; acts as a unified API for multiple foundation models, we don't have to manage different API keys or deal with diverse SDKs for Claude, Llama, or Mistral. Bedrock normalizes the invocation.&lt;/p&gt;

&lt;p&gt;Here is the underlying AWS infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0wgu16pxggh9x24duru.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0wgu16pxggh9x24duru.gif" alt="AWS multi-model router architecture" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Amazon API Gateway (The Entry Point)
&lt;/h3&gt;

&lt;p&gt;We use API Gateway to expose a unified REST or WebSocket API to our front end. The front end doesn't know &lt;em&gt;which&lt;/em&gt; model is being used; it simply sends the payload to &lt;code&gt;/api/v1/generate&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. AWS Lambda (The Routing Engine)
&lt;/h3&gt;

&lt;p&gt;This is where the brain of your application lives. The Lambda function receives the payload and applies a set of routing rules to determine the destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Amazon Bedrock (The Execution Layer)
&lt;/h3&gt;

&lt;p&gt;Once Lambda has made its routing decision, it uses the AWS SDK (&lt;code&gt;boto3&lt;/code&gt; in Python or the AWS SDK for JavaScript) to invoke the specific Bedrock model ARN.&lt;/p&gt;




&lt;h2&gt;
  
  
  3 Strategies for Building the Router Logic
&lt;/h2&gt;

&lt;p&gt;How exactly does the Lambda function know where to send the prompt? There are three ways to approach this, ranging from simple to advanced.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy A: Deterministic Heuristics (Fastest &amp;amp; Cheapest)
&lt;/h3&gt;

&lt;p&gt;You don't always need AI to route AI. You can use standard code logic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System Prompts:&lt;/strong&gt; If the user is hitting the "Summarize" button in your UI, your frontend passes a &lt;code&gt;task_type="summarize"&lt;/code&gt; flag. Lambda reads the flag and instantly routes to Haiku.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token Count:&lt;/strong&gt; If the prompt length is under 500 tokens, send it to a smaller model. If it's a massive 50k-token document, route it to a highly capable model with a larger context window, like Claude 3.5 Sonnet.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
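
&lt;p&gt;Strategy A is essentially a dictionary lookup. The task buckets and model IDs below are illustrative; unknown tasks fall through to the capable model, so a routing mistake degrades cost, not quality:&lt;/p&gt;

```python
# Hypothetical task buckets; swap in the Bedrock model IDs enabled in your account.
CHEAP_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
SMART_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

TASK_ROUTES = {
    "summarize": CHEAP_MODEL,
    "format":    CHEAP_MODEL,
    "extract":   CHEAP_MODEL,
    "code":      SMART_MODEL,
    "analyze":   SMART_MODEL,
}

def route(task_type):
    """Deterministic routing: known-simple tasks go cheap, everything else goes smart."""
    return TASK_ROUTES.get(task_type, SMART_MODEL)
```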

&lt;h3&gt;
  
  
  Strategy B: The "LLM-as-a-Judge" Router
&lt;/h3&gt;

&lt;p&gt;For unstructured user inputs (like a chatbot), use a fast, ultra-cheap model (like Haiku) to read the prompt and classify its intent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Prompt to Haiku:&lt;/em&gt; "Is the following user request a basic factual question (Return 1) or a complex reasoning task (Return 2)?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lambda reads the &lt;code&gt;1&lt;/code&gt; or &lt;code&gt;2&lt;/code&gt; and routes the &lt;em&gt;actual&lt;/em&gt; query accordingly. (Note: This adds a slight latency overhead, usually ~200-400ms).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Strategy C: The Cascading Fallback (Highest Reliability)
&lt;/h3&gt;

&lt;p&gt;If you want to maximize cost savings while guaranteeing high quality, you implement a &lt;strong&gt;Cascade&lt;/strong&gt;. You send the prompt to a cheap model first. If the cheap model fails, hallucinates, or outputs bad JSON, Lambda catches the error and retries with the expensive model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nkrarm9goe1f3u6umua.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nkrarm9goe1f3u6umua.gif" alt="Cascading fallback flow" width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
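
&lt;p&gt;A cascade is a loop with validation. In this sketch, &lt;code&gt;invoke&lt;/code&gt; stands in for your Bedrock call and the validator enforces a JSON contract, so a cheap model's malformed output triggers the retry on the stronger one:&lt;/p&gt;

```python
import json

class BadOutput(Exception):
    """Raised when a model response fails our structural checks."""

def validate(raw):
    """Accept only responses that parse as JSON with the required key."""
    data = json.loads(raw)            # raises ValueError on malformed JSON
    if "answer" not in data:
        raise BadOutput("missing required 'answer' key")
    return data

def cascade(prompt, invoke, chain=("haiku", "sonnet")):
    """Try the cheap model first; escalate only when its output fails validation."""
    last_error = None
    for model in chain:
        try:
            return validate(invoke(model, prompt))
        except (ValueError, BadOutput) as err:
            last_error = err
    raise RuntimeError(f"all models in the chain failed: {last_error}")
```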




&lt;h2&gt;
  
  
  The CTO Perspective: Tradeoffs to Consider
&lt;/h2&gt;

&lt;p&gt;As a technology strategist, I always emphasize that architectural decisions are about balancing tradeoffs. A Multi-Model Router is not a silver bullet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Latency vs. Cost&lt;/strong&gt; If you use LLM-based routing (Strategy B) or Cascading (Strategy C), you are introducing multiple network hops and inference cycles. For an internal tool or asynchronous data processing, this latency is fine. For a real-time conversational voice bot, adding 500ms of routing latency will ruin the user experience. Choose deterministic heuristics (Strategy A) for real-time apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Maintenance Complexity&lt;/strong&gt; Prompt engineering is hard enough for one model. When you route across three different models (e.g., Claude, Llama, and Amazon Titan), you must maintain different system prompts optimized for each model's specific quirks. Bedrock's &lt;em&gt;Converse API&lt;/em&gt; makes standardizing the payload easier, but the prompt wording still requires tuning per model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Build vs. Buy&lt;/strong&gt; There are specialized third-party tools (like Portkey or Langfuse) that handle LLM routing as a managed service. However, building this inside AWS via API Gateway and Lambda keeps your data entirely within your VPC and avoids adding another vendor to your billing stack. For most startups, a 150-line Lambda function is perfectly sufficient for the first year of scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Scaling an AI product doesn't mean your AWS bill has to scale at the exact same rate. By treating LLMs as interchangeable utility endpoints rather than monolithic brains, you can ruthlessly optimize your unit economics.&lt;/p&gt;

&lt;p&gt;Route the heavy lifting to the expensive models, let the cheap models handle the busywork, and let AWS handle the infrastructure.&lt;/p&gt;

&lt;p&gt;The full Lambda implementation, with both strategies, the fallback chain, and task-type buckets, is linked below: copy it, drop it into your Lambda function, wire up API Gateway, and you're routing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/lakkawardhananjay/1c6e63e7f0ce5b3c672bd88450ec058f" rel="noopener noreferrer"&gt;AWS Lamda Code&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;How is your team handling LLM costs in production? Are you defaulting to the largest models, or have you started implementing routing architectures? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>architecture</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Stop Overpaying for VectorDBs: Architecting Serverless RAG on AWS</title>
      <dc:creator>Dhananjay Lakkawar</dc:creator>
      <pubDate>Mon, 16 Mar 2026 13:26:09 +0000</pubDate>
      <link>https://forem.com/dhananjay_lakkawar/stop-overpaying-for-vectordbs-architecting-serverless-rag-on-aws-1pjf</link>
      <guid>https://forem.com/dhananjay_lakkawar/stop-overpaying-for-vectordbs-architecting-serverless-rag-on-aws-1pjf</guid>
      <description>&lt;p&gt;Building a Retrieval-Augmented Generation (RAG) prototype takes a weekend. Taking that prototype to production without burning through your infrastructure budget is a completely different engineering challenge.&lt;/p&gt;

&lt;p&gt;One of the most common pitfalls I see founders and engineering teams fall into is the &lt;strong&gt;Vector Database Cost Trap&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To get their MVP out the door, teams spin up provisioned vector databases or run dedicated EC2 instances 24/7. It works brilliantly for the first 100 users. But as you scale, or worse, when traffic is unpredictable, paying for idle compute to keep a vector index in memory becomes a massive drain on your runway.&lt;/p&gt;

&lt;p&gt;If you want to build a highly scalable AI product while protecting your startup's runway, you need to shift from provisioned infrastructure to an event-driven, serverless architecture.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Shift: Serverless RAG
&lt;/h3&gt;

&lt;p&gt;Traditional RAG architecture requires you to provision database nodes, manage cluster scaling, and pay for peak capacity even at 3 AM.&lt;/p&gt;

&lt;p&gt;By moving to a serverless model, we separate the storage of our vectors from the compute required to query them, and we rely on AWS to scale the ingestion and retrieval layers on demand.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Ingestion Pipeline
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Trigger (Amazon S3):&lt;/strong&gt; A new document (PDF, TXT, JSON) is dropped into an S3 bucket.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Compute (AWS Lambda):&lt;/strong&gt; An S3 event triggers a Lambda function to chunk the text.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Embedding (Amazon Bedrock):&lt;/strong&gt; Lambda calls Bedrock (e.g., Titan Embeddings) to convert text to vectors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Indexing (Amazon OpenSearch Serverless):&lt;/strong&gt; Lambda writes the vectors/metadata into an OpenSearch Serverless Vector Search collection.&lt;/li&gt;
&lt;/ul&gt;
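
&lt;p&gt;The only non-obvious code in that pipeline is the chunker; the rest is SDK plumbing. A sketch (the chunk sizes and the index field names are assumptions to tune for your corpus):&lt;/p&gt;

```python
def chunk_text(text, size=800, overlap=100):
    """Split a document into overlapping chunks so sentences survive the cut points."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

def handler(event, context):
    import boto3  # deferred import keeps chunk_text unit-testable without AWS
    # 1. Fetch the uploaded object named in the S3 event record.
    # 2. For each chunk, call bedrock-runtime invoke_model with
    #    modelId "amazon.titan-embed-text-v2:0" and body {"inputText": chunk}.
    # 3. Bulk-index {"embedding": vector, "text": chunk} into the
    #    OpenSearch Serverless collection (opensearch-py client).
```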

&lt;h4&gt;
  
  
  2. The Retrieval Flow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;User Query:&lt;/strong&gt; Arrives via API Gateway.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Embed Query:&lt;/strong&gt; Lambda calls Bedrock to embed the search string.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Similarity Search:&lt;/strong&gt; Lambda queries OpenSearch Serverless (k-NN) to find relevant chunks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Generation:&lt;/strong&gt; Lambda sends the context + prompt to an LLM (e.g., Claude 3.5 Sonnet) via Bedrock.&lt;/li&gt;
&lt;/ul&gt;
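
&lt;p&gt;Two small helpers cover the retrieval side: the k-NN search body and the grounded prompt. The vector field name &lt;code&gt;embedding&lt;/code&gt; is an assumption and must match whatever your ingestion Lambda indexes:&lt;/p&gt;

```python
def knn_query(vector, k=4):
    """Build the OpenSearch k-NN search body for the query embedding."""
    return {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}

def build_prompt(question, chunks):
    """Ground the LLM: it may answer only from the retrieved chunks."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```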




&lt;h3&gt;
  
  
  Why This Works for Startups
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Zero Infrastructure Management:&lt;/strong&gt; No patching nodes or managing shards.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Event-Driven:&lt;/strong&gt; The pipeline only runs when a document arrives. Zero ingestion = zero cost.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Decoupled Scaling:&lt;/strong&gt; If a user uploads 10,000 documents, Lambda fans out to process them concurrently without impacting search performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A CTO's Perspective: The Economics
&lt;/h3&gt;

&lt;p&gt;You could build your own vector index using &lt;code&gt;pgvector&lt;/code&gt; on RDS. If your dataset is tiny, that works. But if search latency and scale are critical, a dedicated vector engine is necessary.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;OpenSearch Serverless&lt;/strong&gt;, AWS recently lowered the minimum capacity to 0.5 OCUs (OpenSearch Compute Units). This brings the base cost of a highly available, scalable vector database down to a startup-friendly level, with the peace of mind that it will auto-scale if your app goes viral.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tradeoffs (Know Before You Build)
&lt;/h3&gt;

&lt;p&gt;As an architect, I don't believe in silver bullets. Design for these constraints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Cold Starts:&lt;/strong&gt; If your RAG app requires sub-second latency for the &lt;em&gt;first&lt;/em&gt; request after inactivity, you may need Lambda Provisioned Concurrency.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scaling Lag:&lt;/strong&gt; OpenSearch Serverless auto-scales, but it isn't instantaneous for massive, sudden spikes. Configure your max OCUs properly and load test your scaling behavior.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vendor Lock-in:&lt;/strong&gt; You are utilizing AWS primitives. However, because you are using standard frameworks (HTTP requests to Bedrock and standard OpenSearch APIs), migrating your application logic later is feasible.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;The era of overpaying for oversized, underutilized vector databases just to validate an AI product is over. By leveraging Amazon Bedrock, Lambda, and OpenSearch Serverless, you can build an enterprise-grade, event-driven AI architecture from Day 1.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I originally published this on my Hashnode blog: &lt;a href="https://genaiguru.hashnode.dev/stop-overpaying-for-vectordbs-architecting-serverless-rag-on-aws" rel="noopener noreferrer"&gt;Stop Overpaying for VectorDBs: Architecting Serverless RAG on AWS&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you made the switch to serverless vector databases yet? Let me know your experience with cold starts and latency in the comments!&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>startup</category>
      <category>rag</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
