<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: HatmanStack</title>
    <description>The latest articles on Forem by HatmanStack (@hatmanstack).</description>
    <link>https://forem.com/hatmanstack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3729182%2F011ba3e8-dcee-418f-a84e-f1890b44b575.png</url>
      <title>Forem: HatmanStack</title>
      <link>https://forem.com/hatmanstack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hatmanstack"/>
    <language>en</language>
    <item>
      <title>Multimodal Rerankers: The Fix for Object Storage RAG</title>
      <dc:creator>HatmanStack</dc:creator>
      <pubDate>Thu, 05 Mar 2026 21:09:47 +0000</pubDate>
      <link>https://forem.com/hatmanstack/multimodal-rerankers-the-fix-for-object-storage-rag-2662</link>
      <guid>https://forem.com/hatmanstack/multimodal-rerankers-the-fix-for-object-storage-rag-2662</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Filtered HNSW search on object storage has a precision problem that existing solutions can't touch. At small scale, an adaptive boost works. At large scale, multimodal cross-encoders that process images and text through joint cross-attention are the architecture that fixes this.&lt;/p&gt;

&lt;p&gt;I've been running &lt;a href="https://github.com/HatmanStack/RAGStack-Lambda" rel="noopener noreferrer"&gt;RAGStack-Lambda&lt;/a&gt; on S3 Vectors with a multimodal corpus, ~60% images with metadata. In my &lt;a href="https://dev.to/hatmanstack/why-filtered-queries-return-lower-relevancy-in-s3-vectors-and-what-to-do-about-it-2587"&gt;last post&lt;/a&gt;, I documented why filtered queries consistently return ~10% lower relevancy, sometimes surfacing the wrong results entirely. The root cause is HNSW graph disconnection from post-filtering compounded by quantization noise in smaller candidate pools.&lt;/p&gt;

&lt;p&gt;I solved it at my scale with an adaptive boost that keeps filtered results ~5% above unfiltered, scaling dynamically with how aggressively the filter shrinks the candidate pool. At ~1500 documents, that's enough. This post is about what comes next, not for me, but for anyone building multimodal RAG on object-storage vector databases at scale.&lt;/p&gt;
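&lt;p&gt;A minimal sketch of what such a boost can look like (the names and the exact scaling form below are illustrative assumptions, not the actual RAGStack-Lambda code):&lt;/p&gt;

```python
# Illustrative sketch of an adaptive boost (assumed names and scaling
# form; not the actual RAGStack-Lambda implementation).

def adaptive_boost(score: float, candidate_pool: int, corpus_size: int,
                   base_boost: float = 0.05) -> float:
    """Boost a filtered result's score in proportion to how aggressively
    the filter shrank the candidate pool."""
    selectivity = candidate_pool / corpus_size   # 1.0 means no shrinkage
    boost = base_boost * (2.0 - selectivity)     # grows as the pool shrinks
    return score * (1.0 + boost)

# A tight filter (150 of 1500 docs survive) earns a larger boost
# than a permissive one (1200 of 1500 survive).
tight = adaptive_boost(0.80, candidate_pool=150, corpus_size=1500)    # 0.876
loose = adaptive_boost(0.80, candidate_pool=1200, corpus_size=1500)   # 0.848
```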

&lt;h2&gt;
  
  
  &lt;strong&gt;The Object Storage Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ACORN (Approximate nearest neighbor Constraint-Optimized Retrieval Network) is the proven solution for filtered HNSW search. When a first-hop neighbor fails the filter, ACORN checks that neighbor's neighbors, a two-hop expansion that keeps the graph connected through filtered-out nodes. It's predicate-agnostic, meaning you don't need to know your filters at index time. Weaviate, Qdrant, and Elasticsearch have all adopted it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch&lt;/strong&gt;: ACORN assumes the graph is in memory. The two-hop expansion requires cheap random access to neighbor lists. S3 Vectors stores the graph on object storage, so every additional hop is an S3 read, and the added latency and cost would undermine the entire reason you're using S3 Vectors in the first place.&lt;/p&gt;

&lt;p&gt;Text rerankers don't help either. I tried oversampling at 3x and reranking with Cohere Rerank 3.5 via Bedrock. Results got worse: the reranker was evaluating metadata strings like &lt;code&gt;"people: judy wilson, topic: family_photos"&lt;/code&gt; rather than the natural-language passages cross-encoders are trained on. If your corpus is majority images, text rerankers can't see what matters.&lt;/p&gt;

&lt;p&gt;So the graph-level fix requires AWS to build something new. I've filed a &lt;a href="https://repost.aws/questions/QUjrm6KygfTBiwaaKpwqY_lQ/filtered-query-relevancy-degradation-in-s3-vectors-and-a-potential-architectural-fix" rel="noopener noreferrer"&gt;feature request&lt;/a&gt; for filter-aware traversal. But from the application layer, you need a different approach entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Multimodal Cross-Encoders: The Architecture That Fits&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Multimodal cross-attention has existed in research for years, but over the last few months a new class of production-ready rerankers has finally made the architecture viable for joint image-text processing at scale. It is architecturally different from both bi-encoders and text cross-encoders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3a3d84ptl6gsaone966y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3a3d84ptl6gsaone966y.jpeg" alt="Multimodal Architecture Diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bi-encoder (standard retrieval):&lt;/strong&gt; Images and text are embedded independently at ingestion time. At query time, you're comparing pre-computed vectors with a distance calculation. Fast, cheap, but the query never "sees" the image: it only sees where the image landed in vector space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal cross-encoder (reranking):&lt;/strong&gt; At query time, the candidate image is chunked into patches and passed through a vision encoder to produce visual tokens. The query becomes text tokens. Both are concatenated into a single sequence and fed through a Vision-Language Model with full cross-attention. The query tokens attend directly to spatial features in the image. An MLP head outputs a single relevance score.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical difference: the text of your query is directly interacting with the pixels of the candidate image. It's not comparing two independent embeddings: it's reasoning about whether this specific image is relevant to this specific query. This is the piece that was missing when I tried text rerankers.&lt;/p&gt;
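&lt;p&gt;The contrast between the two interfaces can be sketched in a few lines. The cross-encoder here is deliberately a stub: a real implementation would be a VLM forward pass, which is exactly the step the bi-encoder path never performs.&lt;/p&gt;

```python
import math

def bi_encoder_score(query_vec, image_vec):
    """Standard retrieval: cosine similarity between two embeddings that
    were computed independently. The query never sees the image itself,
    only where the image landed in vector space."""
    dot = sum(q * v for q, v in zip(query_vec, image_vec))
    norms = (math.sqrt(sum(q * q for q in query_vec))
             * math.sqrt(sum(v * v for v in image_vec)))
    return dot / norms

def cross_encoder_score(query_text, image_bytes):
    """Reranking: one joint forward pass in which query tokens attend to
    image patches and an MLP head emits a relevance score. Stubbed here;
    a real implementation is a VLM inference call."""
    raise NotImplementedError("replace with a multimodal reranker call")
```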

&lt;h2&gt;
  
  
  &lt;strong&gt;The Two-Stage Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77s0ln3xh4cvp7i9oxv9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77s0ln3xh4cvp7i9oxv9.jpeg" alt="RAG Architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1&lt;/strong&gt; stays cheap. S3 Vectors does approximate retrieval with all its existing limitations: graph holes, quantization noise, the works. But instead of needing Stage 1 to be precise, you just need it to get the right answer somewhere in a larger candidate set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2&lt;/strong&gt; is where precision happens. The multimodal cross-encoder actually looks at each candidate image alongside the query text. It doesn't care that S3 Vectors returned a visually similar but wrong photo: it can see the image and read the query, so it can reason about whether this is actually a photo of a rare blue bird in the jungle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3&lt;/strong&gt; is unchanged. Better candidates in, better generation out.&lt;/p&gt;

&lt;p&gt;This architecture decouples retrieval cost from retrieval precision. S3 Vectors gives you the 90% cost reduction on storage. The reranker handles the precision that the graph can't deliver under filtering. You stop asking the vector database to do something its storage medium won't allow.&lt;/p&gt;
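&lt;p&gt;The orchestration itself is small. Here &lt;code&gt;vector_search&lt;/code&gt; and &lt;code&gt;rerank_score&lt;/code&gt; are hypothetical stand-ins for an S3 Vectors query and a multimodal reranker call; only the pattern is the point.&lt;/p&gt;

```python
def two_stage_retrieve(query, vector_search, rerank_score, k=10, oversample=5):
    # Stage 1: cheap, imprecise ANN retrieval. The oversampled pool only
    # needs to contain the right answers somewhere.
    candidates = vector_search(query, top_k=k * oversample)
    # Stage 2: expensive, precise scoring of every candidate.
    scored = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return scored[:k]

# Toy demo: "documents" are ints and relevance is closeness to the query.
docs = list(range(100))
hits = two_stage_retrieve(
    5,
    vector_search=lambda q, top_k: docs[:top_k],  # stand-in for S3 Vectors
    rerank_score=lambda q, d: -abs(d - q),        # stand-in for the reranker
    k=3,
)
```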

&lt;h2&gt;
  
  
  &lt;strong&gt;The Cost Question&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A VLM forward pass per candidate is orders of magnitude more expensive than a cosine similarity calculation. If you're oversampling at 5x with a target k of 10, that's 50 VLM inference calls per query on a self-hosted GPU endpoint.&lt;/p&gt;

&lt;p&gt;At small scale, this doesn't make sense. The adaptive boost costs nothing and handles the problem well enough. You'd be adding a GPU endpoint to solve a problem that a multiplication operation already addresses.&lt;/p&gt;

&lt;p&gt;At large scale, the math inverts. The boost becomes unreliable, the failure modes become invisible, and the cost of wrong results in downstream generation (hallucinations, eroded user trust, bad decisions) exceeds the cost of a reranking endpoint. The GPU cost is also amortized differently at scale: a SageMaker endpoint running inference for thousands of queries per hour is a different proposition than one sitting idle for a dozen queries a day.&lt;/p&gt;

&lt;p&gt;The crossover point depends on corpus size, query volume, and tolerance for imprecision. But for anyone building multimodal RAG on object storage at enterprise scale, this architecture is where the industry is heading.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Available Now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three models have appeared in the last few months:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jina Reranker M0&lt;/strong&gt;: Built on Qwen-VL. Outputs a scalar relevance score from concatenated query and image/text documents. Open weights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama Nemotron-Rerank-VL&lt;/strong&gt;: Nvidia's cross-encoder optimized for reranking visual documents against text queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-VL Reranker&lt;/strong&gt;: Open-weight model tailored for vision-language reranking pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None are available through Bedrock yet. They're all self-hosted, which means adding a SageMaker endpoint or GPU-backed instance. Training a custom multimodal reranker for a specific domain is also viable: fine-tune a lightweight VLM with contrastive loss on positive and hard-negative query-image-text pairs from your own corpus.&lt;/p&gt;
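&lt;p&gt;As a sketch of that fine-tuning objective, here is an InfoNCE-style contrastive loss over one positive and a set of hard negatives. The exact loss form is an assumption (the post above doesn't prescribe one), and the scores would come from the VLM being tuned:&lt;/p&gt;

```python
import math

def contrastive_loss(pos_score, neg_scores, temperature=0.05):
    """InfoNCE-style objective: -log softmax of the positive pair's score
    against the hard negatives'. Minimizing it pushes the positive's
    score above every negative's."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)                                   # for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```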

&lt;h2&gt;
  
  
  &lt;strong&gt;Where This Leaves Us&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The filtered search problem on object storage has three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Graph connectivity&lt;/em&gt;: needs an infrastructure fix from AWS. ACORN's approach doesn't transfer to object storage without adaptation. I've filed the &lt;a href="https://repost.aws/questions/QUjrm6KygfTBiwaaKpwqY_lQ/filtered-query-relevancy-degradation-in-s3-vectors-and-a-potential-architectural-fix" rel="noopener noreferrer"&gt;feature request&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Score calibration&lt;/em&gt;: the adaptive boost handles this now. It keeps filtered results surfacing above unfiltered regardless of selectivity. At small to medium scale, this is the right answer.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Relevance evaluation&lt;/em&gt;: multimodal cross-encoders are the first architecture that can actually determine whether an image is relevant to a query, not just whether its vector is close. This is the layer that matters at scale, and the models just arrived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running multimodal RAG on S3 Vectors today, the adaptive boost is probably sufficient. If you're planning for millions of vectors with filtered search across images and text, the two-stage architecture with a multimodal reranker is the path forward. The pieces exist now: they just haven't been assembled for this specific problem yet.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Deploy a Gradio App on AWS — Two Approaches Compared</title>
      <dc:creator>HatmanStack</dc:creator>
      <pubDate>Fri, 06 Feb 2026 21:08:27 +0000</pubDate>
      <link>https://forem.com/hatmanstack/how-to-deploy-a-gradio-app-on-aws-two-approaches-compared-410n</link>
      <guid>https://forem.com/hatmanstack/how-to-deploy-a-gradio-app-on-aws-two-approaches-compared-410n</guid>
      <description>&lt;p&gt;Gradio makes it easy to build ML demo interfaces, but deploying them to production is another story. Hosting platforms like HuggingFace Spaces work for prototyping, but when builds start failing due to dependency drift and you need reliability, you need your own infrastructure.&lt;/p&gt;

&lt;p&gt;In this tutorial, you'll learn two ways to deploy a Gradio application on AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App Runner&lt;/strong&gt; — an always-on managed service ($0.12/day)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda with container images&lt;/strong&gt; — a serverless, pay-per-use approach (pennies per invocation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both approaches use real configuration files from a working deployment. By the end, you'll understand the cost and architectural tradeoffs well enough to choose the right one for your project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;A working Gradio application&lt;/li&gt;
&lt;li&gt;Basic familiarity with Docker and AWS services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 1: AWS App Runner
&lt;/h2&gt;

&lt;p&gt;App Runner is a managed service for web applications and containers. You point it at a repository or container registry, and it handles scaling, load balancing, and TLS. Most of the configuration lives in an &lt;code&gt;apprunner.yaml&lt;/code&gt; file in your repo's root directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python312&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pre-run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Installing dependencies..."&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pip3 install --upgrade pip&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pip3 install -r requirements.txt&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 app.py&lt;/span&gt;
    &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7860&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;APP_PORT&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GRADIO_SERVER_NAME&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GRADIO_SERVER_PORT&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7860"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RATE_LIMIT&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20"&lt;/span&gt;
    &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MY_SECRET&lt;/span&gt;
        &lt;span class="na"&gt;value-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:secretsmanager:us-west-2:&amp;lt;your-secret-arn&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing about your Gradio code changes. The custom configuration lets you specify Python 3.12 and other settings not available through the App Runner console.&lt;/p&gt;

&lt;h2&gt;
  
  
  App Runner Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;You're billed $0.0064 per vCPU-hour and $0.007 per GB-hour. You can scale down to 0.25 vCPU and 0.5 GB of memory, which works out to roughly $0.12 per day for an always-on service that auto-scales under load.&lt;/p&gt;
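&lt;p&gt;The arithmetic behind that figure:&lt;/p&gt;

```python
vcpu_rate, mem_rate = 0.0064, 0.007               # $/vCPU-hour, $/GB-hour
daily = (0.25 * vcpu_rate + 0.5 * mem_rate) * 24  # smallest instance size
print(f"${daily:.3f}/day")                        # $0.122/day
```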

&lt;p&gt;One thing to remember: grant the Instance Security Role permissions to communicate with other AWS services. If your Gradio app calls Bedrock, Secrets Manager, or S3, you need to add those permissions to the container's security role — not just the deployment role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 2: AWS Lambda with Container Images
&lt;/h2&gt;

&lt;p&gt;While App Runner's always-on cost is reasonable, Lambda is a better fit for applications with bursty or infrequent traffic. But Lambda has a 250 MB size limit for the deployment package and layers combined, and aggressively trimming Gradio's dependencies to fit a zip deployment isn't practical.&lt;/p&gt;

&lt;p&gt;Instead, you can use a container image with the &lt;a href="https://github.com/awslabs/aws-lambda-web-adapter" rel="noopener noreferrer"&gt;AWS Lambda Web Adapter&lt;/a&gt;, which lets Lambda run any HTTP application — including Gradio.&lt;/p&gt;

&lt;p&gt;You need two files in your repo: a Dockerfile and a buildspec.yaml for CodeBuild.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;  The Dockerfile

  FROM public.ecr.aws/docker/library/python:3.12-slim

  WORKDIR /app

  COPY requirements.txt .
  RUN pip install --no-cache-dir --upgrade pip
  RUN pip install --no-cache-dir -r requirements.txt

  COPY . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then append the Lambda Web Adapter, which translates Lambda invocations into HTTP requests, to the end of the Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="s"&gt;COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.9.0 /lambda-adapter /opt/extensions/lambda-adapter&lt;/span&gt;

  &lt;span class="s"&gt;CMD ["python3", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line is the COPY --from that pulls in the Lambda Web Adapter. This adapter sits between Lambda's invocation model and your HTTP application, translating Lambda events into standard HTTP requests that Gradio understands.&lt;/p&gt;

&lt;p&gt;The CodeBuild Spec&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;

  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AWS_REGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
      &lt;span class="na"&gt;AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-account-id&amp;gt;"&lt;/span&gt;
      &lt;span class="na"&gt;IMAGE_REPO_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production/gradio-demo"&lt;/span&gt;
      &lt;span class="na"&gt;IMAGE_TAG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest"&lt;/span&gt;

  &lt;span class="na"&gt;phases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pre_build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo Logging in to Amazon ECR...&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com&lt;/span&gt;

    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;IMAGE_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker build -t $IMAGE_URI .&lt;/span&gt;

    &lt;span class="na"&gt;post_build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker push $IMAGE_URI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before running the build, create a repository in ECR to store the container image. In CodeBuild, create a new project using an S3 bucket or GitHub as the source, and select an EC2 compute environment (not Lambda compute — Lambda build containers don't include Docker).&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Gradio for Lambda
&lt;/h2&gt;

&lt;p&gt;One critical change: your Gradio app must listen on 0.0.0.0 port 8080 so the Lambda Web Adapter can route traffic to it. Update your launch call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  server_port &lt;span class="o"&gt;=&lt;/span&gt; int&lt;span class="o"&gt;(&lt;/span&gt;os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"AWS_LAMBDA_HTTP_PORT"&lt;/span&gt;, 8080&lt;span class="o"&gt;))&lt;/span&gt;
  demo.launch&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;server_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;, &lt;span class="nv"&gt;server_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;server_port&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying the Lambda Function
&lt;/h2&gt;

&lt;p&gt;In Lambda, create a new function using the container image approach. Select your ECR image and enable a Function URL — that's all you need to get the Gradio app accessible over HTTPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lambda Cost Optimization
&lt;/h2&gt;

&lt;p&gt;AWS recommends running container images with 2048 or 4096 MB of memory, but Gradio typically consumes 125–300 MB during operation. Setting the Lambda function to 512 MB works well and provides a buffer.&lt;/p&gt;

&lt;p&gt;Here's how the costs compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
  ┌────────────────────────────────────────────┬────────────────────┐
  │               Configuration                │        Cost        │
  ├────────────────────────────────────────────┼────────────────────┤
  │ 4096 MB, always-on (EventBridge keep-warm) │ ~$5.76/day         │
  ├────────────────────────────────────────────┼────────────────────┤
  │ 512 MB, always-on (EventBridge keep-warm)  │ ~$0.71/day         │
  ├────────────────────────────────────────────┼────────────────────┤
  │ 512 MB, on-demand (cold starts)            │ ~$0.002/invocation │
  └────────────────────────────────────────────┴────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
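&lt;p&gt;The always-on rows can be reproduced from Lambda's per-GB-second price (the rate below assumes the common x86 on-demand price of about $0.0000166667 per GB-second):&lt;/p&gt;

```python
gb_second = 0.0000166667            # assumed x86 on-demand rate, $/GB-s
day_4096 = 4.0 * gb_second * 86400  # ~5.76, matching the 4096 MB row
day_512 = 0.5 * gb_second * 86400   # ~0.72, within rounding of the 512 MB row
```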



&lt;p&gt;The tradeoff with on-demand is cold starts — a few extra seconds on the first request. But once the Gradio frontend loads, successive calls just grab container role credentials and are fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Cold Starts
&lt;/h2&gt;

&lt;p&gt;The Lambda Web Adapter checks your app's readiness by polling &lt;code&gt;/&lt;/code&gt; (the root path). Gradio and FastAPI provide a dedicated health endpoint at &lt;code&gt;/healthz&lt;/code&gt; that responds faster during startup. Set this in your Lambda environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;AWS_LWA_READINESS_CHECK_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/healthz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduces the chance of the adapter timing out before your app finishes initializing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Should You Choose?
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
  ┌──────────────────┬────────────────┬───────────────────────────────────┐
  │      Factor      │   App Runner   │              Lambda               │
  ├──────────────────┼────────────────┼───────────────────────────────────┤
  │ Cost (always-on) │ ~$0.12/day     │ ~$0.71/day (512 MB)               │
  ├──────────────────┼────────────────┼───────────────────────────────────┤
  │ Cost (on-demand) │ Not applicable │ Pennies per invocation            │
  ├──────────────────┼────────────────┼───────────────────────────────────┤
  │ Cold starts      │ None           │ A few seconds                     │
  ├──────────────────┼────────────────┼───────────────────────────────────┤
  │ Scaling          │ Automatic      │ Automatic                         │
  ├──────────────────┼────────────────┼───────────────────────────────────┤
  │ Setup complexity │ Lower          │ Higher (Docker + CodeBuild + ECR) │
  └──────────────────┴────────────────┴───────────────────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choose App Runner if you want simplicity, consistent response times, and your app gets steady traffic.&lt;/p&gt;

&lt;p&gt;Choose Lambda if your traffic is bursty, you want to minimize costs during idle periods, or you're already invested in the serverless ecosystem.&lt;/p&gt;

&lt;p&gt;Either service works well for Gradio deployments. The right choice depends on your traffic pattern and how much you care about cold starts versus idle costs.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Connect Google Forms to Snowflake Using Cloud Run</title>
      <dc:creator>HatmanStack</dc:creator>
      <pubDate>Fri, 06 Feb 2026 20:58:25 +0000</pubDate>
      <link>https://forem.com/hatmanstack/how-to-connect-google-forms-to-snowflake-using-cloud-run-5112</link>
      <guid>https://forem.com/hatmanstack/how-to-connect-google-forms-to-snowflake-using-cloud-run-5112</guid>
      <description>&lt;p&gt;Every major cloud provider has tools to collect survey data within their own ecosystem. But what if you need form responses to land in a data warehouse on a different platform? That takes a little&lt;br&gt;
  wiring.&lt;/p&gt;

&lt;p&gt;In this tutorial, you'll build a pipeline that automatically sends Google Forms responses to a Snowflake table. The architecture uses four components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Forms&lt;/strong&gt; — collects the data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apps Script&lt;/strong&gt; — triggers on each form submission&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Run&lt;/strong&gt; — runs a Node.js service that connects to Snowflake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake Node Connector&lt;/strong&gt; — inserts the data using parameterized queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, every form submission will automatically appear in your Snowflake warehouse within a couple of minutes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A Google Cloud account with billing enabled&lt;/li&gt;
&lt;li&gt;A Snowflake account (a free trial works)&lt;/li&gt;
&lt;li&gt;Basic familiarity with Node.js and Docker&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 1: Set Up the Google Form and Apps Script
&lt;/h2&gt;

&lt;p&gt;Create a Google Form and navigate to the &lt;strong&gt;Responses&lt;/strong&gt; tab. Click the Google Sheets icon to create a linked spreadsheet — this is where form responses land before we forward them to Snowflake.&lt;/p&gt;

&lt;p&gt;Open the linked spreadsheet, then go to &lt;strong&gt;Extensions → Apps Script&lt;/strong&gt;. Create a function that runs on each form submission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;myFunction&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;SpreadsheetApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getActiveSpreadsheet&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSheets&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;A2:E&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;range&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;ascending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// We'll fill this in after deploying Cloud Run&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;contentType&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;headers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X-PW-AccessToken&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;TOKEN&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X-PW-Application&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;developer_api&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X-PW-UserEmail&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;YOUR_EMAIL&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;UrlFetchApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script sorts entries by timestamp (newest first) so the latest submission always sits in row 2, the row the connector reads. It then makes an HTTP request to our Cloud Run service. Leave the &lt;code&gt;url&lt;/code&gt; empty for now; we'll fill it in after deploying the connector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Create a Google Cloud Service Account
&lt;/h2&gt;

&lt;p&gt;Service accounts let applications authenticate with Google APIs without using a personal login. Our Cloud Run service needs one to read from the Google Sheet.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In Google Cloud Console, go to IAM &amp;amp; Admin → Service Accounts → Create Service Account&lt;/li&gt;
&lt;li&gt;Save the generated email address — you'll need it later to share the Google Sheet&lt;/li&gt;
&lt;li&gt;Go to the Keys tab → Add Key → JSON. This downloads a credential file to your machine&lt;/li&gt;
&lt;li&gt;Enable the Google Sheets API in your project (APIs &amp;amp; Services → Enable APIs)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Keep the downloaded JSON file — we'll include it in our Cloud Run deployment as creds.json.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Prepare Your Snowflake Table
&lt;/h2&gt;

&lt;p&gt;Before the connector can insert data, the destination table must exist. In Snowflake, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;DEMO_DB&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;DEMO_DB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;PUBLIC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;DEMO_DB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;PUBLIC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHEETS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;NAME&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DAYS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DIET&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PAY&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the column names and types to match your form's fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Build the Node.js Connector
&lt;/h2&gt;

&lt;p&gt;The connector is a small Express server that reads the latest form entry from Google Sheets and inserts it into Snowflake. Create three files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;google&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googleapis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sheets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;v4&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;snow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snowflake-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getInvite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleAuth&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;keyFile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;creds.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.googleapis.com/auth/spreadsheets&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;options&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spreadsheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;spreadsheetId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;your-sheets-id&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// From the sheet URL&lt;/span&gt;
      &lt;span class="na"&gt;range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;A2:E2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;row&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No data found.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;snow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createConnection&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;locator&amp;gt;.&amp;lt;cloud-provider&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g., xh45729.us-east-2.aws&lt;/span&gt;
      &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;your-username&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;your-password&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;warehouse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;COMPUTE_WH&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DEMO_DB&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PUBLIC&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ACCOUNTADMIN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Parameterized query prevents SQL injection — never concatenate&lt;/span&gt;
    &lt;span class="c1"&gt;// user-supplied values directly into SQL strings&lt;/span&gt;
    &lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;sqlText&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INSERT INTO DEMO_DB.PUBLIC.SHEETS (TS, NAME, DAYS, DIET, PAY) VALUES (?, ?, ?, ?, ?)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Listening on port &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;getInvite&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Adding Data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your Google Sheet ID is the long string in the sheet's URL between /d/ and /edit:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://docs.google.com/spreadsheets/d/&amp;lt;sheets-id&amp;gt;/edit#gid=0&lt;/code&gt;&lt;/p&gt;
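&lt;p&gt;If you want to pull the ID out programmatically instead of copying it by hand, a small sketch in plain Node (the URL below is a made-up example, not a real sheet):&lt;/p&gt;

```javascript
// Extract the spreadsheet ID from a Google Sheets URL.
// The ID is the segment between /d/ and the next slash.
function extractSheetId(url) {
  const match = url.match(/\/d\/([a-zA-Z0-9_-]+)/);
  return match ? match[1] : null;
}

// Hypothetical example URL
const url = 'https://docs.google.com/spreadsheets/d/1AbC-dEf_123/edit#gid=0';
console.log(extractSheetId(url)); // 1AbC-dEf_123
```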

&lt;p&gt;The Snowflake account identifier (&lt;code&gt;locator.region.cloud&lt;/code&gt;) appears in the bottom-left of your Snowflake console and in the login URL. For example: &lt;code&gt;xh45729.us-east-2.aws&lt;/code&gt;. Check the &lt;a href="https://docs.snowflake.com/en/user-guide/admin-account-identifier" rel="noopener noreferrer"&gt;Snowflake account identifier docs&lt;/a&gt; if you're unsure; the format varies by deployment.&lt;/p&gt;

&lt;p&gt;Notice the parameterized query with ? placeholders and the binds array. This is important: parameterized queries prevent SQL injection by letting the database driver handle escaping. Never build SQL strings by concatenating user input directly.&lt;br&gt;
&lt;/p&gt;
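&lt;p&gt;A quick sanity check, sketched in plain Node rather than the snowflake-sdk itself (the sample values are made up): every &lt;code&gt;?&lt;/code&gt; in the statement should have exactly one entry in &lt;code&gt;binds&lt;/code&gt;:&lt;/p&gt;

```javascript
// Sketch: verify a parameterized statement before handing it to the driver.
// The driver substitutes each ? with the matching bind value, escaping it safely.
function checkStatement(stmt) {
  const placeholders = (stmt.sqlText.match(/\?/g) || []).length;
  if (placeholders !== stmt.binds.length) {
    throw new Error('placeholder count does not match bind count');
  }
  return true;
}

// Hypothetical form row
const stmt = {
  sqlText: 'INSERT INTO DEMO_DB.PUBLIC.SHEETS (TS, NAME, DAYS, DIET, PAY) VALUES (?, ?, ?, ?, ?)',
  binds: ['3/5/2026 21:09:47', 'Judy Wilson', '3', 'vegetarian', 'card']
};
console.log(checkStatement(stmt)); // true
```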

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;  Dockerfile

  FROM node:20-slim

  WORKDIR /usr/src/app

  &lt;span class="c"&gt;# Copy dependency manifests first — Docker caches this layer&lt;/span&gt;
  &lt;span class="c"&gt;# so npm install only re-runs when dependencies change&lt;/span&gt;
  COPY package*.json ./

  RUN npm install --production

  COPY . .

  EXPOSE 8080
  CMD ["node", "index.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="kr"&gt;package&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node-sheets-to-snow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;version&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;description&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Google Sheets to Snowflake connector&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;main&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scripts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;start&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;engines&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;=20.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dependencies&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;^4.21.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;googleapis&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;^144.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;snowflake-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;^2.0.2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Deploy to Google Cloud Run
&lt;/h2&gt;

&lt;p&gt;Place all four files (index.js, Dockerfile, package.json, creds.json) in a directory. Open Cloud Shell from the Google Cloud Console (the terminal icon in the top right), upload the files, and deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nb"&gt;cd &lt;/span&gt;your-project-directory
  gcloud run deploy &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud Run will ask for a service name and region. It builds the container, deploys it, and returns a URL. Copy that URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Wire Everything Together
&lt;/h2&gt;

&lt;p&gt;Two final connections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Share the Google Sheet with your service account. Click the Share button on the spreadsheet and add the service account email from Step 2. This gives the Cloud Run service permission to read
form responses.&lt;/li&gt;
&lt;li&gt;Add the Cloud Run URL to your Apps Script. Go back to Extensions → Apps Script and paste the URL into the url variable in your function.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Testing It
&lt;/h1&gt;

&lt;p&gt;Submit a response through your Google Form. Within one to two minutes, the data should appear in your Snowflake table. The slight delay comes from Cloud Run's cold start — the container spins down when idle and takes a moment to restart on the first request.&lt;/p&gt;

&lt;p&gt;You can verify by running in Snowflake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  SELECT &lt;span class="k"&gt;*&lt;/span&gt; FROM DEMO_DB.PUBLIC.SHEETS ORDER BY TS DESC LIMIT 5&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete source code is available on &lt;a href="https://github.com/HatmanStack/snow-node-sheets-gpc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why Filtered Queries Return Lower Relevancy in S3 Vectors (And What To Do About It)</title>
      <dc:creator>HatmanStack</dc:creator>
      <pubDate>Fri, 30 Jan 2026 00:15:03 +0000</pubDate>
      <link>https://forem.com/hatmanstack/why-filtered-queries-return-lower-relevancy-in-s3-vectors-and-what-to-do-about-it-2587</link>
      <guid>https://forem.com/hatmanstack/why-filtered-queries-return-lower-relevancy-in-s3-vectors-and-what-to-do-about-it-2587</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The ~10% relevancy drop on filtered S3 Vector queries isn't a bug — it's quantization noise plus graph disconnection from post-filtering. Boost or re-rank to fix it.&lt;/p&gt;




&lt;p&gt;I've been running &lt;a href="https://github.com/HatmanStack/RAGStack-Lambda" rel="noopener noreferrer"&gt;RAGStack-Lambda&lt;/a&gt; with ~1500 documents in a knowledge base. After revamping my metadata for S3 Vectors, something weird happened: filtered queries started returning the wrong results. I'd search for a specific person with explicit filters and get back a picture of a &lt;em&gt;different&lt;/em&gt; person. The visual similarity was overpowering my metadata filters.&lt;/p&gt;

&lt;p&gt;After digging in, I found filtered results consistently score ~10% lower in relevancy than unfiltered queries — even for the same content. This isn't a bug. It's a predictable consequence of how S3 Vectors is architected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-Off You're Making
&lt;/h2&gt;

&lt;p&gt;S3 Vectors can cut your vector database costs by 90%. A billion vectors runs ~$46/month versus $660+ on Pinecone. The catch? You're trading precision for price.&lt;/p&gt;

&lt;p&gt;Two mechanisms cause the relevancy drop:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Quantization Noise
&lt;/h3&gt;

&lt;p&gt;S3 Vectors uses aggressive 4-bit Product Quantization to compress vectors — shrinking them by 64x so they can live on object storage instead of RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unfiltered search:&lt;/strong&gt; With millions of candidates, the sheer volume drowns out the approximation error. Strong matches still surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filtered search:&lt;/strong&gt; Your candidate pool shrinks. The algorithm evaluates vectors that are further away in the space. Suddenly that quantization error is a significant chunk of your distance calculation. The ~10% drop corresponds to the noise floor of 4-bit quantization.&lt;/p&gt;
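&lt;p&gt;A toy sketch in plain Node (illustrative numbers, scalar rather than product quantization) of that noise floor: snap each component to one of 16 levels, i.e. 4 bits, and compare distances:&lt;/p&gt;

```javascript
// Toy 4-bit scalar quantization: snap each component to one of 16 levels in [0, 1].
// (S3 Vectors' actual scheme is product quantization; this just shows the noise floor.)
function quantize(v) {
  return v.map(function (x) { return Math.round(x * 15) / 15; });
}

function dist(a, b) {
  let s = 0;
  for (let i = 0; i !== a.length; i += 1) {
    s += (a[i] - b[i]) * (a[i] - b[i]);
  }
  return Math.sqrt(s);
}

const query = [0.12, 0.83, 0.40, 0.55];
const doc = [0.10, 0.80, 0.44, 0.52];

const exact = dist(query, doc);
const approx = dist(quantize(query), quantize(doc));
console.log('exact:', exact.toFixed(4), 'quantized:', approx.toFixed(4));
```

&lt;p&gt;For close neighbors the quantized distance visibly diverges from the exact one; with a deep candidate pool that error averages out, with a filtered pool it doesn't.&lt;/p&gt;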

&lt;h3&gt;
  
  
  2. The Disconnected Graph Problem
&lt;/h3&gt;

&lt;p&gt;S3 Vectors uses HNSW (Hierarchical Navigable Small World) — a graph where vectors connect to their neighbors. Search works by traversing edges to find the nearest match.&lt;/p&gt;

&lt;p&gt;When you filter, you're turning off nodes. Remove 90% of vectors and you create holes in the graph. The search algorithm gets trapped — the "bridge" edges to better regions have been filtered out. It settles for local minima instead of finding your actual best match.&lt;/p&gt;

&lt;p&gt;This is why I was getting the wrong person's photo. Visually similar, passed the traversal, but wrong.&lt;/p&gt;
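&lt;p&gt;A toy sketch in plain Node (made-up scores and tags) of the post-filtering failure mode: the ANN stage picks its candidates before the filter runs, so strong matches that satisfy your filter can be crowded out of the pool entirely:&lt;/p&gt;

```javascript
// Toy corpus: score is vector similarity, tag is the metadata we filter on.
const corpus = [
  { id: 'a', score: 0.95, tag: 'other' },
  { id: 'b', score: 0.93, tag: 'other' },
  { id: 'c', score: 0.91, tag: 'judy' },
  { id: 'd', score: 0.62, tag: 'judy' },
  { id: 'e', score: 0.90, tag: 'judy' }
];

// Post-filtering: take the ANN top-annK FIRST, then apply the metadata filter.
function postFilter(docs, tag, annK) {
  const sorted = docs.slice().sort(function (x, y) { return y.score - x.score; });
  const annTop = sorted.slice(0, annK);
  return annTop.filter(function (d) { return d.tag === tag; });
}

// With annK = 3 only 'c' survives; 'e' (0.90) never entered the candidate pool.
console.log(postFilter(corpus, 'judy', 3).map(function (d) { return d.id; })); // [ 'c' ]
```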

&lt;h2&gt;
  
  
  The Fix (What I Thought)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short-term&lt;/strong&gt;: A 1.25x boost for filtered results normalized my scores. Crude but effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term&lt;/strong&gt;: Re-ranking. Oversample (request 3-5x your target k), then re-rank with a cross-encoder or Bedrock’s rerank API. Use S3 Vectors for cheap retrieval and smarter compute for precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix (What Actually Worked)
&lt;/h2&gt;

&lt;p&gt;I spent time implementing the “&lt;em&gt;sophisticated&lt;/em&gt;” solution: oversample filtered results at 3x, rerank with Cohere Rerank 3.5 via Bedrock, merge the slices fairly.&lt;/p&gt;

&lt;p&gt;Results got worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: rerankers are designed for text documents. My knowledge base is ~60% images with metadata. The reranker was evaluating synthesized text like “people: judy wilson, topic: family_photos” — not what cross-encoders are optimized for.&lt;/p&gt;

&lt;p&gt;Meanwhile, the raw vector similarity scores from visual embeddings were actually good relevance signals. I was replacing useful information with noise.&lt;/p&gt;

&lt;p&gt;I tried reranking both filtered and unfiltered slices. I tried reranking only filtered. I tried dropping visual-only results from unfiltered. Each “&lt;em&gt;improvement&lt;/em&gt;” made things worse or traded one problem for another.&lt;/p&gt;

&lt;p&gt;The 1.25x boost I dismissed as “&lt;em&gt;crude&lt;/em&gt;”? It’s still running in production.&lt;/p&gt;
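&lt;p&gt;For the curious, the whole fix fits in a few lines. A sketch in plain Node (the 1.25 factor is what worked for my corpus; treat it as a starting point, not a constant):&lt;/p&gt;

```javascript
// Multiply filtered-query scores by a constant so they merge fairly
// with unfiltered scores, then take the overall top-k.
const FILTER_BOOST = 1.25;

function mergeSlices(filtered, unfiltered, k) {
  const boosted = filtered.map(function (r) {
    return { id: r.id, score: r.score * FILTER_BOOST };
  });
  const all = boosted.concat(unfiltered);
  all.sort(function (x, y) { return y.score - x.score; });
  return all.slice(0, k);
}

// Made-up scores for illustration
const filtered = [{ id: 'f1', score: 0.70 }, { id: 'f2', score: 0.64 }];
const unfiltered = [{ id: 'u1', score: 0.82 }, { id: 'u2', score: 0.79 }];
console.log(mergeSlices(filtered, unfiltered, 3));
```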

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>RAGStack-Lambda: Scale-to-Zero RAG with Multimodal Search</title>
      <dc:creator>HatmanStack</dc:creator>
      <pubDate>Fri, 23 Jan 2026 20:48:40 +0000</pubDate>
      <link>https://forem.com/hatmanstack/ragstack-lambda-scale-to-zero-rag-with-multimodal-search-d0j</link>
      <guid>https://forem.com/hatmanstack/ragstack-lambda-scale-to-zero-rag-with-multimodal-search-d0j</guid>
      <description>&lt;p&gt;Most RAG architectures charge you $300+/month for vector databases that run whether you're querying or not. RAGStack-Lambda scales to zero. $7-10/month for 1,000 documents.&lt;/p&gt;

&lt;p&gt;The trick is S3 Vectors + Lambda + Bedrock. You trade sub-50ms latency for hundreds of milliseconds. For chat interfaces and document Q&amp;amp;A, that's fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Text Search
&lt;/h2&gt;

&lt;p&gt;Amazon Nova embeddings put text, images, and video frames in the same vector space. Upload a photo, search with natural language, get semantically relevant results.&lt;/p&gt;

&lt;p&gt;For video: frames get visual embeddings and audio gets transcribed into 30-second chunks with speaker identification. Every chunk carries timestamp metadata. Query by what's said or what's shown — citations link directly to that segment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smarter Retrieval
&lt;/h2&gt;

&lt;p&gt;RAGStack doesn't just embed your content. It analyzes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata extraction&lt;/strong&gt; examines each document and pulls structured fields automatically — topic, document type, date range, whatever's relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filter generation&lt;/strong&gt; samples your knowledge base and creates few-shot examples based on what it finds. No manual curation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-slice queries&lt;/strong&gt; run parallel retrievals using those generated filters. Instead of one broad search, you get multiple targeted queries returning more relevant results.&lt;/p&gt;
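&lt;p&gt;The multi-slice pattern, sketched in plain Node with a stubbed query function standing in for the real retrieval call (filter names and scores are made up):&lt;/p&gt;

```javascript
// Run one retrieval per generated filter in parallel, then merge by score.
// queryFn is a stand-in for the real vector-store call.
async function multiSlice(queryFn, filters, k) {
  const slices = await Promise.all(filters.map(function (f) { return queryFn(f); }));
  const merged = [].concat.apply([], slices);
  merged.sort(function (x, y) { return y.score - x.score; });
  return merged.slice(0, k);
}

// Stubbed retrieval: each filter returns its own candidates.
function fakeQuery(filter) {
  const data = {
    topic: [{ id: 't1', score: 0.9 }],
    date: [{ id: 'd1', score: 0.8 }, { id: 'd2', score: 0.7 }]
  };
  return Promise.resolve(data[filter] || []);
}

multiSlice(fakeQuery, ['topic', 'date'], 2).then(function (top) {
  console.log(top.map(function (r) { return r.id; })); // [ 't1', 'd1' ]
});
```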

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;One-click AWS Marketplace deployment&lt;/li&gt;
&lt;li&gt;Framework-agnostic web component (one script tag)&lt;/li&gt;
&lt;li&gt;MCP server for Claude Desktop, Cursor, VS Code&lt;/li&gt;
&lt;li&gt;Everything runs in your account — no external control plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/HatmanStack/RAGStack-Lambda" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="//dhrmkxyt1t9pb.cloudfront.net"&gt;Demo&lt;/a&gt; | &lt;a href="https://portfolio.hatstack.fun/read/post/RAGStack-Lambda" rel="noopener noreferrer"&gt;Blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Login: &lt;a href="mailto:guest@hatstack.fun"&gt;guest@hatstack.fun&lt;/a&gt; / Guest@123&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rag</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
