<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aniket Hingane</title>
    <description>The latest articles on Forem by Aniket Hingane (@exploredataaiml).</description>
    <link>https://forem.com/exploredataaiml</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F946058%2F1350ea8e-445d-4052-b72a-95fd786a4f5c.jpeg</url>
      <title>Forem: Aniket Hingane</title>
      <link>https://forem.com/exploredataaiml</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/exploredataaiml"/>
    <language>en</language>
    <item>
      <title>Resilient Guest-Policy Retrieval: A Self-Healing Semantic Loop for Hotel Context</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Tue, 07 Apr 2026 02:09:19 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/resilient-guest-policy-retrieval-a-self-healing-semantic-loop-for-hotel-context-ejj</link>
      <guid>https://forem.com/exploredataaiml/resilient-guest-policy-retrieval-a-self-healing-semantic-loop-for-hotel-context-ejj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h47mvm7zoh110lr7bzo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h47mvm7zoh110lr7bzo.gif" alt="Cover animation" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I Recovered Weak Matches with Controlled Expansion and Bundled Evidence in a Solo PoC&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspujpj92wsk77sfjfbdo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspujpj92wsk77sfjfbdo.png" alt="Title diagram" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This write-up documents a personal experiment I ran while thinking about how guests actually ask questions at the front desk, on the phone, or in a chat widget. The phrasing is rarely canonical. Someone might say they want a rubdown later today when the policy text says massage appointments and cancellation windows. Another guest might describe late checkout as staying past the morning rush. If you treat every utterance as a perfect keyword match, you will look clever in a slide deck and brittle in real language.&lt;/p&gt;

&lt;p&gt;I built a small Python system that embeds synthetic hotel policy chunks with a compact sentence transformer, measures cosine similarity against guest questions, and applies a narrow healing loop when the score falls below a floor I set by hand. The loop tries controlled synonym expansion first, then merges the top passages into a bundled evidence string when the model still hesitates. I also track synthetic staleness days on each chunk so the PoC can pretend that some documents deserve an offline review queue.&lt;/p&gt;

&lt;p&gt;Nothing here runs in production, nothing here connects to a property I have worked with, and nothing here should be read as advice from a vendor. I am describing an exploratory solo build because that is what it is. The code lives in a public repository so anyone can inspect the assumptions without asking me to narrate them from memory. If you take one idea away, take this one: I cared more about making failure visible and recoverable than about chasing the highest possible retrieval score on a toy corpus. Transparency beat vanity in my priorities for this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hospitality guest operations sit at an interesting intersection of empathy and procedure. Guests want speed and clarity. Operators want consistency and traceability. Tools in the middle often promise both and deliver neither if they hide how an answer was assembled. I have been thinking about that tension while experimenting with retrieval stacks that admit their own uncertainty. This article is my attempt to write down what I tried, what I measured informally, and where I stopped on purpose.&lt;/p&gt;

&lt;p&gt;Before I describe the modules, I want to anchor the posture of this article. I am writing as an individual who builds experiments in public to learn, not as someone who is reporting on a deployed guest assistant or claiming validated operational outcomes. Hospitality is easy to romanticize and easy to misrepresent. I have watched people ask oddly phrased questions in lobbies and elevators, not as a researcher with formal instruments but as someone who notices wording when I travel. The questions are rarely perfect. A person might compress three concerns into one sentence about noise, timing, and fairness. If you flatten that into a single embedding and hope for the best, you can still retrieve plausible text, but you lose the story of why the match was weak in the first place. That loss matters to me as an engineer because I like systems that admit uncertainty instead of laundering it behind fluent text.&lt;/p&gt;

&lt;p&gt;In my PoC I chose a hotel guest-operations framing because it is relatable, easy to illustrate with synthetic documents, and distinct from other domains I have written about recently in this personal series. I am not describing revenue management, banquet sales, or accounting. I am staying with a narrow slice: how a policy-grounded assistant might assemble evidence before any human writes a guest-facing sentence. I also want to be clear about the human layer. Staff use judgment. A system that pretends to replace that judgment with a single score is not something I would defend. What I am experimenting with is a structured scaffold that keeps evidence visible so a human can still override.&lt;/p&gt;

&lt;p&gt;I wrote this article because I wanted a serious project that still fits on a laptop. I have seen enough demos where a model produces fluent language and hides the underlying evidence. I wanted the opposite. I wanted logs that read like engineering notes, not marketing copy. The code prints healing actions such as none, synonym_expand, or context_merge, and it writes a small matplotlib chart so I can see whether the batch run skewed toward one outcome because of a bug or because of the wording of my synthetic questions.&lt;/p&gt;

&lt;p&gt;There is another motivation I should state plainly. I am interested in practices that survive contact with messy language. People shorten words, omit nouns, and rely on context. They say the morning rush thing instead of spelling out late checkout. Any retrieval system that assumes the query already contains canonical terms will fail in ways that look embarrassing on a demo but painful in real life. I did not solve that fully here. I only created a place to talk about it honestly while still writing code.&lt;/p&gt;

&lt;p&gt;I also want readers to know the scope boundary I used while writing. This article discusses a synthetic dataset and illustrative thresholds. It does not describe any real hotel brand, franchise agreement, or property staffing model. If a phrase resembles language you have seen in the wild, that is because operational writing converges on similar vocabulary, not because I copied private material.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on language and tone
&lt;/h3&gt;

&lt;p&gt;I chose neutral, procedural wording for the synthetic policies on purpose. I did not want sensational examples that read like a thriller. Real front-desk life already carries enough stress without my demo adding theatrical conflict. I also avoided idioms that only make sense in one region. The point was to keep the text boring enough that retrieval mistakes are visible instead of being masked by narrative drama.&lt;/p&gt;

&lt;h3&gt;
  
  
  How this article relates to my other experiments
&lt;/h3&gt;

&lt;p&gt;I have written about routing and retrieval in other contexts. This piece is different because the healing loop is the protagonist. I am not showcasing a multi-agent cast. I am showcasing a measurement-and-repair cycle that could exist inside many larger systems. If you have read my earlier write-ups, you might recognize my preference for logs over slogans. That preference shows up again here in how I print healing actions and scores.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The article walks through GuestResilience-HotelContext-AI, a Python project that embeds short policy chunks, scores guest questions with cosine similarity, and applies a healing loop when confidence falls below a configurable floor.&lt;/li&gt;
&lt;li&gt;I explain why I combined semantic retrieval with a hand-built synonym map rather than relying on either signal alone.&lt;/li&gt;
&lt;li&gt;I show how the batch table and matplotlib chart help me see whether the demo skews toward synonym expansion because of the synonym list or because of the embedding geometry on a tiny corpus.&lt;/li&gt;
&lt;li&gt;I discuss limitations honestly: miniature corpora, heuristic floors, and a merge score that is not a calibrated probability.&lt;/li&gt;
&lt;li&gt;I include a code walkthrough that mirrors how I read the repository myself when I return to it after a gap.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Runtime expectations on a laptop
&lt;/h3&gt;

&lt;p&gt;I tested this PoC on a recent Mac laptop with a normal consumer CPU. Inference time for a batch of half a dozen questions is small enough that I did not bother printing millisecond timings in the CLI. If you run on older hardware, the first embedding pass over the chunks might take longer, but it remains a one-time cost per process start. I mention hardware because retrieval demos often silently assume a GPU. I did not require a GPU for this code path.&lt;/p&gt;

&lt;p&gt;The implementation is intentionally straightforward. I rely on Python 3.10 or newer, NumPy, scikit-learn largely as a transitive dependency, sentence-transformers with the all-MiniLM-L6-v2 model for normalized embeddings, matplotlib for a bar chart, and Rich for readable terminal tables. There is no hosted vector database and no cloud requirement for the retrieval math itself. The entire index fits in memory because I refused to pretend this PoC is big data.&lt;/p&gt;

&lt;p&gt;From where I stand, that stack is enough to demonstrate that small-scale resilience can be practiced in transparent steps when the corpus is tiny and the goal is structured evidence rather than open-ended generation. If I later swap MiniLM for another encoder, the interfaces around chunking and healing remain stable, which was a design goal while I sketched the modules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0olsyv2yls11nhjq2vfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0olsyv2yls11nhjq2vfi.png" alt="System architecture" width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;If you are evaluating how to structure pre-model logic for operational assistants, this article offers a concrete pattern: measure confidence, attempt a deterministic repair, then widen evidence before you give up.&lt;/li&gt;
&lt;li&gt;If you are learning sentence-transformers with cosine similarity, the retrieval module is short and testable.&lt;/li&gt;
&lt;li&gt;If you care about reproducibility, the orchestrator gives you a baseline against which any future learned rewriter can be compared.&lt;/li&gt;
&lt;li&gt;I think the read is most useful for practitioners who want a middle ground between pure neural retrieval and pure rules, because the code shows exactly where those worlds meet in my PoC.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is also a pedagogical angle I care about. Many tutorials jump straight to large language models for every turn without establishing a measurement story. I am not anti-LLM; I use them elsewhere. But I believe beginners should see cosine similarity on explicit vectors at least once, because it demystifies what nearest neighbor means in code rather than in marketing language.&lt;/p&gt;
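&lt;p&gt;To make that concrete, here is a minimal NumPy sketch of cosine similarity over explicit toy vectors. The vectors and chunk names are invented for illustration and do not come from the repository; a real encoder would produce vectors with hundreds of dimensions.&lt;/p&gt;

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for a query and two policy chunks.
query = np.array([0.9, 0.1, 0.2])
chunks = {
    "massage policy": np.array([0.8, 0.2, 0.1]),
    "parking policy": np.array([0.1, 0.9, 0.3]),
}

scores = {name: cosine(query, vec) for name, vec in chunks.items()}
best = max(scores, key=scores.get)  # "nearest neighbor" is just argmax
```

Seen this way, nearest neighbor retrieval is nothing more mysterious than picking the key with the largest score.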

&lt;p&gt;Finally, if you maintain open-source examples, you know the burden of dependencies. I kept the stack bounded so a reader in a constrained environment can still run the demo after accepting the one-time model download.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Framing the problem without overfitting the story
&lt;/h3&gt;

&lt;p&gt;Before touching code, I spent time writing short synthetic guest questions on paper. I noticed recurring patterns: some messages emphasize time pressure early, others bury the actionable detail in the second half, and a few mix wellness language with policy language. I did not try to split multi-intent questions into multiple tickets in this repository. Instead, I focused on a single-text input so the healing loop stays easy to reason about. That choice trades realism for clarity, and I am comfortable stating that upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why semantic similarity plus a synonym map
&lt;/h3&gt;

&lt;p&gt;The design starts from an observation I kept returning to while prototyping: informal words are not interchangeable with policy words, yet they often co-occur in real speech. A guest might say rubdown while the document says massage. The embedding model sometimes closes that gap on its own, and sometimes it does not. I did not want a black box rewrite of the query. I wanted a controlled expansion list that I can audit, prune, and argue about in a code review with myself.&lt;/p&gt;

&lt;p&gt;The retrieval layer builds a normalized embedding matrix for every chunk. For each query, I compute cosine similarity as a dot product because the vectors are unit length. I take the top five for debugging, but the orchestration decision hinges on the best score versus a floor constant defined in RetrievalConfig.&lt;/p&gt;
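&lt;p&gt;A sketch of that math, with random stand-in vectors rather than real MiniLM embeddings: once every chunk row and the query are unit length, the cosine scores are one matrix-vector product, and the top five are an argsort away.&lt;/p&gt;

```python
import numpy as np

# Illustrative stand-in for the chunk embedding matrix: 8 chunks, 4 dims.
rng = np.random.default_rng(0)
chunk_matrix = rng.normal(size=(8, 4))
chunk_matrix /= np.linalg.norm(chunk_matrix, axis=1, keepdims=True)

# A unit-length query vector, also random here.
qvec = rng.normal(size=4)
qvec /= np.linalg.norm(qvec)

# Because all vectors are normalized, cosine similarity is a dot product.
scores = chunk_matrix @ qvec
top5 = np.argsort(scores)[::-1][:5]   # indices of the five best chunks
best_score = float(scores[top5[0]])   # compared against the floor
```

The orchestration then only needs `best_score` and the similarity floor; the other four results exist for debugging and for the merge path.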

&lt;h3&gt;
  
  
  The healing step as a confidence gate
&lt;/h3&gt;

&lt;p&gt;I use the word healing in a narrow sense. There is no autonomous agent calling external tools without bounds. I am referring to a decision step that tries synonym expansion when the first pass looks weak, then merges the top passages when the expanded query still fails to clear the floor. That is not deep reasoning. It is a guardrail with two rungs. I still call it healing in the sense that the system attempts to repair a weak match before it declares the evidence unreliable.&lt;/p&gt;

&lt;p&gt;If I were to extend this experiment, I would log the floor crossings and measure how often expansion helps relative to merge. In this PoC, I only observe the behavior in the console and in the chart.&lt;/p&gt;
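&lt;p&gt;For readers who want the shape of the first rung, here is a hedged sketch of a synonym expander. The repository's actual function and its full map may differ in detail; the triggers and tokens below are a trimmed example.&lt;/p&gt;

```python
# Illustrative subset of a trigger-to-tokens map; not the full repo map.
SYNONYM_EXPANSIONS = {
    "rubdown": ("massage", "wellness", "appointment"),
    "loud": ("quiet hours", "noise", "complaints"),
}

def expand_query(raw_query):
    """Append mapped tokens for any trigger found; None if nothing fired."""
    lowered = raw_query.lower()
    extra = []
    for trigger, tokens in SYNONYM_EXPANSIONS.items():
        if trigger in lowered:
            # Only add tokens the guest did not already say.
            extra.extend(t for t in tokens if t not in lowered)
    if not extra:
        return None
    return raw_query + " " + " ".join(extra)
```

Returning None when no trigger matches keeps the control flow honest: the caller can distinguish "expansion was not applicable" from "expansion ran and did not help".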

&lt;h3&gt;
  
  
  Observability as a first-class requirement
&lt;/h3&gt;

&lt;p&gt;I insisted on ASCII-friendly batch output because I wanted copy-pasteable logs for my own notes. Rich tables are not strictly necessary, but they make the first screen readable when I am tired. The matplotlib chart is a concession to the human visual system. Even a simple bar chart changes how I perceive imbalance across healing actions.&lt;/p&gt;
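&lt;p&gt;The chart itself needs nothing fancy. A sketch of the kind of bar chart I mean, with made-up healing-action counts standing in for a real batch run:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# Illustrative action log; the real values come from the batch output.
actions = ["none", "synonym_expand", "none", "context_merge",
           "synonym_expand", "synonym_expand"]
counts = Counter(actions)

fig, ax = plt.subplots()
ax.bar(list(counts.keys()), list(counts.values()))
ax.set_xlabel("healing action")
ax.set_ylabel("questions")
ax.set_title("Healing actions across the batch")
fig.savefig("healing_actions.png")
```

Even at this size, a skew toward one bar is immediately visible in a way a table of six rows is not.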

&lt;h3&gt;
  
  
  Ethics and guest-facing tone
&lt;/h3&gt;

&lt;p&gt;I thought carefully about how synthetic hospitality language can still carry real emotional weight for readers. I avoided sensational scenarios. I kept policies dull on purpose because dull policy text is what operational systems ingest. I also avoided implying that this PoC could triage safety-critical incidents. It cannot. It is a toy corpus with toy thresholds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Get Cooking
&lt;/h2&gt;

&lt;p&gt;The public repository is here: &lt;a href="https://github.com/aniket-work/GuestResilience-HotelContext-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/GuestResilience-HotelContext-AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will highlight three slices of the code that capture the spirit of the build: configuration boundaries, the healing orchestration, and the batch entry point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration as a contract with future me
&lt;/h3&gt;

&lt;p&gt;I centralized thresholds and the synonym map in one module so I cannot pretend magic numbers appeared from nowhere. The floor is a single float. The synonym map is a dictionary from informal triggers to extra tokens that nudge the embedding toward policy vocabulary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tunable thresholds for the PoC; not tuned on production traffic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;similarity_floor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;
    &lt;span class="n"&gt;staleness_warning_days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="n"&gt;SYNONYM_EXPANSIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;swimming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aquatics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lap pool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;towels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;massage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wellness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appointment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rubdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;massage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wellness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appointment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cancellation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;late checkout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;late checkout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rush&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quiet hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complaints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;noise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neighbors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tesla&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;charging&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;garage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kilowatt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overnight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;housekeeping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;towels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privacy mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tablet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote the dataclass as frozen because I wanted the configuration object to behave like a value I pass around without accidental mutation during a late-night edit. In my opinion, small immutability choices reduce self-inflicted bugs in solo projects too. The synonym map is deliberately limited. I did not try to learn it from data in this repository because I wanted an honest baseline I could explain without pointing at a training pipeline I do not own.&lt;/p&gt;
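&lt;p&gt;The frozen flag is cheap insurance. A quick illustration of what it buys, using a trimmed-down config with only the similarity floor:&lt;/p&gt;

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class RetrievalConfig:
    similarity_floor: float = 0.45

cfg = RetrievalConfig()
try:
    cfg.similarity_floor = 0.9  # the accidental late-night mutation
except FrozenInstanceError:
    mutated = False  # frozen=True turns the slip into a loud error
else:
    mutated = True
```

The attempted assignment raises instead of silently changing the threshold mid-run, which is exactly the failure mode I wanted to rule out.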

&lt;h3&gt;
  
  
  Healing orchestration: measure, expand, merge
&lt;/h3&gt;

&lt;p&gt;The orchestration function encodes the query, compares against the floor, optionally expands the query text when informal keywords appear, and finally merges the top passages if the system still cannot climb above the floor. The merge path constructs a synthetic chunk identifier so the log shows that the evidence is bundled rather than a single canonical policy paragraph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;self_healing_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_query&lt;/span&gt;
    &lt;span class="n"&gt;qvec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qvec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;similarity_floor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HealingResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;healing_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieval above similarity floor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_expand_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;evec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ranked_e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked_e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expanded&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;similarity_floor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HealingResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;healing_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synonym_expand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Synonym expansion recovered a confident match.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ranked_for_merge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked_e&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ranked_for_merge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;

    &lt;span class="n"&gt;merged_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;merged_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_merge_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked_for_merge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;merged_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HealingResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;merged_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;merged_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;healing_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context_merge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Merged top passages after low single-chunk confidence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HealingResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;healing_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staleness_flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Healing did not lift score above prior best; flagged for offline refresh.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I read this function cold, I look for three things: whether I accidentally reuse the wrong ranked list after expansion, whether the merge path can invent a score that looks comparable to cosine similarity, and whether the failure mode still returns something inspectable. The merge score is an average of the top similarities. That is not a probability. It is a crude signal so the PoC can prefer a wider bundle over a single weak chunk.&lt;/p&gt;
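&lt;p&gt;To make that concrete, here is a minimal sketch of the merge step as I think of it. The function name and tuple shape are illustrative, not the repository's exact code: bundle the top passages and average their cosine scores, a preference signal rather than a probability.&lt;/p&gt;

```python
# Hypothetical sketch of the merge step, assuming ranked is a list of
# (chunk_text, score) pairs sorted best-first. The averaged score is a
# crude signal for preferring a wider bundle, not a calibrated value.
def merge_context(ranked, limit=3):
    top = ranked[:limit]
    merged_text = "\n---\n".join(text for text, _ in top)
    merged_score = sum(score for _, score in top) / len(top)
    return merged_text, merged_score
```

&lt;p&gt;A three-chunk bundle scoring (0.6, 0.4, 0.2) would come back as one stitched passage with a merged score of 0.4, which is exactly the kind of number that looks comparable to cosine similarity without being one.&lt;/p&gt;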

&lt;h3&gt;
  
  
  Entry point: batch questions and a chart
&lt;/h3&gt;

&lt;p&gt;The main module loads the chunks, embeds them once, runs the first query with a detailed table, then iterates over the batch of demo queries and records each healing action for plotting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;console&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;118&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ROOT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChunkIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_demo_queries&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;res0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;self_healing_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print_single_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;res0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;qid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_demo_queries&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;self_healing_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;healing_action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;qid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;healing_action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="nf"&gt;print_batch_ascii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;out_png&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healing_actions.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;plot_healing_actions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_png&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I structured the demo queries to span direct pool language, slang wellness language, vague checkout language, noisy neighbor language, EV parking language, and opaque housekeeping language. In my experience, that spread is enough to stress the synonym map without pretending the corpus is comprehensive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr2hd9ywkt60ykkw565q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr2hd9ywkt60ykkw565q.png" alt="Agent communication" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What I rejected along the way
&lt;/h3&gt;

&lt;p&gt;I considered a few alternatives before settling on MiniLM plus a manual synonym map for the first public cut. A cross-encoder reranker would likely improve ordering on ambiguous pairs, but it would also double the inference story and tempt me to hide mistakes behind a second model without a clean measurement layer. I decided that demonstrating a two-stage neural stack was not the point of this repository. The point was to show a transparent loop.&lt;/p&gt;

&lt;p&gt;I also thought about BM25 as a lexical backstop. It is a strong baseline for short documents and behaves well when the vocabulary overlap is explicit. On a handful of ten-chunk snippets, the difference between BM25 and TF-IDF style signals is often swamped by the fact that the corpus is tiny. I stayed with dense embeddings because the guest language problem I care about is often semantic drift rather than spelling. I can imagine a hybrid later: dense retrieval for the first pass, BM25 for dispute resolution when two chunks tie.&lt;/p&gt;
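&lt;p&gt;If I ever built that hybrid, the dispute-resolution step might look like the sketch below. Everything here is hypothetical: token overlap stands in for a real BM25 scorer, and the margin is a number I would have to tune.&lt;/p&gt;

```python
# Hedged sketch of a dense-first, lexical-tie-break hybrid. Token overlap
# is a stand-in for BM25; names and the margin are illustrative only.
def lexical_overlap(query, text):
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q.intersection(t)) / max(len(q), 1)

def resolve_tie(query, ranked, margin=0.02):
    """ranked: list of (chunk_text, dense_score) pairs, best first."""
    if len(ranked) > 1 and margin > (ranked[0][1] - ranked[1][1]):
        # dense scores effectively tie; let explicit vocabulary decide
        return max(ranked[:2], key=lambda pair: lexical_overlap(query, pair[0]))
    return ranked[0]
```

&lt;p&gt;The dense pass still owns the ranking; the lexical check only breaks near-ties, which keeps the semantic-drift story intact while giving explicit vocabulary a vote when the embedder shrugs.&lt;/p&gt;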

&lt;h3&gt;
  
  
  Vector store and why I kept it in memory
&lt;/h3&gt;

&lt;p&gt;The ChunkIndex class is intentionally boring. It stores a matrix of normalized embeddings and a parallel list of PolicyChunk objects. Search is a matrix-vector product followed by sorting. I did not use an approximate nearest neighbor index because the row count is ten. Bringing in HNSW or IVF would be theater. I would rather write code that a reader can grep in a single file than pretend this PoC needs a billion-scale index.&lt;/p&gt;
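&lt;p&gt;The shape of that boring class can be sketched in a few lines. The names mirror my description rather than the repository verbatim: normalized rows, one matrix-vector product, a sort.&lt;/p&gt;

```python
import numpy as np

# Minimal in-memory index in the spirit described above. Assumes the
# embedding rows are already unit-normalized, so the product is cosine.
class ChunkIndex:
    def __init__(self, chunks, matrix):
        self.chunks = chunks    # parallel list of chunk objects or texts
        self.matrix = matrix    # shape (n_chunks, dim), rows unit-norm

    def top_k(self, query_vec, k=5):
        scores = self.matrix @ query_vec           # one matrix-vector product
        order = np.argsort(scores)[::-1][:k]       # highest similarity first
        return [(self.chunks[i], float(scores[i])) for i in order]
```

&lt;p&gt;At ten rows, the full sort costs nothing, and every intermediate value is a plain array you can print while debugging a surprising score.&lt;/p&gt;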

&lt;h3&gt;
  
  
  Staleness as a narrative device
&lt;/h3&gt;

&lt;p&gt;Each synthetic chunk carries a staleness_days integer. I use it as a narrative device in the detail string when a chunk crosses a warning threshold. I am not running a real document management system. I am simulating the feeling of operations where a PDF might have been updated last quarter while the embedding still reflects old text. If I ever wire this to a real ingestion pipeline, staleness should come from a database, not a JSON field I edited by hand.&lt;/p&gt;
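&lt;p&gt;The annotation itself is tiny. The threshold below is an assumption for illustration; the repository's actual number may differ, and in a real pipeline both the field and the threshold would come from a document store.&lt;/p&gt;

```python
# Illustrative staleness note, assuming each chunk is a dict carrying the
# hand-edited staleness_days field. The 90-day threshold is an assumption.
STALENESS_WARN_DAYS = 90

def staleness_note(chunk):
    days = chunk.get("staleness_days", 0)
    if days > STALENESS_WARN_DAYS:
        return f"Warning: policy text last refreshed {days} days ago."
    return ""
```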

&lt;h2&gt;
  
  
  Theory in plain language: what cosine similarity is doing here
&lt;/h2&gt;

&lt;p&gt;When I say cosine similarity, I mean the dot product between two unit vectors. The sentence-transformers library can emit normalized embeddings, which turns cosine similarity into a single dot product without a separate magnitude step. That is convenient, but it also means I am trusting the encoder to place paraphrases near each other in angular space. On small corpora, the geometry can be surprisingly sharp. Two chunks that look similar to me might sit far apart because the model latched onto different function words.&lt;/p&gt;
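&lt;p&gt;The equivalence is easy to check numerically. The snippet below is a standalone demonstration, not code from the repository: once both vectors are length-normalized, the full cosine formula and a bare dot product agree.&lt;/p&gt;

```python
import numpy as np

# Demonstrating the claim above: cosine similarity between unit vectors
# reduces to a single dot product, with no separate magnitude step.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# both expressions agree (about 0.96 for this pair)
print(cosine(a, b))
print(float(np.dot(a_unit, b_unit)))
```

&lt;p&gt;This is why asking sentence-transformers for normalized embeddings up front is worth it: search becomes one matrix product, and the scores stay directly comparable across queries.&lt;/p&gt;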

&lt;p&gt;I spent time thinking about what a score of 0.45 means versus 0.65. In a calibrated probabilistic system, those numbers would come with a story about calibration. Here, they are relative ranks with a threshold I set manually. I want to be explicit about that because it is easy to reify cosine scores as confidence when they are not. In my PoC, the score is a compass needle, not a verdict.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge cases that kept me honest
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;If a guest uses a phrase that matches multiple synonym keys, the expansion step concatenates extra tokens. That can help or hurt. More tokens add noise. I mitigated this by keeping the synonym tuples short and topic-aligned.&lt;/li&gt;
&lt;li&gt;If the first retrieval is already above the floor, I do not attempt healing. That is intentional. I did not want a system that always second-guesses a strong match.&lt;/li&gt;
&lt;li&gt;If expansion fails and merge still looks weak, I fall back to staleness_flag. That label is intentionally unsatisfying. It is a reminder that some queries need a human or a richer corpus, not another heuristic.&lt;/li&gt;
&lt;/ol&gt;
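&lt;p&gt;The first edge case is easier to see with a sketch of the expansion step. The synonym map and function name here are illustrative stand-ins for the repository's own table, but the concatenation behavior, including the multi-key case, is the behavior I mean.&lt;/p&gt;

```python
# Hedged sketch of query expansion via a manual synonym map. Every key
# that matches contributes its tokens, which is why overlapping keys can
# add noise; keeping the tuples short and topic-aligned limits the damage.
SYNONYMS = {
    "chill": ("pool", "relax", "swim"),
    "crash": ("sleep", "quiet hours", "noise"),
}

def expand_query(raw_query):
    lowered = raw_query.lower()
    extra = []
    for key, terms in SYNONYMS.items():
        if key in lowered:
            extra.extend(terms)
    if not extra:
        return None    # nothing matched; caller keeps the original query
    return raw_query + " " + " ".join(extra)
```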

&lt;h3&gt;
  
  
  Personal workflow notes from building solo
&lt;/h3&gt;

&lt;p&gt;I kept a paper notebook beside the keyboard while I wrote the synthetic questions. That sounds quaint, but it slowed me down in a useful way. When I type questions directly into code, I optimize for short strings. When I write them on paper, I leave in awkwardness. Awkwardness is the point. I also tracked my own confusion: if I could not remember why a threshold existed a week later, I renamed a variable or added a comment. I am not claiming perfect documentation. I am claiming that solo work still benefits from a future reader, and that future reader is often me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Set Up
&lt;/h2&gt;

&lt;p&gt;Step-by-step details can be found in the repository README. At a high level, I create a virtual environment inside the project directory, install requirements, and run &lt;code&gt;python main.py&lt;/code&gt;. The first execution downloads the sentence-transformers weights, which is the longest step. I prefer keeping the virtual environment local to the project so the PoC stays self-contained when I archive it months later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deeper code walkthrough: embedder and index
&lt;/h2&gt;

&lt;p&gt;The embedder module memoizes the SentenceTransformer model so repeated calls within a run do not reload weights. I encode all chunk texts once at startup, then reuse the matrix for every query. That is standard batching discipline, but it matters when I iterate on questions because the expensive work stays amortized.&lt;/p&gt;
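&lt;p&gt;The memoization pattern is standard library material. The sketch below uses a counter in place of the real SentenceTransformer constructor so the caching behavior is visible without a model download; the model name is the one I use, but the loader itself is illustrative.&lt;/p&gt;

```python
from functools import lru_cache

# Sketch of the memoized loader. A counter stands in for the expensive
# SentenceTransformer construction so the single-load behavior is testable.
LOAD_COUNT = {"n": 0}

@lru_cache(maxsize=1)
def get_model(name="all-MiniLM-L6-v2"):
    LOAD_COUNT["n"] += 1    # expensive construction happens exactly once
    return f"model:{name}"

get_model()
get_model()
print(LOAD_COUNT["n"])    # 1: the second call reuses the cached model
```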

&lt;p&gt;The vector store computes similarity as a dot product between the query vector and each row of the matrix. I take the top five for debugging even though decisions only need the top one. I do that because I want to inspect near-misses when a score looks wrong. In my experience, the second-best chunk often explains why the first-best chunk is misleading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap I would pursue if this stayed a hobby
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Add a small evaluation harness with paraphrased questions and a simple precision-at-one metric.&lt;/li&gt;
&lt;li&gt;Swap the manual synonym map for a learned sparse expansion that I can still audit, or for a curated ontology from a domain I own.&lt;/li&gt;
&lt;li&gt;Introduce an explicit human-approval flag in the output for any evidence bundle that includes merged chunks.&lt;/li&gt;
&lt;li&gt;Explore a lightweight reranker only after the measurement harness exists, because I refuse to stack models without a baseline.&lt;/li&gt;
&lt;/ol&gt;
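&lt;p&gt;The first roadmap item is small enough to sketch now. Everything below is hypothetical scaffolding: labeled pairs map a question to the chunk id a human marked correct, and &lt;code&gt;retrieve&lt;/code&gt; is any callable returning the top chunk id.&lt;/p&gt;

```python
# Hedged sketch of a precision-at-one harness, the metric named above.
# labeled_pairs: list of (question, gold_chunk_id); retrieve: callable
# returning the top chunk id for a question. Names are illustrative.
def precision_at_one(labeled_pairs, retrieve):
    hits = sum(1 for question, gold_id in labeled_pairs
               if retrieve(question) == gold_id)
    return hits / max(len(labeled_pairs), 1)
```

&lt;p&gt;Sweeping the similarity floor against this one number would tell me, cheaply, whether expansion and merge actually change the top chunk for the better.&lt;/p&gt;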

&lt;h2&gt;
  
  
  Reflections on reliability and guest trust
&lt;/h2&gt;

&lt;p&gt;Reliability is not only a technical score. It is also the feeling a staff member gets when they read the evidence. If the evidence is verbose, contradictory, or obviously stitched, trust drops even when a cosine score is high. I thought about that while designing the merge path. Bundling top passages is a blunt instrument. It increases recall at the cost of readability. In a production setting, I would want a summarization step that cites chunk identifiers, and I would want those identifiers to map back to a source document version. None of that exists here. I am describing what I would do next, not what I shipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When the script finishes, I expect a Rich table for the first query, an ASCII batch summary, and a matplotlib chart saved to &lt;code&gt;output/healing_actions.png&lt;/code&gt;. I treat that chart as a sanity check. If every bar lands in one category, I suspect a bug or a threshold that is too aggressive.&lt;/p&gt;

&lt;p&gt;I usually run the script twice in a row during development. The first run pays the model download cost if the cache is cold. The second run is the one I use to compare output after a code change, because it removes network noise from the picture. That habit saved me from chasing ghosts more than once when I thought my logic changed scores but the real difference was initialization time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dusdrjljbqnh6lp2l1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dusdrjljbqnh6lp2l1f.png" alt="Workflow" width="498" height="1299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would measure if I turned this into a longer study
&lt;/h2&gt;

&lt;p&gt;I am not running a formal benchmark in this repository. I want to be explicit about that gap because benchmarks are where retrieval claims go to become honest or fall apart. If I had another month of evenings, I would build a small labeled set of question-and-chunk pairs derived from the same synthetic corpus, then sweep the similarity floor and record how often expansion or merge changes the top chunk. I would also measure latency on a cold start versus a warm start, because the sentence-transformers download is the kind of friction that changes whether a demo feels credible in a conference room with spotty Wi-Fi.&lt;/p&gt;

&lt;p&gt;I would also track how often merged bundles confuse a human reader. That is a qualitative metric, but it matters. A merge that improves cosine similarity but produces a wall of text is not a win if the goal is staff confidence. In my opinion, human readability should be a first-class metric alongside rank metrics, even if it is harder to automate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I did not ship a full chatbot wrapper
&lt;/h2&gt;

&lt;p&gt;A full chatbot would need session management, safety filters, and a clear escalation path. Those layers are important, but they would dilute the retrieval story I wanted to tell. I kept the surface area small on purpose. The CLI prints evidence. That is enough for me to judge whether the retrieval layer is behaving, and it is enough for a reader to fork the repository without inheriting a web stack I do not want to maintain as a solo author.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependencies, downloads, and the social contract of open weights
&lt;/h2&gt;

&lt;p&gt;I rely on publicly available weights. That choice carries a social contract: read the license, respect attribution, and do not pretend the model is neutral truth. I also accept the reality that first-time downloads can fail for reasons outside my code. I mention this because newcomers sometimes blame their own competence when the network hiccups during a Hugging Face download. If that happened to you while reproducing my PoC, retrying with a stable connection usually fixes it. If it persists, mirror the weights locally and point the configuration at your mirror. I did not bake a mirror into the repository because I did not want to privilege a single hosting strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Narrative distance from production
&lt;/h2&gt;

&lt;p&gt;I keep repeating that this is experimental because I want the distance to be obvious. Production systems have change control, incident response, and accountability chains I am not simulating. When I say staleness_flag, I am not claiming an operational incident ticket exists. I am labeling a branch in my code. That distinction matters if someone reads this article quickly and assumes they can paste the repository into a live environment without additional work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I learned about my own habits
&lt;/h3&gt;

&lt;p&gt;I noticed that I reached for matplotlib faster than I reached for unit tests in the first week. That is not a brag. It is a confession. Charts feel like progress. Tests feel like discipline. In a longer project I would add tests around the synonym expansion function and the merge path because those are the places where silent bugs hide. For this PoC, I relied on manual inspection and repeated runs. I am documenting that choice because I want readers who clone the repository to know where rigor ends and storytelling begins in my own process.&lt;/p&gt;

&lt;h3&gt;
  
  
  A word on naming
&lt;/h3&gt;

&lt;p&gt;I named the repository GuestResilience-HotelContext-AI because I wanted the words to sound operational without sounding like a product SKU. Names matter when you revisit a folder six months later. I have abandoned enough cleverly named experiments to appreciate boring clarity.&lt;/p&gt;

&lt;p&gt;I started this experiment because I wanted a personal answer to a simple question: what does resilience mean when the model is small, the corpus is synthetic, and the user language is sloppy? My answer, for now, is that resilience looks like measurement first, bounded repairs second, and honest failure labels third. I do not think that answer is universal. I think it is a reasonable discipline for a PoC that might otherwise collapse into storytelling.&lt;/p&gt;

&lt;p&gt;If I revisit the project, the first upgrade I would consider is a principled evaluation split: hold out chunks, paraphrase questions, and quantify how often expansion helps versus hurts. The second upgrade would be a real staleness pipeline, not a numeric field I typed by hand. The third upgrade would be an explicit separation between guest-facing summarization and evidence retrieval, even if both use models, because commingling them erodes auditability.&lt;/p&gt;

&lt;p&gt;If nothing else, I hope this write-up convinces you that resilience can be practiced as a discipline even when the dataset is synthetic. The point is not to win a leaderboard. The point is to build a habit of measuring, repairing, and labeling failure without embarrassment.&lt;/p&gt;

&lt;p&gt;I also want to leave you with a caution I apply to my own demos. Hospitality language intersects with accessibility, safety, and fairness. A retrieval stack that looks clever on a developer laptop can still be wrong in ways that matter. I wrote this as an experimental article precisely because I want room to be humble about those limits.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: python, rag, machinelearning, hospitality&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading this far. I know it is a long piece. I wrote it at this length because I wanted the reasoning trail to be inspectable, not because I enjoy typing for its own sake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>hospitality</category>
    </item>
    <item>
      <title>Layered Agentic Retrieval for Retail Floor Questions: A Solo PoC</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Sat, 04 Apr 2026 02:32:49 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/layered-agentic-retrieval-for-retail-floor-questions-a-solo-poc-221g</link>
      <guid>https://forem.com/exploredataaiml/layered-agentic-retrieval-for-retail-floor-questions-a-solo-poc-221g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonlhrmxvfa2w3s37iv6f.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonlhrmxvfa2w3s37iv6f.gif" alt="Cover animation" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I Routed Associate Questions Across Specialized TF-IDF Indexes Before Assembly&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fona0gdwdbdmvnm9nc4x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fona0gdwdbdmvnm9nc4x9.png" alt="Title diagram" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This write-up documents a personal experiment I ran while thinking about how retail associates actually use knowledge in the moment. A shopper rarely asks a question that fits neatly into a single policy PDF. The phrasing is noisy, the intent is mixed, and the clock is always ticking. I built a small Python system that treats each question as a routing problem rather than a retrieval problem with a single index. Three independent TF-IDF corpora stand in for returns policies, product care guidance, and service-floor procedures. An orchestrator scores each domain, retrieves top hits from the winner, and optionally blends in a second domain when the primary score looks weak. I kept the entire pipeline on-device without calling a hosted language model, because I wanted the evidence to be inspectable and reproducible on a laptop. The repository is public for learning purposes only; it is not a product recommendation, not a deployment blueprint, and not connected to anything I have shipped at a job. I am describing it as a proof of concept because that is what it is, and I am careful not to claim that this small corpus behaves like a real enterprise knowledge base. If you only take one sentence away, take this: I cared more about inspectable routing than about impressing anyone with model names, and that priority shaped every file I wrote.&lt;/p&gt;
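&lt;p&gt;The routing decision at the heart of that description can be sketched in a few lines. The names, scores, and weak-primary floor below are illustrative, not the repository's exact values: score each domain index, take the winner, and blend in the runner-up only when the primary looks weak.&lt;/p&gt;

```python
# Hedged sketch of the orchestrator's routing decision. domain_scores is
# assumed to hold each domain's best TF-IDF similarity for the question;
# the weak_floor threshold is an illustrative number, not a tuned one.
def route(domain_scores, weak_floor=0.35):
    ordered = sorted(domain_scores.items(), key=lambda kv: kv[1], reverse=True)
    primary, primary_score = ordered[0]
    picked = [primary]
    if weak_floor > primary_score and len(ordered) > 1:
        picked.append(ordered[1][0])    # blend a second domain on weak primary
    return picked
```

&lt;p&gt;A confident returns question stays in one lane; a mumbled mixed question pulls in a second corpus, and the picked list itself is the evidence trail I care about.&lt;/p&gt;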

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Before I describe the code, I want to anchor the emotional posture of this article. I am writing as an individual who builds experiments in public to learn, not as someone who is reporting on a deployed system or claiming validated business outcomes. That framing matters because retail is easy to romanticize and easy to misrepresent. I have spent a fair amount of time watching how people ask questions in retail settings, not as a researcher with formal instruments but as someone who pays attention to phrasing when I am in line or when I am helping a friend think through a store policy. The questions are rarely perfect. Someone might say “I need to return this” while also mentioning “the coating on the jacket feels wrong after one wash.” That sentence mixes two worlds. One world is about returns and receipts. The other is about care instructions and product durability. If you flatten everything into one retrieval index, you can still get plausible text, but you lose the ability to explain why a snippet was chosen. That loss of explainability matters to me as an engineer, not because I dislike neural models, but because I like to know which shelf the system reached for first.&lt;/p&gt;

&lt;p&gt;In my PoC I chose a retail floor framing because it is relatable, is easy to illustrate with synthetic documents, and steers clear of domains I have deliberately avoided in this personal writing series. I am not describing inventory optimization, warehouse counts, or pricing strategy here. I am not touching financial advice. I am staying with a narrow slice: how an associate-facing assistant might assemble evidence before anyone writes a customer-facing sentence. I also want to be clear about the social layer. Associates are not robots. They use judgment. A system that pretends to replace that judgment with a single score is not something I would defend. What I am experimenting with is a structured scaffold that keeps evidence visible so a human can still override.&lt;/p&gt;

&lt;p&gt;I wrote this article because I wanted a serious project that still fits on a laptop. I have seen enough demos where a model produces fluent language and hides the underlying evidence. I wanted the opposite. I wanted logs that read like engineering notes, not marketing copy. The code is structured so that I can print a bundle of evidence rows with identifiers and cosine scores. That is not glamorous, but it is the kind of transparency I find useful when I iterate.&lt;/p&gt;

&lt;p&gt;There is another motivation I should state plainly. I am interested in practices that survive contact with messy language. People shorten words, omit nouns, and rely on context. They say “the thirty-day thing” instead of “the return window policy.” Any retrieval system that assumes the query already contains canonical terms will fail in ways that look embarrassing in a demo and feel painful in real life. I did not solve that fully here. I only created a place to talk about it honestly while still writing code.&lt;/p&gt;

&lt;p&gt;I also want readers to know the scope boundary I used while writing. This article discusses a synthetic dataset and illustrative scoring rules. It does not describe any real retailer’s policies, staffing model, or vendor contracts. If a phrase resembles language you have seen in the wild, that is because operational writing converges on similar vocabulary, not because I copied private material.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The article walks through RetailFloor-AgenticRouter-AI, a Python project that ingests short customer-style questions, scores three separate TF-IDF indexes, and retrieves evidence from the best-matching domains before optional blending.&lt;/li&gt;
&lt;li&gt;I explain why I combined lexical hints with cosine similarity rather than relying on either signal in isolation.&lt;/li&gt;
&lt;li&gt;I show how the batch table and matplotlib chart help me see whether the demo is skewing toward one domain because of a bug or because of the wording of the synthetic questions.&lt;/li&gt;
&lt;li&gt;I discuss limitations honestly: tiny corpora, linear scoring, and heuristic thresholds are not the same as a live knowledge system.&lt;/li&gt;
&lt;li&gt;I include a code walkthrough that mirrors how I read the repository myself when I return to it after a gap.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;The implementation is intentionally boring in a good way. I rely on Python 3.10 or newer, NumPy, scikit-learn for TF-IDF and cosine similarity, matplotlib for a bar chart, and Rich for readable terminal output. There is no hosted vector database and no cloud requirement; the entire index fits in memory.&lt;/p&gt;

&lt;p&gt;From where I stand, that stack is enough to demonstrate the idea that “agentic routing” in the small can be practiced with classical IR tooling when the corpus is tiny and the goal is structured assembly rather than open-ended generation. If I later swap TF-IDF for embeddings, the per-domain interfaces remain stable, which was a design goal while I sketched the modules.&lt;/p&gt;
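&lt;p&gt;To make that stable per-domain interface concrete, here is a minimal sketch of the kind of index container the later &lt;code&gt;retrieve_top_k&lt;/code&gt; code assumes (&lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;vectorizer&lt;/code&gt;, &lt;code&gt;doc_matrix&lt;/code&gt;). The &lt;code&gt;PolicyDoc&lt;/code&gt; and &lt;code&gt;build_index&lt;/code&gt; names are mine for illustration, not necessarily the repository's:&lt;/p&gt;

```python
from dataclasses import dataclass
from sklearn.feature_extraction.text import TfidfVectorizer

@dataclass
class PolicyDoc:
    doc_id: str
    text: str

@dataclass
class DomainIndex:
    docs: list            # the PolicyDoc objects, in matrix row order
    vectorizer: TfidfVectorizer
    doc_matrix: object    # one TF-IDF row per document

def build_index(docs: list) -> DomainIndex:
    # Fit one vectorizer per domain so vocabularies stay independent.
    vec = TfidfVectorizer()
    matrix = vec.fit_transform(d.text for d in docs)
    return DomainIndex(docs=docs, vectorizer=vec, doc_matrix=matrix)

returns_index = build_index([
    PolicyDoc("RET-01", "Returns accepted within 30 days with receipt and tags attached."),
    PolicyDoc("RET-02", "Final-sale items are not eligible for return or exchange."),
])
```

&lt;p&gt;Because the interface is just three fields, swapping TF-IDF for embeddings later only changes what &lt;code&gt;build_index&lt;/code&gt; puts in &lt;code&gt;doc_matrix&lt;/code&gt;.&lt;/p&gt;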

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg3y2jcjlxy1v5ye7l8p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg3y2jcjlxy1v5ye7l8p.png" alt="System architecture" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;If you are evaluating how to structure prompts or pre-model logic for operational assistants, this article offers a concrete pattern: treat context as composable blocks with clear boundaries.&lt;/li&gt;
&lt;li&gt;If you are learning scikit-learn’s text pipelines, the retrieval module is short and testable.&lt;/li&gt;
&lt;li&gt;If you care about reproducibility, the orchestrator gives you a baseline against which any future learned model can be compared.&lt;/li&gt;
&lt;li&gt;I think the read is most useful for practitioners who want a middle ground between “pure LLM” and “pure rules,” because the code shows exactly where those worlds meet in my PoC.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is also a pedagogical angle I care about. Many tutorials jump straight to embeddings and vector databases without establishing why lexical baselines still matter. I am not anti-embedding; I use them elsewhere. But I believe beginners should see cosine similarity on explicit vectors at least once, because it demystifies what “nearest neighbor” means in code rather than in marketing language.&lt;/p&gt;
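&lt;p&gt;To show what that demystification looks like, here is cosine similarity computed on explicit term-count vectors with nothing but the standard library; the toy three-word vocabulary is invented for illustration:&lt;/p&gt;

```python
import math

def cosine(u: list, v: list) -> float:
    # Dot product over the product of the vector norms; 0.0 for empty vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Term-count vectors over the vocabulary ["return", "receipt", "wash"].
query = [1, 1, 0]        # "return ... receipt"
doc_returns = [2, 1, 0]  # a returns-policy document
doc_care = [0, 0, 3]     # a care-guide document

# The query is "nearer" to the returns document because they share terms.
assert cosine(query, doc_returns) > cosine(query, doc_care)
```

&lt;p&gt;Nearest neighbor is nothing more than sorting documents by this number.&lt;/p&gt;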

&lt;p&gt;Finally, if you maintain open-source examples, you know the burden of dependencies. I kept the stack small so a reader in a constrained environment can still run the demo. That constraint shaped decisions as much as any architectural principle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Framing the problem without overfitting the story
&lt;/h3&gt;

&lt;p&gt;Before touching code, I spent time writing short synthetic shopper questions on paper. I noticed recurring patterns: some messages emphasize receipts and timing early, others bury the actionable detail in the second half, and a few mix care language with service language. I did not try to split multi-intent questions into multiple tickets in this repository. Instead, I focused on a single-text input so the routing stays easy to reason about. That choice trades realism for clarity, and I am comfortable stating that upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why multiple indexes instead of one concatenated corpus
&lt;/h3&gt;

&lt;p&gt;The design starts from a simple observation I kept returning to while prototyping: returns policy text is not interchangeable with product care guidance. They answer different questions. Policies talk about windows, receipts, and eligibility. Care guidance talks about fabric, detergents, and storage. Service-floor procedures talk about pickup, escalations, and documented steps for adjustments. When I mixed those prematurely, I got tangled retrieval results. When I separated them, I could log each domain independently.&lt;/p&gt;

&lt;p&gt;The retrieval layer builds one TF-IDF vectorizer per domain. Each domain has a handful of short documents with identifiers. For each query, I compute a cosine similarity between the query vector and every document vector in that domain, and I take the maximum as a domain strength signal. That is a simple baseline, but it is a baseline I can explain to a colleague without drawing diagrams.&lt;/p&gt;

&lt;p&gt;The orchestration layer combines those strengths with a small lexical hint. The hint is a deliberately limited set of regular expressions that look for words like “return,” “wash,” or “curbside.” I did not try to build a full intent model. I wanted a small nudge that prevents absurd routing when the vector space is sparse. In my opinion, that is a trade-off you can criticize. I would rather hear that criticism than pretend the vector space is larger than it is.&lt;/p&gt;
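&lt;p&gt;A hypothetical version of that lexical hint; the patterns, per-hit weight, and cap below are illustrative stand-ins for whatever the repository actually tunes:&lt;/p&gt;

```python
import re

# Tiny regex lists per domain; each match adds a small, capped nudge
# so the vector signal still dominates routing.
_HINTS = {
    "returns": [r"\breturn\b", r"\breceipt\b", r"\brefund\b"],
    "care": [r"\bwash\b", r"\bfabric\b", r"\bcoating\b"],
    "service": [r"\bcurbside\b", r"\bpickup\b", r"\balteration\b"],
}

def lexical_boost(domain: str, query: str,
                  per_hit: float = 0.08, cap: float = 0.2) -> float:
    hits = sum(bool(re.search(p, query, re.IGNORECASE)) for p in _HINTS[domain])
    return min(cap, per_hit * hits)
```

&lt;p&gt;Capping the boost is what keeps a keyword-heavy question from steamrolling a genuinely stronger cosine match in another domain.&lt;/p&gt;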

&lt;h3&gt;
  
  
  The agentic step as a confidence gate
&lt;/h3&gt;

&lt;p&gt;I use the word “agentic” in a narrow sense. There is no autonomous loop that calls external tools. I am referring to a decision step that can widen retrieval when the primary domain looks weak. If the combined score for the primary domain falls below a threshold I tuned by hand, I pull additional hits from the next-best domain. That is not deep reasoning. It is a guardrail. I still call it agentic in the sense that the system chooses a second retrieval path based on measured confidence rather than a fixed pipeline.&lt;/p&gt;

&lt;p&gt;If I were to extend this experiment, I would log the threshold crossings and measure how often the secondary blend helps. In this PoC, I only observe the behavior in the console.&lt;/p&gt;
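&lt;p&gt;One hedged sketch of what that extension could look like: record a structured row per query at the gate, then compute the blend rate offline. The field names and threshold value are invented for illustration:&lt;/p&gt;

```python
_SECONDARY_THRESHOLD = 0.25  # illustrative value, not the repo's tuned number

def log_gate(records: list, query_id: str, primary_score: float, blended: bool) -> None:
    # Append one structured row per routed query so blend behavior is measurable.
    records.append({"query": query_id,
                    "primary_score": round(primary_score, 3),
                    "blended": blended})

records = []
log_gate(records, "Q-01", 0.41, blended=False)  # above threshold, narrow evidence
log_gate(records, "Q-02", 0.12, blended=True)   # below threshold, secondary blend
blend_rate = sum(r["blended"] for r in records) / len(records)
```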

&lt;h3&gt;
  
  
  Retrieval choices and what I rejected
&lt;/h3&gt;

&lt;p&gt;I considered a few alternatives before settling on TF-IDF for the first public cut. A dense embedding model would likely rank semantically related chunks more robustly, but it would also introduce versioning questions, dependency weight, and reproducibility concerns for readers who just want to clone and run. I decided that demonstrating clean interfaces mattered more than squeezing extra retrieval quality from a miniature corpus.&lt;/p&gt;

&lt;p&gt;I also thought about BM25. It is a strong baseline for lexical tasks and behaves well on short documents. I stayed with TF-IDF largely because the scikit-learn pipeline is familiar to many readers and the difference between BM25 and TF-IDF on a handful of short documents is unlikely to change the story materially. If I expand the corpus by an order of magnitude, BM25 or a hybrid approach becomes more compelling.&lt;/p&gt;
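&lt;p&gt;For readers curious what the rejected alternative actually computes, here is a from-scratch Okapi BM25 sketch with the conventional &lt;code&gt;k1&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; defaults; it is a teaching aid, not code from the repository:&lt;/p&gt;

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    # Okapi BM25: idf-weighted term frequency with length normalization.
    doc_tokens = [d.lower().split() for d in docs]
    n = len(doc_tokens)
    avgdl = sum(len(t) for t in doc_tokens) / n
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = ["returns accepted within 30 days with receipt",
        "wash the jacket cold and air dry"]
scores = bm25_scores("receipt for my returns", docs)
```

&lt;p&gt;On a corpus this small the ranking rarely differs from TF-IDF, which is exactly why I deferred it.&lt;/p&gt;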

&lt;h3&gt;
  
  
  Observability as a first-class requirement
&lt;/h3&gt;

&lt;p&gt;I log the evidence bundle for the first query in every run not because the first query is special, but because it proves the pipeline without drowning the reader in repetition. In a longer study I would probably log structured JSON for every query and ship it to a file, but the PoC keeps stdout readable.&lt;/p&gt;

&lt;p&gt;The matplotlib chart is part of the same philosophy. A batch table tells you what happened row by row; a distribution tells you whether the demo batch skewed toward one domain. In my experiments, skew often revealed mistakes in keyword priorities rather than retrieval mistakes, which surprised me at first.&lt;/p&gt;
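&lt;p&gt;The skew check itself needs nothing heavier than a counter over routed domains; the batch below is illustrative:&lt;/p&gt;

```python
from collections import Counter

# Primary domain chosen for each query in a demo batch.
routed = ["returns", "returns", "care", "service", "returns"]
distribution = Counter(routed)

# The bar chart is then one matplotlib call away, e.g.:
#   plt.bar(distribution.keys(), distribution.values())
```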

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi4b1re7c7v48r0kciij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi4b1re7c7v48r0kciij.png" alt="Runtime sequence" width="752" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Get Cooking
&lt;/h2&gt;

&lt;p&gt;The entry point is &lt;code&gt;main.py&lt;/code&gt;. It keeps the demo batch in one helper so the narrative stays obvious when someone reads top to bottom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_demo_queries&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;(id, short label, customer text)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;returns window&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I bought jeans last week, tags still on, can I still bring them back with my receipt?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;care&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How should I wash this water-resistant jacket without ruining the coating?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... additional synthetic questions ...
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does: it defines the synthetic workload as tuples that include a question identifier, a human-readable label for my own notes, and the free-text body. I structured it this way because separating labels from the text lets me test routing without fabricating metadata inside the prose.&lt;/p&gt;

&lt;p&gt;Why I wrote it this way: early on, I inlined labels as hashtags inside the text and immediately regretted it. Parsing labels from natural language is a separate project. For this PoC, explicit fields keep runs reproducible.&lt;/p&gt;

&lt;p&gt;The orchestration layer is &lt;code&gt;run_agentic_retrieval&lt;/code&gt; in &lt;code&gt;orchestrator.py&lt;/code&gt;. It ranks domains, retrieves primary hits, and optionally blends secondary hits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agentic_retrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IntentDomain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DomainIndex&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_k_per_domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;OrchestratorResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rank_domains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;primary_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secondary_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IntentDomain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RetrievalHit&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;primary_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k_per_domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;primary_hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;rationale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Primary domain &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(combined routing score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;_SECONDARY_THRESHOLD&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;secondary_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;secondary_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k_per_domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;secondary_hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;rationale&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Secondary blend &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(combined=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secondary_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) because primary score stayed below &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_SECONDARY_THRESHOLD&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OrchestratorResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;primary_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;secondary_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secondary_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OrchestratorResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;primary_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;secondary_domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;secondary_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does: it materializes the agentic gate in code. If the primary score is below threshold, I append evidence from the runner-up domain. If not, I keep the evidence narrow.&lt;/p&gt;

&lt;p&gt;Why I wrote it this way: I wanted the branching logic to be explicit and readable. A hidden merge would have made debugging harder when I tuned thresholds.&lt;/p&gt;

&lt;p&gt;The hybrid score itself is computed in &lt;code&gt;rank_domains&lt;/code&gt; by combining cosine strength with lexical boosts. I kept the boosts small so they cannot dominate the vector signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_combined_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IntentDomain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DomainIndex&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IntentDomain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;domain_strength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_lexical_boost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does: it anchors the routing in measurable similarity while still allowing short, common retail words to nudge the domain when the corpus is tiny.&lt;/p&gt;
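&lt;p&gt;The article shows &lt;code&gt;_combined_score&lt;/code&gt; but not &lt;code&gt;rank_domains&lt;/code&gt; itself, so here is one plausible reconstruction; the extra &lt;code&gt;combined_score&lt;/code&gt; parameter keeps the sketch self-contained and is my addition, not the repository's signature:&lt;/p&gt;

```python
def rank_domains(indexes: dict, query: str, combined_score) -> list:
    # Score every domain with the hybrid signal, strongest first.
    return sorted(((d, combined_score(indexes, d, query)) for d in indexes),
                  key=lambda pair: pair[1], reverse=True)

# Stand-in scores so the sketch runs without building real indexes.
fake_scores = {"returns": 0.62, "care": 0.18, "service": 0.07}
ranked = rank_domains(fake_scores, "any query",
                      lambda idx, domain, q: fake_scores[domain])
```

&lt;p&gt;The orchestrator only ever reads &lt;code&gt;ranked[0]&lt;/code&gt; and &lt;code&gt;ranked[1]&lt;/code&gt;, which is why the gate logic stays so short.&lt;/p&gt;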

&lt;p&gt;Per-domain retrieval is standard TF-IDF with cosine similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DomainIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RetrievalHit&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;qv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;sims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;doc_matrix&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sims&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RetrievalHit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;RetrievalHit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
                &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does: it returns the top-k most similar documents within a single domain index. That is the building block the orchestrator repeats.&lt;/p&gt;
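&lt;p&gt;The orchestrator loop itself is easy to sketch. The &lt;code&gt;route_query&lt;/code&gt; and &lt;code&gt;toy_retrieve&lt;/code&gt; names below are my illustration, not the repository's API; the toy retriever scores documents by plain word overlap so the sketch runs without the TF-IDF index.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;def route_query(indexes, retrieve, query, k=3):
    """Call the per-domain retriever for every index and keep the best domain."""
    best_domain, best_hits = None, []
    for domain, index in indexes.items():
        hits = retrieve(index, query, k)
        top = hits[0][1] if hits else 0.0
        best = best_hits[0][1] if best_hits else -1.0
        if top &amp;gt; best:
            best_domain, best_hits = domain, hits
    return best_domain, best_hits

# Toy stand-in for retrieve_top_k: scores docs by word overlap with the query.
def toy_retrieve(index, query, k):
    qwords = set(query.lower().split())
    scored = [(doc_id, len(qwords &amp;amp; set(text.lower().split())) / max(len(qwords), 1))
              for doc_id, text in index]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]

indexes = {
    "returns": [("ret_001", "returns require original packaging and a receipt")],
    "care": [("care_001", "wash running shoes with mild soap and air dry")],
}
domain, hits = route_query(indexes, toy_retrieve, "what is the returns window")
print(domain)  # routes to "returns" in this toy corpus
&lt;/code&gt;&lt;/pre&gt;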

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm7v251i5uhhoic2sjj6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm7v251i5uhhoic2sjj6.png" alt="End-to-end workflow" width="524" height="819"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Set Up
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the public repository from GitHub: &lt;code&gt;https://github.com/aniket-work/RetailFloor-AgenticRouter-AI&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a virtual environment inside the project directory using &lt;code&gt;python -m venv venv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Activate the environment using &lt;code&gt;source venv/bin/activate&lt;/code&gt; on macOS or Linux (on Windows, run &lt;code&gt;venv\Scripts\activate&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Install dependencies with &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step-by-step details are in the repository README. I prefer local virtual environments because they keep experiments isolated and reproducible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;python main.py&lt;/code&gt; from the repository root with the virtual environment active.&lt;/li&gt;
&lt;li&gt;Read the first evidence bundle, which prints the primary domain and the retrieved snippets with scores.&lt;/li&gt;
&lt;li&gt;Read the batch summary table and open &lt;code&gt;output/domain_distribution.png&lt;/code&gt; to see the domain distribution chart rendered with matplotlib.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Theory and personal notes on similarity in tiny corpora
&lt;/h2&gt;

&lt;p&gt;When I describe TF-IDF to someone who has only used embeddings, I often start with frequency rather than geometry. Term frequency inside a document tells you what the document emphasizes. Inverse document frequency down-weights terms that appear everywhere. Once you have a sparse vector per document and a vector for the query, cosine similarity becomes a concrete operation: a dot product normalized by magnitudes. That story is not new. I still find it valuable because it connects the math to the words people actually typed.&lt;/p&gt;
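&lt;p&gt;That "dot product normalized by magnitudes" story fits in a few lines. Here is a hand-built sketch with toy term-count vectors, not the project's actual TF-IDF pipeline:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cosine(a, b):
    # dot product divided by the product of the two magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy term-count vectors over the vocabulary ["return", "packaging", "shoe"];
# the project's real vectors come from the TF-IDF fit, not hand counting
doc = np.array([2.0, 1.0, 0.0])
query = np.array([1.0, 0.0, 0.0])
print(round(cosine(doc, query), 3))  # 0.894
&lt;/code&gt;&lt;/pre&gt;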

&lt;p&gt;In my experiments with this PoC, the corpus is so small that IDF behaves differently than it would on a large intranet crawl. Rare words can dominate. Common words can look overly important if they appear in multiple documents within the same domain. I noticed that when I added a few more sentences to one policy snippet, the relative rankings shifted more than I expected. That sensitivity is not a secret flaw; it is a reminder that retrieval quality tracks corpus curation.&lt;/p&gt;

&lt;p&gt;I also thought about correlation between domains. In a single merged index, a query might retrieve a returns snippet and a care snippet together because both mention “original packaging” or similar phrasing. By splitting indexes, I force the system to decide which domain is primary before I show mixed evidence. That decision can be wrong, but at least it is explicit. In my opinion, explicit wrongness is easier to debug than implicit blending.&lt;/p&gt;

&lt;p&gt;Another topic I want to address is calibration. Cosine scores on TF-IDF are not probabilities. I still print them because relative ranking matters more than absolute numbers for this demo. If I ever publish a more serious evaluation, I would separate “routing accuracy” from “snippet usefulness” and measure them with different labeled sets. For now, I rely on a batch table and a chart, which is humble instrumentation, but it matches the scale of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deeper walkthrough of the batch questions I chose
&lt;/h2&gt;

&lt;p&gt;I picked six questions because they cover different shapes of language without pretending to be a comprehensive benchmark. The returns window question is direct. The care question uses product language. The curbside question pulls service-floor vocabulary. The “ambiguous” running shoe question is intentionally written to stress the boundary between care guidance and subjective comfort language. The gift receipt question tests whether returns language routes cleanly. The escalation question tests service-floor procedures.&lt;/p&gt;

&lt;p&gt;When I first ran the batch, I looked at the primary domain column before I looked at scores. That habit comes from older debugging practice: identify the categorical outcome, then inspect the numeric confidence. In my PoC, I also read the chart to see whether one domain swallowed the batch. If that happened without a good reason, I assumed I had a bug or a badly worded question. That is a cheap sanity check, but it caught a few mistakes while I iterated.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would measure next if I invested another weekend
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Precision and recall for domain routing against a labeled set of at least two hundred queries.&lt;/li&gt;
&lt;li&gt;Mean reciprocal rank for the top evidence snippet within the correct domain.&lt;/li&gt;
&lt;li&gt;Frequency of secondary-blend activations and whether those cases correlate with improved human ratings.&lt;/li&gt;
&lt;/ol&gt;
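&lt;p&gt;For the second item, mean reciprocal rank is small enough to sketch now, even though I have not run it against labeled data. The document identifiers and labels below are hypothetical:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;def mean_reciprocal_rank(ranked_ids_per_query, correct_ids):
    """Average of 1/rank of the first correct snippet per query; 0 if absent."""
    total = 0.0
    for ranked, correct in zip(ranked_ids_per_query, correct_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == correct:
                total += 1.0 / rank
                break
    return total / len(correct_ids)

# hypothetical labels, not a real labeled set from this project
ranked = [["ret_001", "ret_002"], ["care_003", "care_001"]]
labels = ["ret_001", "care_001"]
print(mean_reciprocal_rank(ranked, labels))  # (1/1 + 1/2) / 2 = 0.75
&lt;/code&gt;&lt;/pre&gt;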

&lt;p&gt;I have not done that work here. I am stating it as a matter of intellectual honesty. A chart in a blog post is not an evaluation suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes I observed while iterating
&lt;/h2&gt;

&lt;p&gt;Sometimes the vector score for the primary domain is modest even when the lexical hint is strong. In those cases, the combined score still tends to land on the right domain because the hint is doing real work. The opposite also happened during early drafts: strong cosine matches to the wrong domain because of shared words like “store” or “order.” I addressed some of that by tightening documents so repeated generic words appear less often, but the real fix for a production setting would be more documents and better tokenization choices, not cleverer regex.&lt;/p&gt;

&lt;p&gt;I also saw cases where the secondary blend did not trigger because the primary score crossed the threshold even though a human would still want cross-domain evidence. That is a design tension. If I lower the threshold, I blend more often and risk noisy evidence. If I raise the threshold, I stay pure but miss helpful cross-domain context. I do not think there is a universal answer. I think there is only a policy choice that should be explicit.&lt;/p&gt;
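&lt;p&gt;One way to make that policy choice explicit is a tiny predicate. The threshold and margin values here are illustrative, not the repository's tuned numbers:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;def should_blend(primary_score, secondary_score, threshold=0.35, margin=0.10):
    """Open a second evidence shelf when the primary match is weak, or when
    the runner-up domain scores close enough that a human would want both."""
    if primary_score &amp;lt; threshold:
        return True
    return (primary_score - secondary_score) &amp;lt; margin

print(should_blend(0.20, 0.10))  # True: weak primary
print(should_blend(0.60, 0.55))  # True: close runner-up
print(should_blend(0.60, 0.20))  # False: confident single-domain answer
&lt;/code&gt;&lt;/pre&gt;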

&lt;h2&gt;
  
  
  Why I avoided a hosted model in the core loop
&lt;/h2&gt;

&lt;p&gt;I am not opposed to models. I use them in other projects. In this PoC, I wanted the repository to remain lightweight and the behavior to remain inspectable for readers who may not have API keys or budget. I also wanted to avoid a moving target. Hosted models change versions; retrieval baselines change less often. For a learning artifact, stability matters.&lt;/p&gt;

&lt;p&gt;If I integrated a model later, I would still keep the routing structure. The orchestrator pattern is not tied to TF-IDF. It is a way to decide which evidence shelves to open. In my view, that separation ages well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistics and visualization as a discipline habit
&lt;/h2&gt;

&lt;p&gt;The matplotlib chart is simple: counts of primary domain picks across the batch. I still find it useful because it forces a second perspective on the same data. Tables can hide imbalance when you are focused on individual rows. A bar chart makes imbalance obvious. In my experience writing operational tooling, that kind of redundancy is how I catch mistakes before they become narratives.&lt;/p&gt;

&lt;p&gt;I also think visualization discipline matters when writing publicly. A demo can look convincing because the author cherry-picked queries. A batch section with a chart is still cherry-picked, but it is harder to hide systemic skew without looking inconsistent. I am not claiming purity. I am claiming a slightly higher bar than a single happy-path screenshot.&lt;/p&gt;
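&lt;p&gt;For readers who want to reproduce the habit, the chart is only a few lines of matplotlib. The picks list below is hypothetical, and I write to a flat filename rather than the repository's &lt;code&gt;output/domain_distribution.png&lt;/code&gt; path so the sketch runs without creating directories:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# hypothetical primary-domain picks from a six-question batch
picks = ["returns", "returns", "care", "service", "returns", "care"]
counts = Counter(picks)

plt.figure(figsize=(5, 3))
plt.bar(list(counts.keys()), list(counts.values()))
plt.title("Primary domain picks across the batch")
plt.ylabel("count")
plt.tight_layout()
plt.savefig("domain_distribution.png")
&lt;/code&gt;&lt;/pre&gt;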

&lt;h2&gt;
  
  
  How this relates to my broader experiments with context assembly
&lt;/h2&gt;

&lt;p&gt;Across several personal PoCs, I keep returning to the same lesson: the model is only as grounded as the evidence you hand it. Routing is one way to ground. Chunking is another. Metadata filters are another. In this retail framing, routing is the headline because it is the piece that most directly mirrors how I think a careful associate works. I picture someone mentally classifying the question before reaching for a binder or a search box. The code is a crude metaphor for that mental step.&lt;/p&gt;

&lt;p&gt;I also think solo experiments have a hidden advantage. There is no committee to smooth the awkward edges. If a design choice is hard to explain, I feel it immediately because I have to write the article myself. That pressure improves clarity even when it does not improve novelty. In my opinion, many good engineering blogs come from that kind of forced explanation rather than from raw brilliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data hygiene notes for anyone who forks the repository
&lt;/h2&gt;

&lt;p&gt;If you fork this project and replace the synthetic corpus with your own text, start with document boundaries. Decide what constitutes a chunk. Decide whether headings belong in the chunk or in metadata. Decide whether you need stable identifiers for compliance reasons. I used short stable identifiers like &lt;code&gt;ret_001&lt;/code&gt; because they read well in logs. In a real setting, you might need provenance fields and timestamps. None of that appears here because I am not simulating compliance tooling.&lt;/p&gt;

&lt;p&gt;I would also caution against mixing marketing language with policy language in the same chunk unless you intend to. Marketing copy often uses emotional words that pollute retrieval for operational questions. In my synthetic set, I tried to keep tone dry and procedural. That choice is artificial, but it is intentionally artificial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and privacy framing
&lt;/h2&gt;

&lt;p&gt;This PoC does not store customer data, payment data, or loyalty identifiers. I mention loyalty only inside synthetic policy text as a generic procedural note. If you adapt the idea to real systems, you should treat evidence logs as sensitive depending on your environment. Retrieval indexes can leak information through side channels if someone can probe them repeatedly. I am not providing a threat model here. I am naming the concern because responsible write-ups should name concerns even when the demo is synthetic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Longer reflection on agentic retrieval as a phrase
&lt;/h2&gt;

&lt;p&gt;Language shifts quickly in this field. I use “agentic” because the orchestrator makes a conditional decision that changes which retrievals run. That is a narrow meaning. I am not claiming autonomous agency, persistent memory, or tool use beyond vector retrieval. If the word feels too flashy for your taste, you can substitute “conditional multi-retrieval” and the code still reads the same.&lt;/p&gt;

&lt;p&gt;From my perspective, the value of the word is that it signals intent to practitioners who are comparing patterns. If you are building a catalog of approaches, you want names that map to behaviors. The behavior here is: measure confidence, branch, merge evidence. That is enough to distinguish the flow from a single-query single-index baseline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expanded discussion of matplotlib output and how I read it
&lt;/h2&gt;

&lt;p&gt;The chart file is written to &lt;code&gt;output/domain_distribution.png&lt;/code&gt;. When I open it, I look for dominance first. If one bar towers over the others, I ask whether the batch questions were written to favor that domain accidentally. Then I look at the absolute counts. With six questions, ties and singletons are common. That is fine for a story, but it would be insufficient for a statistical claim. I treat the chart as a sanity check, not as proof of generalization.&lt;/p&gt;

&lt;p&gt;I also think about color and accessibility. The default color cycle in matplotlib is familiar to many readers, but it is not perfect for every color vision profile. If I extended this project with a public UI, I would revisit palette choices. For a static PNG in a repository, I kept the defaults to reduce dependencies and keep the focus on structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed between iterations while writing this article
&lt;/h2&gt;

&lt;p&gt;Early drafts used only cosine similarity without lexical hints. The routing looked mathematically pure but sometimes felt silly in plain language. I added small boosts because I wanted the demo to track common sense when the corpus is tiny. Some readers will dislike that because it introduces hand-tuned rules. I accept the criticism. I would rather show a transparent hand-tuned rule than hide the same bias inside an unlabeled embedding space.&lt;/p&gt;

&lt;p&gt;I also adjusted the secondary threshold upward once I saw how often the blend triggered. The goal was to make blending meaningful rather than routine. If blending happens on every query, it stops being a guardrail and becomes a second retrieval path you always pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I learned about thresholds
&lt;/h3&gt;

&lt;p&gt;I spent more time than I expected tuning the secondary threshold and lexical boosts. That is typical for small corpora. When you only have a few documents per domain, cosine similarity can swing based on a single shared word. I do not consider that a flaw in cosine similarity. I consider it a reminder that the corpus is the real product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge cases I still think about
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Multi-intent questions that require two different operational actions, not just two evidence bundles.&lt;/li&gt;
&lt;li&gt;Questions that reference SKU numbers or store-specific hours that are not in the synthetic corpus.&lt;/li&gt;
&lt;li&gt;Situations where policy language conflicts across documents, which this PoC does not attempt to resolve.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Ethics and responsible framing
&lt;/h3&gt;

&lt;p&gt;I believe any assistant that touches customer-facing work should default to transparency. That means citing sources, showing scores, and making it obvious when the system is uncertain. I did not build a customer-facing UI here, but I did build the kind of evidence rows I would want to see before trusting a draft answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roadmap if this stays a hobby experiment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Replace TF-IDF with BM25 or embeddings when the corpus grows.&lt;/li&gt;
&lt;li&gt;Add evaluation harnesses with labeled queries rather than eyeballing batch tables.&lt;/li&gt;
&lt;li&gt;Add structured logging to JSON for offline analysis.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  A few more words on reproducibility and environment capture
&lt;/h3&gt;

&lt;p&gt;Whenever I publish a PoC, I ask myself whether someone can reproduce the same numbers on their machine. For this project, the deterministic parts are the TF-IDF fit and the query order. The parts that can drift are library versions and floating point noise. I pinned dependencies loosely in &lt;code&gt;requirements.txt&lt;/code&gt;, specifying minimum versions rather than exact hashes, because this is not a safety-critical artifact. If I needed bitwise reproducibility, I would pin exact versions and record a seed wherever randomness appears. Randomness does not play a role in the current retrieval path, which keeps the story simpler.&lt;/p&gt;

&lt;p&gt;I also think about documentation as part of reproducibility. A repository without a clear run command is a puzzle, not an experiment. That is why I kept &lt;code&gt;main.py&lt;/code&gt; as a single entry point and why I describe the output paths explicitly. From my experience, the fastest way to lose a reader is to hide the command they should run after cloning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Narrative closure without overstating the result
&lt;/h3&gt;

&lt;p&gt;I want to end the technical portion with a calm statement of scope. This PoC demonstrates routing and retrieval assembly. It does not demonstrate customer satisfaction. It does not demonstrate associate productivity. It does not demonstrate compliance alignment. Those outcomes require measurements I did not perform. I am naming the gap on purpose because overstated claims are how experimental articles age poorly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Repository link
&lt;/h3&gt;

&lt;p&gt;Public code for this experiment: &lt;code&gt;https://github.com/aniket-work/RetailFloor-AgenticRouter-AI&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I keep this repository separate from my publishing scripts so the public repo only contains the PoC implementation, diagrams, and images. If you clone it, you will not find my article drafts or automation utilities mixed into the same tree, because I want the repository to stay a clean reference implementation for the idea itself.&lt;/p&gt;

&lt;p&gt;Disclaimer&lt;/p&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: python, retail, machinelearning, agents&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>retail</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>Layered Context Routing for Campus Operations: A Facilities Intake PoC</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Fri, 03 Apr 2026 03:19:04 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/layered-context-routing-for-campus-operations-a-facilities-intake-poc-41b2</link>
      <guid>https://forem.com/exploredataaiml/layered-context-routing-for-campus-operations-a-facilities-intake-poc-41b2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsi2dgt9viyyrw3oa6is.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsi2dgt9viyyrw3oa6is.gif" alt="Cover animation" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I Stacked Policy, Place, and Urgency Signals to Route Maintenance Requests&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c4m68pixj55r5i00d7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c4m68pixj55r5i00d7x.png" alt="Layered context concept" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This write-up describes a personal experiment where I treat campus facilities intake as a context engineering problem rather than a single prompt. I combine TF-IDF retrieval over a small policy corpus with building metadata and lightweight urgency hints parsed from free text, then route tickets with explicit rules so every decision stays inspectable. The code lives in a public repository I published for learning purposes, and nothing here should be read as production guidance for a real university or as anything connected to an employer. From my perspective, the lesson worth sharing is that when operational language is messy, stacking context in named layers makes debugging and iteration far easier than stuffing everything into one opaque blob.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I have spent a fair amount of time thinking about how large language models behave when the input is short, ambiguous, and emotionally loaded. Facilities tickets are a good toy domain for that reason. A message might mention a fume hood, a basketball practice schedule, and a broken card reader in adjacent sentences. If you send that straight into a generic completion call, you can get fluent text that is wrong in subtle ways. In my experience, the failure mode is rarely “the model cannot write sentences.” It is usually “the model does not know which institutional rule actually applies,” or “the model over-trusts the most recent sentence.”&lt;/p&gt;

&lt;p&gt;In my experiments, I wanted a system that still fits on a laptop, does not require a proprietary dataset, and makes the reasoning chain visible in ordinary logs. I chose a campus operations framing because it forces a blend of safety language, building-specific nuance, and time-of-day common sense without touching regulated domains I am intentionally avoiding in this series. The repository is a solo sketch, not a deployed service, and I refer to it throughout as a proof of concept.&lt;/p&gt;

&lt;p&gt;There is another motivation I should state plainly. I am interested in practices that survive contact with maintenance engineers, students, and staff who do not care about the underlying ML buzzwords. People submit tickets under stress. They shorten building names, omit room numbers, and reference “that hallway near the lab” without GPS coordinates. Any system that pretends the text is already structured is going to fail in ways that look embarrassing on a demo but painful in real life. I did not solve that fully here; I only created a place to talk about it honestly while still writing code.&lt;/p&gt;

&lt;p&gt;I also want readers to know the scope boundary I used while writing. This article discusses a synthetic dataset and illustrative SLAs. It does not describe any real institution’s priorities, staffing model, or vendor contracts. If a phrase resembles language you have seen in the wild, that is because operational writing converges on similar vocabulary, not because I copied private material.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;The article walks through the design of CampusContextRouter-AI, a Python project that ingests synthetic maintenance-style requests, retrieves relevant policy snippets, attaches place context from a JSON registry, derives urgency signals from the wording, and emits a route bucket with a priority band and a notional SLA window. I wrote it this way because I wanted to mirror how a human dispatcher glances at policy, then place, then severity, before choosing a queue.&lt;/p&gt;

&lt;p&gt;You will see how I separate retrieval from routing, why I kept the router deterministic in this iteration, and how I generate both a Rich table for the terminal and a simple matplotlib chart so a batch run has a visual artifact. I also discuss limitations honestly: tiny corpora, linear scoring, and heuristic SLAs are not the same as a live work-order system.&lt;/p&gt;

&lt;p&gt;If you are wondering what “context engineering” means in concrete terms here, my working definition is simple: decide what information belongs together, decide what must never be mixed, and serialize the result in a predictable shape. Retrieval produces evidence. Place metadata grounds the evidence. Session signals modulate urgency. Routing consumes all three without collapsing them into an undifferentiated string. That definition may differ from how other authors use the phrase, and that is fine; the implementation is the ground truth for this PoC.&lt;/p&gt;

&lt;p&gt;You should also expect commentary on failure modes. A demo that only shows happy paths is a brochure, not engineering writing. I call out retrieval sparsity, policy conflict, and the limits of regex urgency. I discuss what I would measure next if this stayed a hobby project for more than a few weekends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;The implementation is intentionally boring in a good way. I rely on Python 3.10 or newer, NumPy, scikit-learn for TF-IDF and cosine similarity, matplotlib for a bar chart, and Rich for readable terminal output. There is no hosted vector database and no cloud requirement; the entire index fits in memory.&lt;/p&gt;

&lt;p&gt;From where I stand, that stack is enough to demonstrate the idea that “context engineering” can be practiced with classic IR tooling when your corpus is small and your goal is structured assembly rather than open-ended generation. If I later swap TF-IDF for embeddings, the layered interfaces remain stable, which was a design goal while I sketched the modules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vnm04k2ggmc1fbkgw85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vnm04k2ggmc1fbkgw85.png" alt="System architecture" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you are evaluating how to structure prompts or pre-model logic for operational chatbots, this article offers a concrete pattern: treat context as composable blocks with clear boundaries. If you are learning scikit-learn’s text pipelines, the retrieval module is short and testable. If you care about reproducibility, the deterministic router gives you a baseline against which any future learned model can be compared.&lt;/p&gt;

&lt;p&gt;I think the read is most useful for practitioners who want a middle ground between “pure LLM” and “pure rules,” because the code shows exactly where those worlds meet in my PoC.&lt;/p&gt;

&lt;p&gt;There is also a pedagogical angle I care about. Many tutorials jump straight to embeddings and vector databases without establishing why lexical baselines still matter. I am not anti-embedding; I use them elsewhere. But I believe beginners should see cosine similarity on explicit vectors at least once, because it demystifies what “nearest neighbor” means in code rather than in marketing language.&lt;/p&gt;

&lt;p&gt;Finally, if you maintain open-source examples, you know the burden of dependencies. I kept the stack small so a reader in a constrained environment can still run the demo. That constraint shaped decisions as much as any architectural principle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Framing the problem without overfitting the story
&lt;/h3&gt;

&lt;p&gt;Before touching code, I spent time writing short synthetic tickets on paper. I noticed recurring patterns: some messages emphasize harm or hazard words early, others bury the actionable detail in the second half, and a few mix multiple issues that would normally be split in a mature work-order system. I did not try to solve splitting in this repository. Instead, I focused on a single-text input so the context layers stay easy to reason about. That choice trades realism for clarity, and I am comfortable stating that upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why layers instead of one concatenated prompt
&lt;/h3&gt;

&lt;p&gt;The design starts from a simple observation I kept returning to while prototyping: policy text is not interchangeable with place metadata. Policies answer what must happen in general. Place metadata answers where the work lives and what constraints repeat for that site. Session signals answer how hot the ticket sounds and whether the clock matters. When I mixed those prematurely, I got tangled prompts. When I separated them, I could log each layer independently.&lt;/p&gt;

&lt;p&gt;The retrieval layer reads &lt;code&gt;data/policies.json&lt;/code&gt;. Each record is a chunk with an identifier, a topic tag, and prose. The TF-IDF vectorizer uses English stop words and unigrams plus bigrams to catch phrases like “fume hood” that unigrams alone might dilute. For each ticket, I take the top few chunks by cosine similarity and format them as a bullet list with scores.&lt;/p&gt;
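&lt;p&gt;A minimal sketch of that vectorizer configuration, with toy chunks standing in for the real &lt;code&gt;data/policies.json&lt;/code&gt; records, shows why bigrams matter for phrases like “fume hood”:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy chunks standing in for data/policies.json records
chunks = [
    "Fume hood airflow must be verified before lab work resumes.",
    "Classroom projector lamps are replaced by the AV support team.",
]
vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
matrix = vec.fit_transform(chunks)

qv = vec.transform(["fume hood stopped working"])
sims = cosine_similarity(qv, matrix).ravel()
# the "fume hood" bigram survives as a feature, so the lab-safety chunk ranks first
print(int(sims.argmax()))  # 0
&lt;/code&gt;&lt;/pre&gt;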

&lt;p&gt;The place layer reads &lt;code&gt;data/buildings.json&lt;/code&gt;. Each building has a code, human-readable name, zone label, hours profile, and short risk notes. I do not attempt geospatial reasoning in this PoC; the point is to show how a second JSON source can be merged without contaminating the policy text.&lt;/p&gt;

&lt;p&gt;The signal layer currently combines a local hour and weekday flag with an urgency score derived from regular-expression keyword groups on the ticket text. The score is deliberately primitive. In a later experiment I might replace it with a lightweight classifier, but I wanted something explainable first.&lt;/p&gt;
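&lt;p&gt;Here is a sketch of what such keyword groups can look like. The patterns and weights are my illustration, not the repository's exact tables:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# illustrative keyword groups; the repository's exact patterns may differ
URGENCY_PATTERNS = {
    3: re.compile(r"\b(gas leak|sparking|smoke|flooding)\b", re.IGNORECASE),
    2: re.compile(r"\b(no heat|no power|leaking|broken glass)\b", re.IGNORECASE),
    1: re.compile(r"\b(flickering|squeaking|stained)\b", re.IGNORECASE),
}

def urgency_score(text):
    """Return the weight of the most severe keyword group that matches."""
    for weight in sorted(URGENCY_PATTERNS, reverse=True):
        if URGENCY_PATTERNS[weight].search(text):
            return weight
    return 0

print(urgency_score("Smoke near the fume hood in the chemistry wing"))  # 3
print(urgency_score("Hallway light is flickering again"))               # 1
&lt;/code&gt;&lt;/pre&gt;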

&lt;p&gt;Routing maps the assembled layers to an enumerated bucket such as laboratory safety, classroom AV, grounds, HVAC, or a general bucket. Priorities and SLA hours are assigned with transparent rules that look at both the keyword path and the urgency score. That logic lives entirely in Python so I can unit test it without GPU dependencies.&lt;/p&gt;
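&lt;p&gt;The rule shape can be sketched in a few lines. The bucket names, priority bands, and SLA hours below are illustrative placeholders, not the repository's actual table:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;def route(bucket_scores, urgency):
    """Map the best-matching bucket plus an urgency score to a priority band
    and a notional SLA window. Buckets, bands, and hours are illustrative."""
    bucket = max(bucket_scores, key=bucket_scores.get) if bucket_scores else "general"
    if bucket == "laboratory_safety" or urgency &amp;gt;= 3:
        return {"bucket": bucket, "priority": "P1", "sla_hours": 4}
    if urgency == 2:
        return {"bucket": bucket, "priority": "P2", "sla_hours": 24}
    return {"bucket": bucket, "priority": "P3", "sla_hours": 72}

print(route({"hvac": 0.41, "grounds": 0.12}, urgency=2))
# {'bucket': 'hvac', 'priority': 'P2', 'sla_hours': 24}
&lt;/code&gt;&lt;/pre&gt;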

&lt;p&gt;Architecturally, the flow is linear: load JSON, fit the TF-IDF index once, iterate demo tickets, assemble layers per ticket, call the router, collect rows, render a Rich table, and plot bucket counts. The diagrams in the repository restate the same story visually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyquhmpkhajoy2ct7x76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyquhmpkhajoy2ct7x76.png" alt="Runtime sequence" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval choices and what I rejected
&lt;/h3&gt;

&lt;p&gt;I considered a few alternatives before settling on TF-IDF for the first public cut. A dense embedding model would likely rank semantically related chunks more robustly, but it would also introduce versioning questions, dependency weight, and reproducibility concerns for readers who just want to clone and run. I decided that demonstrating clean interfaces mattered more than squeezing extra retrieval quality from a miniature corpus. In my opinion, that is a trade only the author can judge; for teaching purposes, I wanted the smallest artifact that still supports cosine similarity and top-k inspection.&lt;/p&gt;

&lt;p&gt;I also thought about BM25. It is a strong baseline for lexical tasks and behaves well on short documents. I stayed with TF-IDF largely because the scikit-learn pipeline is familiar to many readers and the difference between BM25 and TF-IDF on eight short policies is unlikely to change the story materially. If I expand the corpus by an order of magnitude, BM25 or a hybrid approach becomes more compelling.&lt;/p&gt;
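&lt;p&gt;For readers curious what the rejected alternative would involve, Okapi BM25 fits in a few dozen lines of pure Python; the tokenizer and the &lt;code&gt;k1&lt;/code&gt;/&lt;code&gt;b&lt;/code&gt; defaults below are textbook values, not anything from the repository.&lt;/p&gt;

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Minimal Okapi BM25: score each document against the query."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df: Counter[str] = Counter()          # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "Fume hood maintenance requires an EHS work order.",
    "Classroom projector lamps are replaced by the AV team.",
]
print(bm25_scores("fume hood odor", docs))
```

On a corpus of eight short policies this and TF-IDF would almost certainly agree on the top hit, which is the point of the paragraph above.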

&lt;h3&gt;
  
  
  Urgency scoring as a deliberately imperfect heuristic
&lt;/h3&gt;

&lt;p&gt;The urgency score is built from weighted regular expressions. That looks naive, and it is naive. I still found value in it because it forces me to name the cues I care about: leaks, odors, elevators, HVAC loss, outdoor lighting, and a handful of AV terms. Each cue adds a partial weight capped at one. The cap matters; without it, a long message with many benign keywords could look hotter than a short emergency note.&lt;/p&gt;

&lt;p&gt;When I tested early versions, I saw false positives where “water” appeared in a benign sentence. I tightened patterns to word boundaries and preferred compound cues. This is not a claim that regex is sufficient in production. It is a claim that explainable baselines are useful when you compare future models against something you can read in a single screen.&lt;/p&gt;
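&lt;p&gt;A hypothetical scorer showing the three ideas above together: word-boundary patterns, compound cues instead of bare keywords, and a hard cap. The cue list and weights are mine for illustration, not the repository's actual groups.&lt;/p&gt;

```python
import re

# Illustrative cue groups; patterns and weights are assumptions, not the repo's list.
URGENCY_CUES: list[tuple[re.Pattern[str], float]] = [
    (re.compile(r"\b(gas|chemical)\s+(leak|odor)\b", re.I), 0.6),
    (re.compile(r"\belevator\b.*\b(stuck|trapped)\b", re.I), 0.5),
    (re.compile(r"\bno\s+(heat|cooling|hvac)\b", re.I), 0.4),
    (re.compile(r"\bwater\s+(leak|damage)\b", re.I), 0.4),
]

def urgency_score(text: str) -> float:
    """Sum matched cue weights, capped at 1.0 so a long keyword-heavy message
    cannot out-shout a short emergency note."""
    total = sum(weight for pattern, weight in URGENCY_CUES if pattern.search(text))
    return min(total, 1.0)

print(urgency_score("Strong chemical odor near fume hood; water leak on floor 2"))  # 1.0
print(urgency_score("Please refill the water cooler when convenient"))              # 0.0
```

Note that the benign "water cooler" sentence scores zero because the compound cue demands "water leak" or "water damage", which is exactly the false positive the tightened patterns were meant to kill.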

&lt;h3&gt;
  
  
  Observability as a first-class requirement
&lt;/h3&gt;

&lt;p&gt;I log the layered context for the first ticket in every run not because the first ticket is special, but because it proves the pipeline without drowning the reader in repetition. In a longer study I would probably log structured JSON for every ticket and ship it to a file, but the PoC keeps stdout readable.&lt;/p&gt;
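&lt;p&gt;The structured-JSON variant mentioned above could look roughly like this; &lt;code&gt;log_layers&lt;/code&gt; and the JSON Lines layout are hypothetical, not part of the PoC.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

def log_layers(path: Path, ticket_id: str, layers: dict[str, str]) -> None:
    """Append one JSON object per ticket (JSON Lines), so every run is greppable."""
    record = {"ticket_id": ticket_id, **layers}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "context_log.jsonl"
    log_layers(log_path, "T-1001", {"policy_block": "...", "signal_block": "..."})
    first = json.loads(log_path.read_text().splitlines()[0])
    print(first["ticket_id"])  # T-1001
```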

&lt;p&gt;The matplotlib chart is part of the same philosophy. A batch table tells you what happened row by row; a distribution tells you whether the demo batch skewed toward one bucket. In my experiments, skew often revealed mistakes in keyword priorities rather than retrieval mistakes, which surprised me at first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jbbv1lohqbe4wxqqf6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jbbv1lohqbe4wxqqf6s.png" alt="End-to-end workflow" width="492" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Get Cooking
&lt;/h2&gt;

&lt;p&gt;The entry point is &lt;code&gt;main.py&lt;/code&gt;. It keeps the demo batch in one helper so the narrative stays obvious when someone reads top to bottom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_demo_tickets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;(id, building_code, text, hour_local, weekday)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T-1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SCI-E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Strong chemical odor near fume hood B2; two students reported eye irritation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... additional synthetic tickets ...
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It defines the synthetic workload as tuples that include a ticket identifier, a building code that keys into &lt;code&gt;buildings.json&lt;/code&gt;, the free-text body, and a synthetic clock (a local hour plus a weekday flag). I structured it this way because separating clock information from the text lets me test urgency scoring and time-aware policies independently without fabricating timestamps inside the prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I wrote it this way:&lt;/strong&gt; Early on, I inlined timestamps as strings inside the ticket text and immediately regretted it. Parsing times from natural language is a separate project. For this PoC, explicit fields keep runs reproducible.&lt;/p&gt;

&lt;p&gt;The layered assembly happens through &lt;code&gt;assemble_layers&lt;/code&gt; in &lt;code&gt;context_layers.py&lt;/code&gt;. The function pulls retrieval results, formats three blocks, and returns a &lt;code&gt;LayeredContext&lt;/code&gt; dataclass.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assemble_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;building_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SessionSignals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PolicyIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;buildings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BuildingContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LayeredContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;retrieved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;policy_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_format_policy_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;policy_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POLICY SNIPPETS: (no retrieval match; use baseline routing rules)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buildings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;building_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;building_code&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;place_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_format_place_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;signal_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_format_signal_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LayeredContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;policy_block&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;policy_block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;place_block&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;place_block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;signal_block&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;signal_block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It centralizes formatting so the router always sees the same headings for each layer. Empty retrieval is handled with an explicit fallback string rather than failing silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; In my opinion, the hardest part of small retrieval systems is debugging silent degradation. If nothing matches, I want that fact visible in the console output.&lt;/p&gt;

&lt;p&gt;Retrieval itself is a thin wrapper around scikit-learn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PolicyIndex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;In-memory TF-IDF index over policy text.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PolicyChunk&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
        &lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vectorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;lowercase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ngram_range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;min_df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vectorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RetrievedChunk&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vectorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;sims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_matrix&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sims&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RetrievedChunk&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RetrievedChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It builds a matrix once and scores incoming ticket text as another vector. Cosine similarity ranks chunks, and I discard zero scores to avoid clutter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; On a toy corpus, bigrams matter. Without them, “fume hood” sometimes loses to generic maintenance words. I kept the corpus tiny on purpose to force myself to think about chunk wording.&lt;/p&gt;

&lt;p&gt;The router combines keyword cues, the top retrieved topic, and session urgency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decide_route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;free_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;layered&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LayeredContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;building&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BuildingContext&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SessionSignals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_topic_from_retrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;keyword_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_adjust_bucket_from_keywords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;free_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RouteBucket&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword_bucket&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keyword_bucket&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laboratory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RouteBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SAFETY_EHS&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classroom_av&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RouteBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AV&lt;/span&gt;
    &lt;span class="c1"&gt;# ... additional topic mappings ...
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RouteBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GENERAL&lt;/span&gt;

    &lt;span class="n"&gt;urgency_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urgency_hint_score&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;sla&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;48.0&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RouteBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SAFETY_EHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RouteBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PLUMBING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;urgency_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;sla&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RouteBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCESS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;sla&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;8.0&lt;/span&gt;
    &lt;span class="c1"&gt;# ... additional escalations ...
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sla_hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sla&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rationale_parts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It makes the decision path explicit. Keyword overrides fire first because certain phrases imply a channel regardless of retrieval noise. Topic labels from the best chunk act as a secondary signal. SLA tightening uses both bucket membership and the urgency score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I put it this way:&lt;/strong&gt; I needed a single function I could read during demos without opening a notebook. The rationale string is there so future me remembers why a ticket landed where it did.&lt;/p&gt;

&lt;p&gt;Finally, plotting is one function that turns bucket names into counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_bucket_distribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#2c5282&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Synthetic routing batch: bucket counts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tickets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tick_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It produces a basic bar chart so the batch run yields a visual artifact, not just terminal text. For the animated cover asset, I used that chart as the UI half of the GIF, after the terminal sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository link:&lt;/strong&gt; The full project, including diagrams and the terminal animation asset, is available at &lt;a href="https://github.com/aniket-work/CampusContextRouter-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/CampusContextRouter-AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reporting code stays intentionally thin: tables for people, files for artifacts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_routing_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Campus facilities intake (batch routing)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cyan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;no_wrap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Building&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;magenta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Route&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;green&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;justify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SLA h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;justify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ticket_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;building&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sla&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;building&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sla&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It renders aligned columns with consistent headers so a batch run looks like a dispatch screen rather than a raw log dump.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; In my opinion, presentation quality changes how seriously I take my own outputs during development. If the table looks sloppy, I assume the logic is sloppy.&lt;/p&gt;

&lt;p&gt;The urgency helper is small but central to how priorities tighten.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_URGENCY_WORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(leak|flooding|flood|spill|water)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(smoke|fire|odor|fume|chemical)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(elevator|stuck|door won\'t open|door wont open)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(no heat|no ac|freezing|overheat)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(projector|microphone|av|audio|display)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(light|outage|dark|walkway)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_urgency_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_URGENCY_WORDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It scans for cues that should raise urgency regardless of which policy chunk wins retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Weight tuning is subjective. I chose weights that made the science-lab odor scenario land in a high band without pushing every AV ticket into emergency territory.&lt;/p&gt;
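&lt;p&gt;To make the band behavior concrete, here is a minimal, self-contained sketch that mirrors three of the patterns above. The weights are the same illustrative values used in the PoC, not calibrated constants:&lt;/p&gt;

```python
import re

# Mirror of three of the weighted cue patterns; illustrative weights only.
URGENCY_WORDS = [
    (r"\b(smoke|fire|odor|fume|chemical)\b", 0.45),
    (r"\b(leak|flooding|flood|spill|water)\b", 0.35),
    (r"\b(projector|microphone|av|audio|display)\b", 0.15),
]

def urgency_hint(text: str) -> float:
    """Sum the weights of matched cue patterns, capped at 1.0."""
    t = text.lower()
    return min(1.0, sum(w for rx, w in URGENCY_WORDS if re.search(rx, t)))

print(urgency_hint("Chemical odor and water on the lab floor"))  # high band
print(urgency_hint("Projector flickering in lecture hall"))      # low band
```

&lt;p&gt;The odor-plus-water ticket lands near 0.8 while the AV ticket stays at 0.15, which is exactly the separation the weights were tuned to produce.&lt;/p&gt;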

&lt;h3&gt;
  
  
  How this differs from “just prompt better”
&lt;/h3&gt;

&lt;p&gt;It is tempting to believe a single system message can replace structured preprocessing. Sometimes that works for short tasks. For operational intake, my experience has been that models benefit from retrieval that is inspectable outside the model. I am not arguing against LLMs; I am arguing that the PoC should show where the boundaries belong. If the retrieval list is wrong, I can fix the corpus or the vectorizer without touching the router. If the router rules are wrong, I can adjust routing without touching retrieval. That separation of concerns saved me time during debugging.&lt;/p&gt;
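&lt;p&gt;That boundary can be sketched as two functions sharing only a small result type. Everything below is a hypothetical stand-in, with a toy word-overlap scorer in place of TF-IDF; the point is the interface, not the scoring:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical boundary type: the router only ever sees ranked
# (chunk_id, score) pairs, never raw policy text or vectorizer internals.
@dataclass
class RetrievalResult:
    chunk_id: str
    score: float

# Toy corpus and scorer standing in for the TF-IDF layer; swap either
# without touching route().
CORPUS = {
    "policy-odor": "chemical odor fume spill response",
    "policy-av": "projector audio display microphone",
}

def retrieve(ticket_text: str) -> list[RetrievalResult]:
    words = set(ticket_text.lower().split())
    results = [
        RetrievalResult(cid, len(words & set(text.split())) / max(len(words), 1))
        for cid, text in CORPUS.items()
    ]
    return sorted(results, key=lambda r: r.score, reverse=True)

def route(results: list[RetrievalResult]) -> str:
    # Rule layer: depends only on the ranked IDs, so routing rules can
    # change without touching retrieval, and vice versa.
    if results and results[0].chunk_id == "policy-odor" and results[0].score > 0:
        return "ehs-dispatch"
    return "general-queue"

print(route(retrieve("strange chemical odor in lab")))  # ehs-dispatch
```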

&lt;h3&gt;
  
  
  What a language model could do in a later iteration
&lt;/h3&gt;

&lt;p&gt;If I add a model, I would keep it as a rewriter or validator, not as the sole authority. A plausible pattern is: assemble layers exactly as today, ask the model to propose a bucket and rationale, then compare against deterministic rules. Disagreements become training data or prompts for refinement. I have not implemented that here because I wanted the repository to remain runnable without API keys, but the layering is compatible with that roadmap.&lt;/p&gt;
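&lt;p&gt;The compare-and-log pattern might look roughly like this. The ticket shape, &lt;code&gt;deterministic_route&lt;/code&gt;, and the model bucket are all hypothetical stand-ins, and no real model call is made:&lt;/p&gt;

```python
import json

def deterministic_route(ticket: dict) -> str:
    # Stand-in for the existing rule layer (not the repository's real rules).
    return "safety" if "odor" in ticket["text"].lower() else "facilities"

def reconcile(ticket: dict, model_bucket: str, log: list) -> str:
    """Rules stay authoritative; disagreements become review items."""
    rule_bucket = deterministic_route(ticket)
    if model_bucket != rule_bucket:
        log.append({"ticket": ticket["id"], "model": model_bucket, "rules": rule_bucket})
    return rule_bucket

disagreements: list = []
final = reconcile({"id": "T-1", "text": "Odor in stairwell"}, "facilities", disagreements)
print(final, json.dumps(disagreements))
```

&lt;p&gt;The logged disagreements are exactly the cases I would later turn into training data or prompt refinements.&lt;/p&gt;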

&lt;h3&gt;
  
  
  Performance characteristics I measured informally
&lt;/h3&gt;

&lt;p&gt;This is not a benchmark article, but I did sanity-check runtime on a laptop. Fitting the TF-IDF matrix on eight policies is effectively instantaneous. Routing five tickets is trivial. Matplotlib dominates wall time relative to retrieval, which reinforces that the PoC is not CPU-bound. If I scaled to thousands of policies, I would need a more serious index and probably batch vectorization, but that is not the bottleneck today.&lt;/p&gt;
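&lt;p&gt;If you want to repeat the informal check, a &lt;code&gt;perf_counter&lt;/code&gt; harness over a toy stand-in for the fit step is enough. The corpus and fit function here are fabricated for illustration, not the repository's code:&lt;/p&gt;

```python
import time
from collections import Counter

# Eight synthetic "policies" standing in for the real corpus.
POLICIES = [f"policy {i} covering odor response leaks and after hours access" for i in range(8)]

def fit_term_stats(corpus: list) -> list:
    # Toy stand-in for the TF-IDF fit: per-document term counts.
    return [Counter(doc.lower().split()) for doc in corpus]

start = time.perf_counter()
stats = fit_term_stats(POLICIES)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"fit over {len(stats)} docs: {elapsed_ms:.3f} ms")
```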

&lt;h2&gt;
  
  
  Let's Set Up
&lt;/h2&gt;

&lt;p&gt;Step-by-step details can be found in the repository README. At a high level, the setup I used while iterating locally follows a predictable pattern.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an isolated virtual environment in the project directory so dependencies never leak across unrelated experiments.&lt;/li&gt;
&lt;li&gt;Install requirements from &lt;code&gt;requirements.txt&lt;/code&gt; exactly as pinned there to avoid surprise upgrades to scikit-learn behavior.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;python main.py&lt;/code&gt; with the bundled demo batch to confirm retrieval, routing, and chart generation all succeed on your machine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you clone the repository, you will notice there is no &lt;code&gt;.env&lt;/code&gt; requirement for the baseline demo. I kept secrets out of the PoC on purpose so CI or readers can execute it without API keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When I run &lt;code&gt;python main.py&lt;/code&gt;, the program prints the full layered context for the first ticket, then prints the batch table, then writes &lt;code&gt;output/routing_bucket_distribution.png&lt;/code&gt;. That order is intentional: the first block proves the retrieval and formatting pipeline, the table proves routing consistency, and the chart proves that visualization hooks stay wired.&lt;/p&gt;

&lt;p&gt;In my observation, the most interesting console output is the policy snippet list with scores. Even with eight policies, you can watch how sensitive the ranking is to words like “odor” versus “noise.” That sensitivity informed how I wrote the synthetic tickets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Cases I Thought About
&lt;/h2&gt;

&lt;p&gt;No PoC is complete without acknowledging where it would break. These points are worth spelling out because they shaped what I did not attempt.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sparse retrieval:&lt;/strong&gt; If the ticket uses slang that never appears in the policy corpus, TF-IDF may return low scores across the board. The router still runs, but the topic signal becomes weak. A mitigation I considered is hybrid retrieval with a keyword inverted index, which would be a natural extension.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conflicting policies:&lt;/strong&gt; Real campuses can have overlapping rules. I store independent chunks and do not yet model precedence. In a future iteration, explicit precedence edges between chunk IDs would be cleaner than hoping retrieval ranks resolve conflicts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time semantics:&lt;/strong&gt; I pass hour and weekday as integers rather than parsing from text. That avoids accidental contradictions between embedded timestamps and structured fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Equity and access:&lt;/strong&gt; Routing touches accessibility topics. I include them because ignoring them would be unrealistic, but the SLA numbers are illustrative only. A production system would need institutional review, not a hobby script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Duplicate submissions:&lt;/strong&gt; Real users open multiple tickets for the same incident. I do not deduplicate or thread conversations in this repository. A deduplication layer would likely sit upstream of retrieval, comparing embeddings of entire messages and linking to an incident ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Seasonality:&lt;/strong&gt; A field house ticket in January is not the same as one in August. My building metadata includes a seasonal hours profile label, but I do not dynamically adjust SLAs by season. Extending the signal layer with calendar metadata would be straightforward, but it would also require more realistic data than I wanted to maintain for a hobby repo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Language and tone:&lt;/strong&gt; The PoC assumes English prose of moderate formality. Multilingual campuses would need tokenization choices and policy corpora per language. I did not attempt multilingual retrieval because verifying quality would require skills and resources outside the scope of a solo weekend project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Malicious input:&lt;/strong&gt; Free-text fields can be abused. I do not implement content filtering here. If this were more than a local script, I would add rate limits, length limits, and basic abuse detection before any retrieval occurs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
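&lt;p&gt;The hybrid mitigation from the first point could be sketched as blending a vector score with exact-keyword recall from an inverted index. Everything below is a hypothetical toy, not repository code:&lt;/p&gt;

```python
from collections import defaultdict

DOCS = {
    "quiet-hours": "noise complaints and quiet hours in residence halls",
    "hazard": "chemical odor fume spill response procedure",
}

# Keyword inverted index: exact-term lookup that backstops weak vector scores.
INDEX = defaultdict(set)
for doc_id, text in DOCS.items():
    for term in text.split():
        INDEX[term].add(doc_id)

def hybrid_score(query: str, vector_scores: dict, alpha: float = 0.5) -> dict:
    """Blend a (here fabricated) vector score with keyword-overlap recall."""
    terms = query.lower().split()
    keyword = {d: sum(1 for t in terms if d in INDEX[t]) / len(terms) for d in DOCS}
    return {d: alpha * vector_scores.get(d, 0.0) + (1 - alpha) * keyword[d] for d in DOCS}

# Slangy ticket: the fabricated vector scores are nearly flat,
# but "odor" still hits the inverted index and separates the docs.
scores = hybrid_score("weird odor situation", {"quiet-hours": 0.05, "hazard": 0.06})
print(max(scores, key=scores.get))  # hazard
```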

&lt;h2&gt;
  
  
  Ethics and Responsibility
&lt;/h2&gt;

&lt;p&gt;Even though this repository is synthetic, the language of safety and access deserves care. I wrote the tickets and policies to resemble realistic operational phrases without copying any private incident text. From my perspective, that distinction matters: public demos should never repurpose confidential work orders.&lt;/p&gt;

&lt;p&gt;I also want to be explicit that automated routing for physical risk scenarios should never be the only line of defense. The PoC emits labels; it does not dispatch tradespeople, text students, or close tickets. Treat it as a learning scaffold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Roadmap (Personal Experiments Only)
&lt;/h2&gt;

&lt;p&gt;If I revisit this repository, several extensions seem worthwhile, still on my own time and still labeled experimental.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Replace TF-IDF with embeddings while keeping the same layer boundaries, then measure how often the router disagrees with the baseline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add a second corpus for local procedures that updates more frequently than policy, mimicking how some campuses separate “policy” from “playbook.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Introduce evaluation harnesses with labeled tickets, even if synthetically expanded, so precision and recall become measurable instead of eyeballed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrap the router output as JSON for a tiny local web UI if I want a friendlier demo than the terminal alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add property tests that generate random word order permutations for tickets to see whether retrieval remains stable enough for my tolerance thresholds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore calibration for urgency scoring so numeric outputs map to observed human labels in a small user study. That is far beyond this PoC, but worth naming as a scientific next step rather than an engineering tweak.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
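&lt;p&gt;The word-order property test from item 5 can be prototyped in a few lines. One caveat worth stating up front: a bag-of-words scorer is order-invariant by construction, so the property holds trivially here; it only becomes a real guardrail once an order-sensitive embedding model is swapped in. All names are illustrative:&lt;/p&gt;

```python
import random

# Tiny stand-in corpus; not the repository's policy data.
DOCS = {
    "hazard": "chemical odor fume spill response procedure",
    "quiet-hours": "noise complaints and quiet hours in residence halls",
}

def top_doc(query: str, docs: dict) -> str:
    words = set(query.lower().split())
    return max(docs, key=lambda d: len(words & set(docs[d].split())))

def stable_under_shuffle(query: str, docs: dict, trials: int = 25, seed: int = 7) -> bool:
    """Property: shuffling ticket word order must not change the top document."""
    rng = random.Random(seed)
    baseline = top_doc(query, docs)
    tokens = query.split()
    for _ in range(trials):
        rng.shuffle(tokens)
        if top_doc(" ".join(tokens), docs) != baseline:
            return False
    return True

print(stable_under_shuffle("chemical odor near the lab", DOCS))  # True
```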

&lt;p&gt;None of those items are promises; they are directions I might explore when curiosity and spare time align.&lt;/p&gt;

&lt;h2&gt;
  
  
  Documentation habits that helped me
&lt;/h2&gt;

&lt;p&gt;While building this, I noticed that my velocity correlated with how aggressively I documented assumptions in the README. Not tutorial prose for beginners, but crisp statements of non-goals. When I wrote “illustrative SLA,” I stopped myself from secretly believing the numbers meant more than they did. When I listed repository layout, I caught a path mistake before publishing.&lt;/p&gt;

&lt;p&gt;I also kept commit messages boring on purpose. Small repositories deserve readable history too. If I ever return to this code months later, I want future me to recognize intent without decoding clever commit titles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducibility notes
&lt;/h2&gt;

&lt;p&gt;Reproducibility is part engineering and part discipline. Pinning dependencies avoids the subtle drift that happens when scikit-learn changes defaults across versions. Keeping random seeds matters when you add stochastic components; this PoC has none, which is a feature for now. Recording the Python version in the README is a small touch that prevents “works on my machine” surprises.&lt;/p&gt;

&lt;p&gt;If you fork the repository, consider writing down your own environmental constraints. I develop on macOS, but the code should run anywhere Python runs. If matplotlib backend issues appear on a headless server, switching to a non-interactive backend is a known fix; I did not need it for local runs.&lt;/p&gt;
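&lt;p&gt;For reference, the known fix is selecting the non-interactive Agg backend before pyplot is imported. The snippet below writes to a temporary path rather than the repository's output folder:&lt;/p&gt;

```python
# Select the non-interactive Agg backend BEFORE pyplot is imported;
# this is the standard fix when no display server is available.
import matplotlib
matplotlib.use("Agg")

import tempfile
from pathlib import Path
import matplotlib.pyplot as plt

out_path = Path(tempfile.mkdtemp()) / "routing_bucket_distribution.png"
fig, ax = plt.subplots(figsize=(8, 4.5))
ax.bar(["plumbing", "hvac", "av"], [3, 2, 1], color="#2c5282")
fig.tight_layout()
fig.savefig(out_path, dpi=120)
plt.close(fig)
print(out_path.exists())  # True
```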

&lt;h2&gt;
  
  
  A longer note on evaluation and what I did not measure
&lt;/h2&gt;

&lt;p&gt;Evaluation is where hobby projects either mature or remain toys. In this PoC, I relied on manual inspection: reading retrieved snippets, checking whether the science lab odor ticket escalated appropriately, and scanning the distribution chart for obvious skew. That approach is acceptable for a first public version because the goal was architectural clarity, not leaderboard scores.&lt;/p&gt;

&lt;p&gt;If I were to formalize evaluation without access to private tickets, I would start by synthesizing a larger labeled set from templates. I would vary lexical overlap, negation, and multi-issue messages. I would measure top-k recall for policy topics and confusion matrices for route buckets. I would also track stability: if I paraphrase a ticket without changing meaning, does retrieval remain broadly consistent? That kind of robustness test often reveals brittleness faster than aggregate accuracy.&lt;/p&gt;
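&lt;p&gt;Top-k recall is simple enough to sketch without any framework. The labeled batch below is fabricated purely to show the computation, not measured data:&lt;/p&gt;

```python
def top_k_recall(ranked_ids: list, relevant_id: str, k: int) -> float:
    """1.0 if the gold policy appears in the top-k retrieved IDs, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Hypothetical labeled batch: (ranked retrieval output, gold policy id).
BATCH = [
    (["hazard", "quiet-hours", "access"], "hazard"),
    (["quiet-hours", "access", "hazard"], "hazard"),
    (["access", "hazard", "quiet-hours"], "access"),
]

recall_at = {k: sum(top_k_recall(r, gold, k) for r, gold in BATCH) / len(BATCH) for k in (1, 3)}
print(recall_at)
```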

&lt;p&gt;I did not build those harnesses here because they expand repository scope quickly. Test data management is a project of its own. Still, naming the gap matters. Readers should not mistake a crisp table output for validated operational performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visual assets and why the GIF exists
&lt;/h2&gt;

&lt;p&gt;The repository includes Mermaid diagrams rendered to PNG via mermaid.ink because I wanted graphics that match the tone of technical documentation rather than stock photography. The animated GIF pairs a terminal sequence with a matplotlib chart to mirror how I actually work: run a script, scan the table, glance at a plot. Creating the GIF took extra time, but from my perspective it communicates intent faster than static screenshots alone.&lt;/p&gt;

&lt;p&gt;I followed a strict palette conversion pipeline for the GIF to reduce flicker on platforms that are picky about animated assets. The details are mundane image processing, but the outcome matters when you publish where rendering quirks exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal lessons I did not expect
&lt;/h2&gt;

&lt;p&gt;While wiring the urgency heuristic, I expected retrieval to dominate mistakes. Instead, I often found myself adjusting keyword lists because the language of urgency in synthetic tickets did not match the language of policy chunks. That mismatch reminded me that retrieval and rules interact; you cannot tune one in isolation forever.&lt;/p&gt;

&lt;p&gt;Another surprise was how quickly the Rich table made the PoC feel “real.” Presentation is not substance, but human perception matters when you judge your own progress. I kept the table formatting minimal on purpose. Dense color in terminal output ages poorly and distracts from the routing story.&lt;/p&gt;

&lt;p&gt;Finally, I was reminded how much I enjoy small JSON corpora. They are easy to diff in code review, easy to version, and easy to explain to someone unfamiliar with machine learning. If I had started with a database, I would have spent more time on migrations than on routing logic.&lt;/p&gt;

&lt;p&gt;If you made it this far, thank you for reading carefully. I wrote this piece to document not only what the code does, but why I accepted certain limitations while rejecting shortcuts that would have made the demo flashier yet less honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;What I take away from this experiment is that “context engineering” is not only a prompt-design exercise. It is also an exercise in deciding what information deserves its own channel, how much structure to add before model calls, and how to leave an audit trail. The campus framing helped me keep those questions grounded.&lt;/p&gt;

&lt;p&gt;If you try the code, I hope you modify the policy JSON and watch how retrieval shifts. In my experience, that kind of hands-on perturbation teaches more than reading another listicle about embeddings.&lt;/p&gt;

&lt;p&gt;I also keep returning to a humbling point: good operations depend on people who answer phones, visit sites, and coordinate trades. Software can sort and summarize, but it cannot replace the embodied knowledge of how a specific building behaves in winter. This PoC stays modest because that human layer matters more than any script I would ship on a weekend.&lt;/p&gt;

&lt;p&gt;As a final note, this article is an experimental write-up based on a hobby repository. It is not production guidance, not campus policy, and not affiliated with any organization I work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: python, context, machinelearning, campus&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>context</category>
      <category>machinelearning</category>
      <category>campus</category>
    </item>
    <item>
      <title>ShootMesh-AI: A Transparent “Production Office” for Staged Film-and-TV Days</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Thu, 02 Apr 2026 02:21:04 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/shootmesh-ai-a-transparent-production-office-for-staged-film-and-tv-days-131a</link>
      <guid>https://forem.com/exploredataaiml/shootmesh-ai-a-transparent-production-office-for-staged-film-and-tv-days-131a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l0vcf6e4cbyrpmfirf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l0vcf6e4cbyrpmfirf1.png" alt="Title" width="325" height="886"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I modeled department proposals, a merge policy, and an audit ledger without hiding the coordination rules&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built a small Python proof of concept called ShootMesh-AI that behaves like a miniature production office for a synthetic shooting day. Separate modules pretend to be scheduling, locations, safety, and equipment voices. Each one proposes actions when a staged incident appears. A coordinator applies a deterministic merge policy with an explicit priority order, and a ledger records what was chosen and how many minutes the plan slipped. The runnable code prints an ASCII table to the terminal and writes two charts to disk. From my perspective, the interesting part is not cleverness for its own sake; it is inspectability. I can read the policy, replay the ledger, and explain why a particular department “won” a merge at a particular step. The repository is public for learning, and the article is written as a personal experiment rather than as on-set guidance or employer-backed work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I have spent a fair amount of time thinking about how software can support coordination without pretending to be a substitute for human judgment. That tension shows up everywhere, but it is especially easy to picture on a location day where weather, permits, noise, gear faults, and human delays interact. I am not claiming that a script can run a set. What I am claiming, from my own tinkering, is that a disciplined simulation can help clarify what “coordination logic” even means when multiple subsystems disagree.&lt;/p&gt;

&lt;p&gt;In my experiments, I wanted a structure that felt like a company in miniature: different functions propose, a single merge point decides, and a ledger preserves accountability. I wrote this article because I wanted to document the design choices I made while building ShootMesh-AI, the trade-offs I accepted, and the ways I would extend the idea if I kept pushing on it. I put it this way because I value narratives that show the reasoning path, not only the final file tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article walks through an experimental architecture for multi-agent style coordination applied to a film and television production day scenario. The agents are not large language models in this repository. They are code modules with narrow responsibilities. That was intentional. I chose determinism so that two runs with the same code path produce the same ledger, which makes the merge policy testable without stochastic noise.&lt;/p&gt;

&lt;p&gt;The storyline of the PoC is simple. A list of synthetic incidents arrives in order. For each incident, every department contributes one or more proposals. Each proposal carries a priority class, an estimated minute shift, and a confidence score used only for tie breaking under fixed rules. The coordinator selects a winner, records the outcome, and moves on. After the last incident, the program summarizes how often each department’s proposal won and plots cumulative slippage across the day.&lt;/p&gt;
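&lt;p&gt;To make the proposal and merge mechanics concrete, here is a minimal sketch under assumed names (the repository's actual types may differ): the priority class decides, and confidence only breaks ties within a class, which keeps two runs over the same incidents deterministic:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical shapes for a department proposal.
@dataclass(frozen=True)
class Proposal:
    department: str
    priority: int      # lower value = higher priority class
    minute_shift: int  # estimated schedule slip in minutes
    confidence: float  # used ONLY to break ties within a priority class

def merge(proposals: list) -> Proposal:
    """Deterministic merge: best priority class wins; confidence breaks ties."""
    return min(proposals, key=lambda p: (p.priority, -p.confidence))

winner = merge([
    Proposal("equipment", priority=2, minute_shift=15, confidence=0.9),
    Proposal("safety", priority=1, minute_shift=30, confidence=0.6),
    Proposal("scheduling", priority=2, minute_shift=10, confidence=0.7),
])
print(winner.department)  # safety
```

&lt;p&gt;Safety wins here despite the lowest confidence, because the merge key sorts on priority class first; that is the kind of outcome I can cite line by line when reading the ledger.&lt;/p&gt;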

&lt;p&gt;I also discuss what this miniature setup suggests about real coordination systems: transparency, traceability, and the difference between a policy you can cite and a model output you can only guess about. To be clear, no collaborative group was involved here; this is a solo build and a solo narrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I kept the stack boring on purpose. The project is pure Python with &lt;code&gt;matplotlib&lt;/code&gt; for charts and the standard library everywhere else. There is no database, no message broker, and no web server. I made that choice so the article could focus on coordination mechanics rather than infrastructure.&lt;/p&gt;

&lt;p&gt;From where I stand, a small PoC should compile quickly in the reader’s mind. If I had introduced a queueing system and a container orchestrator, I would have spent more time explaining operations than explaining coordination. I may revisit richer infrastructure later, but this repository stays intentionally compact.&lt;/p&gt;

&lt;p&gt;The diagrams in the article were rendered from Mermaid definitions into PNG images using the mermaid.ink approach, then checked into the repository so the README and the article can reference stable URLs. The animated GIF follows a terminal-first layout and then transitions into a simple bar chart; I generated it locally with Pillow using a single global color palette to avoid flicker on platforms that are picky about GIF encoding.&lt;/p&gt;
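&lt;p&gt;For readers who want to reproduce the diagram step, my understanding of the mermaid.ink pattern is that you base64-encode the raw Mermaid source and append it to the image endpoint. The helper below is a sketch of that idea, not code from the repository, and the diagram text is a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import base64


def mermaid_ink_url(diagram: str) -&amp;gt; str:
    """Build a mermaid.ink image URL from raw Mermaid source.

    mermaid.ink serves images at https://mermaid.ink/img/&amp;lt;base64&amp;gt;;
    URL-safe base64 keeps '+' and '/' out of the path.
    """
    encoded = base64.urlsafe_b64encode(diagram.encode("utf-8")).decode("ascii")
    return f"https://mermaid.ink/img/{encoded}"


url = mermaid_ink_url("graph LR; Incidents --&amp;gt; Proposals --&amp;gt; Coordinator --&amp;gt; Ledger")
# Fetching and committing the PNG is then a single urllib.request.urlopen call.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;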

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you are evaluating how to structure agent-like systems, you might be weighing opaque end-to-end models against explicit policies. This write-up shows a middle path: small agents propose, but the merge policy remains readable code. I think that pattern matters for debugging and for stakeholder trust, even in an experimental setting.&lt;/p&gt;

&lt;p&gt;You might also read it if you want a concrete Python layout that separates types, scenario data, agents, coordination, reporting, and plotting. I separated those concerns because I did not want a single “god file” to become the hiding place for assumptions. In my opinion, the discipline of folders is part of the message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;p&gt;I started by naming the boundaries. A production office is not one brain; it is a set of recurring conversations. Scheduling optimizes order and protects certain blocks. Locations negotiates physical reality and offers alternates. Safety refuses shortcuts when the refusal is non-negotiable. Equipment translates technical faults into workable mitigations. I did not try to encode every craft; I chose four to keep the PoC legible.&lt;/p&gt;

&lt;p&gt;The coordinator encodes a value ordering that I can defend in prose: safety first, then schedule integrity, then creative accommodations, then cost-sensitive mitigations. That ordering is not universal. It is a hypothesis I used to make the demo behave sensibly when proposals conflict. The important part, in my view, is that the ordering lives in one place and can be changed without rewiring the entire codebase.&lt;/p&gt;

&lt;p&gt;I also wanted a ledger that reads like an audit trail. Each step stores the incident text, the winning department label, and the minutes shifted by the chosen proposal. If I later attach identifiers for scenes or setups, the ledger format still works. I wrote the ledger as a list of structured entries rather than as unstructured log lines to keep downstream analysis simple.&lt;/p&gt;
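&lt;p&gt;A minimal sketch of that entry shape looks like the following. The field names mirror the ones the engine code uses later (&lt;code&gt;chosen_from&lt;/code&gt;, &lt;code&gt;resulting_shift_min&lt;/code&gt;), but treat the exact set as illustrative rather than the repository's definition.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LedgerEntry:
    step: int                 # 1-based incident index
    incident: str             # the incident text as it arrived
    chosen_from: str          # winning department label
    resulting_shift_min: int  # minutes shifted by the chosen proposal
    rationale: str = ""       # human-readable merge explanation


entry = LedgerEntry(
    step=1,
    incident="Talent delayed 20 minutes",
    chosen_from="scheduling",
    resulting_shift_min=25,
    rationale="selected scheduling (schedule) with confidence 0.82",
)
# Structured entries stay analyzable: asdict(entry) is ready for JSON or a dataframe.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because each entry is frozen, downstream reporting can serialize the ledger without worrying about a later step mutating history.&lt;/p&gt;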

&lt;p&gt;The visuals mirror the architecture. The title diagram emphasizes the coordinator and ledger as the spine, with department agents feeding proposals inward. The architecture diagram compresses the runtime pipeline to a short chain: incidents, parallel proposals, priority sort, ledger append, artifacts. The sequence diagram is intentionally modest; it exists to show message-like flow even though everything runs in-process. The flow diagram captures the loop and the termination condition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fricytpwx9fbuvspxxugo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fricytpwx9fbuvspxxugo.png" alt="Architecture overview" width="800" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpls1dpjolvbw6bdh2ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpls1dpjolvbw6bdh2ob.png" alt="Agent messaging shape" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faj3z1d8noxulsb8psl97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faj3z1d8noxulsb8psl97.png" alt="Process flow" width="444" height="712"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Get Cooking
&lt;/h2&gt;

&lt;p&gt;The runnable project lives here: &lt;code&gt;https://github.com/aniket-work/ShootMesh-AI&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Below I split the code at meaningful boundaries and explain what each block is doing in the narrative voice I used while building it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types and explicit priorities
&lt;/h3&gt;

&lt;p&gt;I defined priorities as an enumeration and kept proposals immutable. That choice reduces the chance that a later step mutates a proposal accidentally. It also makes the merge logic easier to read because the priority tier is a first-class field rather than a buried string constant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;SAFETY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safety&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SCHEDULE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;CREATIVE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;COST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Priority&lt;/span&gt;
    &lt;span class="n"&gt;minutes_shift&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It establishes the vocabulary for proposals and forces every proposal to carry the fields the coordinator expects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; I wanted the merge layer to depend on stable shapes. Frozen dataclasses made that contract obvious while I iterated on agent behaviors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; When I first sketched the PoC, I used loose dictionaries. The code worked, but errors showed up late. Moving to typed objects caught mistakes earlier and made the article easier to write because the data model became self-documenting.&lt;/p&gt;
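&lt;p&gt;The immutability contract is easy to demonstrate in isolation: any attempt to rewrite a field on a frozen dataclass raises at the mutation site instead of silently corrupting a later merge. This is an abbreviated sketch, trimmed to the two fields the demo needs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Proposal:  # abbreviated: only the fields this demo needs
    agent: str
    confidence: float


p = Proposal("scheduling", 0.82)
try:
    p.confidence = 0.99  # a hypothetical buggy "adjustment" downstream
except FrozenInstanceError:
    print("mutation rejected; the proposal the coordinator saw is the one logged")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;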

&lt;h3&gt;
  
  
  Coordinator merge policy
&lt;/h3&gt;

&lt;p&gt;The coordinator ranks proposals using a tuple key that reflects my stated ordering. Safety should dominate when it appears. Within a tier, higher confidence wins. When confidence ties, the policy prefers smaller minute shifts because I wanted the demo to bias toward less disruptive fixes when all else is equal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PRIORITY_ORDER&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SAFETY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SCHEDULE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CREATIVE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose_proposal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proposals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;proposals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no proposals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sort_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PRIORITY_ORDER&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minutes_shift&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proposals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sort_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;rationale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selected &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with confidence &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alternatives considered: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rationale&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It selects a single winning proposal and constructs a short human-readable rationale string for the ledger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; I deliberately avoided machine learning here. The point of the PoC is to show a policy you can print on a whiteboard. If I had hidden the ranking inside a model, I would have undermined the story about transparency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Tie-breaking rules are never neutral. I had to document why minute shifts mattered to me as a tie breaker. Without that narrative, readers might assume the numbers were arbitrary when they were a deliberate preference for smaller disruptions.&lt;/p&gt;
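&lt;p&gt;To make the tie-breaking concrete, here is a self-contained run of that ranking with the types re-declared inline so the snippet stands on its own. Using &lt;code&gt;max&lt;/code&gt; with the same key is equivalent to sorting descending and taking the head. Within a tier, equal confidence falls through to the smaller minute shift; across tiers, safety dominates regardless of confidence.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass
from enum import Enum


class Priority(str, Enum):  # trimmed to the two tiers this demo exercises
    SAFETY = "safety"
    SCHEDULE = "schedule"


@dataclass(frozen=True)
class Proposal:
    agent: str
    priority: Priority
    minutes_shift: int
    confidence: float


ORDER = (Priority.SAFETY, Priority.SCHEDULE)


def sort_key(p: Proposal):
    # Same shape as the coordinator's key: tier dominates, then confidence,
    # then a preference for smaller minute shifts.
    return (-ORDER.index(p.priority), p.confidence, -p.minutes_shift)


a = Proposal("scheduling", Priority.SCHEDULE, minutes_shift=25, confidence=0.70)
b = Proposal("locations", Priority.SCHEDULE, minutes_shift=10, confidence=0.70)
c = Proposal("safety", Priority.SAFETY, minutes_shift=40, confidence=0.50)

print(max([a, b, c], key=sort_key).agent)  # safety wins across tiers
print(max([a, b], key=sort_key).agent)     # locations wins the in-tier tie
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;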

&lt;h3&gt;
  
  
  Department agents with narrow expertise
&lt;/h3&gt;

&lt;p&gt;Each agent subclass proposes only when it has something credible to say. For example, safety proposes hard stops for weather and gear faults, while scheduling proposes resequencing when talent is late. I avoided giving safety a generic fallback proposal on every incident because it caused safety to over-win merges in early versions of the PoC. That was a useful bug because it reminded me that “always-on” agents can drown out specialized voices if I am not careful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SchedulingAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DepartmentAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;propose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Incident&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;IncidentKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TALENT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Slide block 2 after lunch; protect golden hour exterior&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SCHEDULE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;minutes_shift&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;Proposal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Re-sequence setups to absorb slip without dropping pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SCHEDULE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;minutes_shift&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It returns one or more proposals for a given incident, scoped to scheduling’s perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; I wanted the agents to feel like departments with partial visibility, not omniscient oracles. In a real system, each proposal might be backed by data from call sheets, travel estimates, or scout photos. Here, the proposals are illustrative, but the structure is what I expect to reuse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; The difference between a demo and a toy is whether the extension points are honest. I left room to replace proposal text with retrieved facts later without rewriting the coordinator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Engine and run summary
&lt;/h3&gt;

&lt;p&gt;The engine loops over the incidents, gathers proposals, calls the coordinator, and accumulates ledger entries. It also computes a simple winner share per department label by counting wins and normalizing so the shares sum to one. That statistic is what feeds the bar chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_production_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RunSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;
    &lt;span class="n"&gt;incidents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;default_day_incidents&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;all_department_agents&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LedgerEntry&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;slippage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incidents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;proposals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_proposals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proposals&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;slippage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resulting_shift_min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chosen_from&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chosen_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

    &lt;span class="n"&gt;total_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="n"&gt;agent_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RunSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;total_incidents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incidents&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;total_slippage_min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slippage&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;agent_weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ledger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It executes the full synthetic day and returns both the ledger and aggregate metrics for plotting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; I kept orchestration separate from proposal logic so I can test the coordinator against handcrafted proposal lists if I want targeted unit tests later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Even a short loop benefits from clear naming. When I wrote the first draft, I inlined too much and struggled to explain it in prose. Pulling &lt;code&gt;collect_proposals&lt;/code&gt; and &lt;code&gt;run_step&lt;/code&gt; into helpers made the article easier to write and the code easier to read.&lt;/p&gt;
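&lt;p&gt;The winner-share statistic the engine computes reduces to a counting exercise, which is worth seeing in isolation. This is a sketch of the idea rather than the repository function, and the win list here is made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

# One winning department label per incident, in ledger order (illustrative data).
wins = ["safety", "scheduling", "scheduling", "equipment", "scheduling"]

counts = Counter(wins)
total = sum(counts.values()) or 1  # the `or 1` guard mirrors the engine's divide-by-zero defense
shares = {dept: n / total for dept, n in counts.items()}

print(shares)  # {'safety': 0.2, 'scheduling': 0.6, 'equipment': 0.2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;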

&lt;h3&gt;
  
  
  Entry point and artifacts
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;main.py&lt;/code&gt; script prints a banner, renders the ASCII table, and writes PNGs into &lt;code&gt;output/&lt;/code&gt;. I kept console output friendly because I knew the GIF would lean on the terminal aesthetic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_production_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ledger_to_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ascii_table&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shift_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident (truncated)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;plot_agent_influence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_influence.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;plot_slippage_timeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resulting_shift_min&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slippage_timeline.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt; It turns the run summary into human-readable terminal output and chart files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I structured it this way:&lt;/strong&gt; I wanted a single command that a reader could run right after cloning, with no configuration step. That lowers the friction of experimentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; ASCII tables photograph well in articles and GIFs. That sounds trivial, but visual continuity matters when you are trying to show a serious PoC rather than a loose notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
Let's Set Up
&lt;/h2&gt;

&lt;p&gt;Here are the setup steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository from &lt;code&gt;https://github.com/aniket-work/ShootMesh-AI.git&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a virtual environment in the repository root using &lt;code&gt;python3 -m venv venv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Activate the environment using your platform’s standard activation command&lt;/li&gt;
&lt;li&gt;Install dependencies with &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;python main.py&lt;/code&gt; and confirm that &lt;code&gt;output/&lt;/code&gt; contains PNG charts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I deliberately do not require API keys or cloud accounts for the demo. In my opinion, that constraint is a feature for readers who want to reproduce results on a laptop without standing up services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When I run the program locally, I expect to see a table with five rows for the default scenario and a total slippage figure derived from the chosen proposals. I also expect the charts to reflect which departments won merges most often. In the version I pinned while writing, safety wins more frequently than others because the staged day includes several incidents where safety proposals legitimately sit at the top of the priority ordering. That outcome is not a claim about real-world frequency; it is a consequence of the synthetic mix I authored.&lt;/p&gt;

&lt;p&gt;If you compare the terminal output to the ledger logic, you should see a direct mapping. That traceability is the point. I am not asking the reader to trust a hidden score; I am asking the reader to inspect the policy and the ledger together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrzj77bth8or0vf5m2lh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrzj77bth8or0vf5m2lh.gif" alt="Cover animation" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge cases I considered while iterating
&lt;/h2&gt;

&lt;p&gt;Even a toy coordination loop has edge cases worth mentioning. Empty proposal lists are invalid for a decision step, so the coordinator raises rather than silently picking a winner. If I ever model a department outage, I need an explicit representation of “no proposal” rather than an empty list that crashes the merge. I would also add typed incident identifiers so repeated text does not accidentally collapse distinct events in analytics.&lt;/p&gt;

&lt;p&gt;Another edge case is conflicting notions of “confidence.” In this PoC, confidence is a scalar used for tie breaking within a priority tier. It is not calibrated probability. If I elevated this experiment, I would separate calibrated likelihoods from policy weights, and I would document the difference aggressively to avoid misinterpretation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deeper dive on policy design and alternatives
&lt;/h2&gt;

&lt;p&gt;When I began sketching ShootMesh-AI, I considered three broad approaches before settling on the explicit priority stack I implemented. The first approach was a single scoring function that combined every attribute into one number. It would have been compact, but it would also have hidden the moral and operational precedence that I wanted to discuss in plain language. The second approach was a rule engine with dozens of special cases. I rejected that for the PoC because it would have looked like accidental complexity rather than principled structure. The third approach, which I adopted, was a small ordered list of priority classes with deterministic tie breaks inside each class.&lt;/p&gt;

&lt;p&gt;From my perspective, the third approach mirrors how a production office often argues: first establish what is non-negotiable, then argue schedule integrity, then discuss creative compromises, and only then talk about money-saving shortcuts. Reality is messier than code, but the metaphor helped me keep the modules coherent. I also appreciated that readers could disagree with my ordering and still understand exactly what to change.&lt;/p&gt;

&lt;p&gt;I thought about adding dynamic weights that shift depending on the day’s risk profile. That would be interesting, but it would also introduce a second story about how those weights are chosen and audited. I decided that a static policy produced a cleaner narrative for a first public version. If I revisit the idea, I would likely expose weights as data loaded from a file with version metadata so runs can be tied to a named policy document.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability, replay, and the value of a boring ledger
&lt;/h2&gt;

&lt;p&gt;One lesson I keep relearning in my experiments is that the most valuable artifact is often not the final chart but the sequence of decisions that produced it. The ledger in ShootMesh-AI is small, yet it already supports a basic kind of replay: you can read the incident text, see which department label won, and connect that to the minute shift recorded for the step. In a larger system, I would add correlation identifiers and timestamps, but I would still keep the ledger append-only in spirit.&lt;/p&gt;

&lt;p&gt;Replay matters because coordination bugs are often stories told incorrectly. If a plan looks wrong, you want to know whether the inputs were wrong, whether a proposal was malformed, or whether the merge policy did exactly what it promised and the surprise is simply an uncomfortable trade-off. In my opinion, separating those explanations is half the battle.&lt;/p&gt;

&lt;p&gt;I also considered emitting structured JSON alongside the ASCII table. I did not add it to the repository because I wanted to keep the surface area small, yet the design naturally extends in that direction. A JSON stream would make it easier to feed downstream dashboards without parsing terminal output, and it would still preserve transparency if the JSON schema is documented.&lt;/p&gt;
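&lt;p&gt;As a sketch of that direction, a JSON Lines exporter could sit next to the ASCII table. The attribute names below (&lt;code&gt;department&lt;/code&gt;, &lt;code&gt;incident&lt;/code&gt;, &lt;code&gt;resulting_shift_min&lt;/code&gt;) mirror what the ledger exposes elsewhere in this article, but the exporter itself is hypothetical, not repository code:&lt;/p&gt;

```python
import json

def ledger_to_jsonl(ledger, path):
    """Write one JSON object per ledger entry, preserving entry order.

    Hypothetical extension: the field names follow the ledger attributes
    used in the plotting code, and the schema stays flat on purpose so a
    downstream dashboard can consume it without custom parsing.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for step, entry in enumerate(ledger):
            record = {
                "step": step,
                "department": entry.department,
                "incident": entry.incident,
                "resulting_shift_min": entry.resulting_shift_min,
            }
            fh.write(json.dumps(record) + "\n")
```

&lt;p&gt;Documenting that schema alongside the repository would preserve the same transparency the ASCII table provides today.&lt;/p&gt;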

&lt;h2&gt;
  
  
  What I deliberately did not build
&lt;/h2&gt;

&lt;p&gt;Scope discipline was important. I did not build a web UI, a mobile app, or a collaborative editor. I did not integrate calendars, maps, or real-time messaging. I did not connect to vendor APIs for equipment rental houses or location databases. None of those omissions reflect a belief that they are unimportant; they reflect a desire to keep the PoC intellectually honest about what is being validated.&lt;/p&gt;

&lt;p&gt;I also did not build a learned coordinator. There are fascinating research directions where a model learns to mimic human merge decisions from historical days. That could be powerful, but it would shift the article’s emphasis away from inspectability and toward data collection and evaluation methodology. I wanted the first public version to stand on deterministic logic so a reader could diff two commits and understand behavioral changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducibility, randomness, and the seed parameter
&lt;/h2&gt;

&lt;p&gt;The engine accepts a seed argument for symmetry with common machine learning utilities, but the current scenario path is deterministic. I kept the parameter because I anticipate a future where proposals might be sampled or perturbed during stress tests. In the present code, the seed is a quiet placeholder. I am mentioning it explicitly so nobody assumes hidden randomness where none exists.&lt;/p&gt;

&lt;p&gt;When I add stochastic elements later, I would log the seed alongside the policy version in the ledger header. That pattern has served me well in other experiments because it closes the loop between “interesting run” and “parameters I can share.”&lt;/p&gt;
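&lt;p&gt;The pattern is small enough to show inline. This is a hypothetical header record, not code from the repository, and the field names are my own:&lt;/p&gt;

```python
import json
import time

def ledger_header(seed, policy_version):
    """Return a run header tying a run to its seed and a named policy.

    Hypothetical sketch: written once at the top of the ledger so an
    interesting run can be reproduced from shared parameters alone.
    """
    return {
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "seed": seed,
        "policy_version": policy_version,
    }

print(json.dumps(ledger_header(seed=7, policy_version="policy-v1")))
```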

&lt;h2&gt;
  
  
  Naming, mental models, and why “agent” is overloaded
&lt;/h2&gt;

&lt;p&gt;The word “agent” means too many things in 2026. In this repository, an agent is simply an object with a &lt;code&gt;propose&lt;/code&gt; method and a role label. It does not imply autonomy in the sense of long-lived goal pursuit. It does not imply a large language model. I picked the term because it matches the mental model of departments proposing alternatives.&lt;/p&gt;

&lt;p&gt;If I renamed everything to “module” or “function,” the architecture would look flatter and more boring, which might be technically accurate but less helpful for readers who think in terms of organizational roles. Language shapes how we reason about responsibility. I kept the agent language while trying not to mystify it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes that showed up in early drafts
&lt;/h2&gt;

&lt;p&gt;In an early draft, safety proposals appeared for every incident with a small minute shift. The merge policy correctly prioritized safety, which meant safety “won” too often and the chart looked unrealistic. I adjusted the safety agent so it only proposes when the incident category genuinely intersects safety-critical concerns. That change produced a more balanced distribution of winners while still respecting hard stops for weather, gear faults, and curfew-style constraints.&lt;/p&gt;

&lt;p&gt;Another failure mode was ambiguous tie breaks when two proposals shared priority and confidence. Sorting with &lt;code&gt;reverse=True&lt;/code&gt; on a tuple required careful attention to signs. I chose to prefer smaller minute shifts when higher-priority fields tie because that matched my narrative about minimizing disruption. If I inverted that sign by mistake, the demo still ran, but the behavior would silently favor larger slips. That is exactly the kind of bug a ledger helps catch: you notice unreasonable minute shifts and trace backward.&lt;/p&gt;
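&lt;p&gt;To make that tie break concrete, here is a minimal sketch of the selection rule as I describe it. The &lt;code&gt;Proposal&lt;/code&gt; fields and the &lt;code&gt;choose_proposal&lt;/code&gt; signature are illustrative rather than the repository's exact definitions: the key sorts by priority, then confidence, and negates the minute shift so that &lt;code&gt;reverse=True&lt;/code&gt; prefers smaller slips on ties. It also raises on an empty list, matching the edge-case behavior I mention earlier.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    department: str
    priority: int      # higher tier wins outright (e.g. safety above cost)
    confidence: float  # scalar tie-breaker inside a tier, not a probability
    shift_min: int     # schedule slip in minutes; smaller is less disruptive

def choose_proposal(proposals):
    """Deterministic merge: raise on an empty step instead of guessing."""
    if not proposals:
        raise ValueError("a decision step requires at least one proposal")
    # reverse=True sorts descending, so shift_min is negated to prefer
    # SMALLER slips when priority and confidence both tie.
    ranked = sorted(
        proposals,
        key=lambda p: (p.priority, p.confidence, -p.shift_min),
        reverse=True,
    )
    return ranked[0]
```

&lt;p&gt;Getting the sign on &lt;code&gt;shift_min&lt;/code&gt; wrong still produces a running demo, which is exactly why the ledger matters: the bug only surfaces when you read the recorded minute shifts and trace backward.&lt;/p&gt;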

&lt;h2&gt;
  
  
  Performance characteristics and why I barely mention them
&lt;/h2&gt;

&lt;p&gt;The PoC runs fast. There is no network I/O, no database, and no heavy numerical work beyond chart rendering. I mention performance not to boast but to set expectations: this article is about coordination clarity, not throughput engineering. If I scaled the idea to thousands of incidents per day with rich attachments, I would profile proposal generation and likely introduce caching for repeated subproblems.&lt;/p&gt;

&lt;p&gt;For now, the performance story is intentionally dull. Dull can be good when the goal is to focus attention on semantics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Broader parallels I see in other industries
&lt;/h2&gt;

&lt;p&gt;While I anchored the story in film and television production because it makes the role separation vivid, the same structural pattern appears whenever multiple specialists contribute competing recommendations under time pressure. In my observation, software architecture reviews, incident response rotations, and even large event logistics share a family resemblance: proposals, merges, and records.&lt;/p&gt;

&lt;p&gt;The ShootMesh-AI PoC is not a universal engine for those domains. The incident taxonomy is narrow, and the proposals are synthetic. Still, the scaffolding transfers: define proposals with explicit priority classes, centralize merge logic, and persist decisions in a ledger that humans can read without specialized tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I would test this more seriously
&lt;/h2&gt;

&lt;p&gt;If I invest more time, I would add targeted unit tests for &lt;code&gt;choose_proposal&lt;/code&gt; with crafted lists that isolate tie breaks. I would add scenario tests that feed a full day and assert golden ledger outputs for a pinned policy version. I would also add a property test that randomly permutes proposal order within a step to verify the merge result is order-invariant aside from true ties.&lt;/p&gt;
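&lt;p&gt;The order-invariance property reads naturally as a small helper. The &lt;code&gt;choose_proposal&lt;/code&gt; argument is whatever merge function is under test; the helper itself is a sketch of the test I describe, not an existing test file:&lt;/p&gt;

```python
import random

def assert_order_invariant(choose_proposal, proposals, trials=25, seed=0):
    """Property-style check: shuffling the proposal list must not change
    the winner. True ties would need a tie-aware comparison instead."""
    baseline = choose_proposal(list(proposals))
    rng = random.Random(seed)
    for _ in range(trials):
        shuffled = list(proposals)
        rng.shuffle(shuffled)
        assert choose_proposal(shuffled) == baseline, "merge is order-sensitive"
```

&lt;p&gt;While wiring this up, tuple-shaped proposals with &lt;code&gt;max&lt;/code&gt; as a stand-in coordinator are enough to exercise the helper before pointing it at the real merge policy.&lt;/p&gt;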

&lt;p&gt;Testing proposals from each agent independently would be worthwhile too. That would require extracting sample incidents and expected proposal shapes into tables. The effort would pay off if the codebase grows to include more elaborate proposal generators.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on documentation and the README
&lt;/h2&gt;

&lt;p&gt;I wrote the README to stand alone for visitors who arrive from the repository before they read any article. That meant mirroring the diagrams, describing setup with exact commands, and stating ethical limits plainly. From my experience, open repositories benefit from a crisp boundary between “what this is” and “what this is not.” I tried to draw that boundary without sounding defensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal reflection on solo work and scope
&lt;/h2&gt;

&lt;p&gt;I built this alone, and that choice shaped the trade-offs. A broader collaboration might have pushed for richer scenarios, more agents, or a polished UI sooner. Working solo let me control the narrative and keep the code small enough to explain in one article. In my view, that trade-off is acceptable for an experimental artifact meant to communicate an idea rather than ship a product.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expanded commentary on charts and what they do not prove
&lt;/h2&gt;

&lt;p&gt;The bar chart of merge outcomes is descriptive, not normative. It answers a narrow question: under this synthetic scenario and this policy, which department labels tended to win. It does not say that safety should win that often in real life, and it does not estimate the quality of creative output. The cumulative slippage line is also a synthetic curve. It helps visualize how minute shifts stack across incidents, but it does not model parallel work streams or overlapping crews.&lt;/p&gt;

&lt;p&gt;I included the charts because the instructions I follow for my publishing experiments emphasize statistical summaries as a discipline. Even when the numbers are toy numbers, the practice of summarizing runs prevents a purely anecdotal story. I appreciate that constraint even when it adds work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the loop with the public repository
&lt;/h2&gt;

&lt;p&gt;The code and images live in a public GitHub repository so you can verify claims directly. If a paragraph in this article ever drifts from the code, the repository becomes the source of truth. I prefer that arrangement over an article that only talks about code without pointing to a commit you can inspect.&lt;/p&gt;

&lt;p&gt;When you open the repository, start with &lt;code&gt;main.py&lt;/code&gt; if you want the fastest path to behavior, then read &lt;code&gt;coordinator.py&lt;/code&gt; if you want the merge policy in one screen, then skim &lt;code&gt;agents.py&lt;/code&gt; if you want to see how proposals vary by incident kind. That reading order mirrors how I explain the project verbally.&lt;/p&gt;

&lt;h2&gt;
  
  
  A longer look at why transparency beats mystery in coordination demos
&lt;/h2&gt;

&lt;p&gt;There is a genre of demo where a system appears intelligent because it produces a slick recommendation without showing intermediate reasoning. That can be impressive in a keynote. It is less helpful when you are trying to teach someone how to engineer coordination responsibly. I wrote ShootMesh-AI in the opposite style. The coordinator is short, the proposals are objects, and the ledger is readable. If the behavior looks wrong, you have a chance to pinpoint why.&lt;/p&gt;

&lt;p&gt;In my opinion, that style aligns with how many engineers prefer to learn. Show the data structures, show the decision rule, show the log line. Then discuss improvements with specificity rather than vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you fork it, here is what I hope you change
&lt;/h2&gt;

&lt;p&gt;Forks are welcome in spirit even if I do not maintain a community process around this repository. If you fork it, I hope you rename the incident taxonomy to match a domain you understand deeply. I hope you rewrite proposals so they cite constraints that matter in your world. I hope you adjust the priority order thoughtfully and document why. I hope you keep a ledger-like artifact even if you change everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Narrative honesty about limitations
&lt;/h2&gt;

&lt;p&gt;I want to repeat plainly that this is a simulation. It does not capture the emotional load of a day running long. It does not capture union rules, child actor hours, or vendor contracts. It does not capture the creative discussion between a director and a cinematographer about whether a compromise is artistically acceptable. Those omissions are not oversights; they are outside the PoC’s scope.&lt;/p&gt;

&lt;p&gt;The value, as I see it, is in the skeleton. Skeletons are easy to criticize for being incomplete, but they are also easier to improve than a monolith that tries to be everything on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional remarks on tooling choices for diagrams and animation
&lt;/h2&gt;

&lt;p&gt;I used Mermaid because it is text-first and reviewable in pull requests. I used mermaid.ink to render PNGs because it keeps the article and README consistent without requiring readers to install diagram tooling. I generated the GIF locally because I wanted control over palette behavior and terminal styling. Each choice prioritized reproducibility and platform compatibility over novelty.&lt;/p&gt;

&lt;h2&gt;
  
  
  What “success” meant for this PoC
&lt;/h2&gt;

&lt;p&gt;I considered the PoC successful when I could run one command, see a legible ledger in the terminal, see charts that matched the ledger, and explain the merge policy without referring to hidden state. That bar sounds low, yet it eliminated a surprising number of flashy but opaque designs I tried first.&lt;/p&gt;

&lt;h2&gt;
  
  
  A final note on writing style and intent
&lt;/h2&gt;

&lt;p&gt;I varied first-person phrasing on purpose. I wanted the article to read like a practiced engineer narrating decisions, not like a marketing page. I avoided implying employer context or production guarantees. I kept the focus on what I built, why I built it, and how you can inspect it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ethics, labor, and authority
&lt;/h2&gt;

&lt;p&gt;Coordination software for creative work sits near sensitive realities: labor rules, safety jurisdiction, and creative authority structures. I am writing this article as a personal experiment, but I still want to state plainly that no repository should override qualified humans on set, and no simulation should be mistaken for compliance tooling. The PoC does not ingest personal data about performers or crew. It uses synthetic strings. Any future version that touches real schedules or personal information would require privacy review and clear data handling practices that go beyond this write-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future roadmap from my notebook
&lt;/h2&gt;

&lt;p&gt;If I continue, I would explore the following directions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Replace static proposal text with retrieval over structured call-sheet data for a single fictional production, still without real people’s information.&lt;/li&gt;
&lt;li&gt;Add a second merge layer that models explicit director or producer overrides with ledger annotations so policy exceptions remain traceable.&lt;/li&gt;
&lt;li&gt;Introduce lightweight property tests that randomize incident order within a day to see if implicit assumptions hide in the scenario list.&lt;/li&gt;
&lt;li&gt;Export the ledger to a small SQLite file for longer runs, keeping the same schema principles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of that is promised. It is a candid list of what I would try next if I kept the experiment alive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;I built ShootMesh-AI because I wanted a hands-on way to think about multi-agent coordination as a set of proposals under a visible policy, not as a single opaque recommendation. From my observation, the hardest part was not typing Python; it was deciding what “winning” means when values genuinely conflict. The code forced me to be honest about trade-offs. The ledger forced me to accept accountability for those trade-offs in a repeatable record.&lt;/p&gt;

&lt;p&gt;In my opinion, that combination is portable beyond the film and television metaphor. Any domain with competing constraints can benefit from the same separation: specialized proposal generators, an explicit merge policy, and an append-only narrative of decisions. I am sharing this experiment in public so others can reuse the structure, critique the policy, or replace the scenario data with something closer to their own world.&lt;/p&gt;

&lt;p&gt;As per my experience, the most useful PoCs are the ones you can explain without hand-waving. I hope this article reads that way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: python, agents, coordination, ai&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>agents</category>
      <category>coordination</category>
      <category>ai</category>
    </item>
    <item>
      <title>Policy-Locked Triage for Messy Citizen Text: A Municipal-Style Routing PoC with SFT and Preference Alignment</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Wed, 01 Apr 2026 02:28:08 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/policy-locked-triage-for-messy-citizen-text-a-municipal-style-routing-poc-with-sft-and-preference-3adc</link>
      <guid>https://forem.com/exploredataaiml/policy-locked-triage-for-messy-citizen-text-a-municipal-style-routing-poc-with-sft-and-preference-3adc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I stabilized noisy 311-style requests with supervised training and reviewer preferences in Python&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghqb5r9ybsikbc6zj5t7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghqb5r9ybsikbc6zj5t7.gif" alt="Cover animation" width="920" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2odnuqgjpvjoun9c36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2odnuqgjpvjoun9c36.png" alt="Title diagram" width="600" height="918"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This write-up is an experimental account of how I built a small routing proof of concept for synthetic municipal-style service requests. The goal was not to ship a city-wide system. From my perspective, the interesting part is the training story: start with labeled text, fit a transparent classifier, then inject reviewer-style preferences so the policy moves toward routes that match operational nuance. The repository is public, fully synthetic, and designed to run on a laptop without calling a hosted large language model. If you are looking for a polished civic product, this is not it. If you are looking for a clean, inspectable playground that mirrors how I think about aligning lightweight agents before any serious conversation about production, this article walks through the motivation, design, code, and limitations in depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I have spent a fair amount of time thinking about the gap between a clever prompt and a dependable workflow. Prompts can feel magical until the edge cases arrive, and edge cases are exactly what public-facing intake systems collect. In my opinion, the hardest part is not the first eighty percent of routing accuracy on obvious phrases. The hardest part is the long tail where departments disagree, citizens mix multiple issues in one sentence, and the right answer depends on local policy interpretation rather than dictionary matching.&lt;/p&gt;

&lt;p&gt;This article documents a solo experiment. I wrote the code, generated the synthetic corpus, and iterated on evaluation plots in isolation. Nothing here reflects a real municipality, a vendor engagement, or a production deployment. I am describing a personal proof of concept that helped me reason about supervised fine-tuning style steps and preference alignment style steps without pretending I trained a billion-parameter model on private data.&lt;/p&gt;

&lt;p&gt;I also want to be clear about scope. I am not claiming breakthrough accuracy numbers on messy real-world corpora. The dataset is intentionally templated so I can focus on architecture and training mechanics. As per my experience, that trade-off is common in early research spikes: control the data so you can see whether your training loop behaves, then worry about realism later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is this article about?
&lt;/h2&gt;

&lt;p&gt;The narrative centers on a routing policy that maps free-text citizen requests to a small set of department labels such as streets, parks, utilities, code enforcement, noise, and a catch-all other bucket. The policy is a multinomial logistic regression model on TF-IDF features. That choice is boring on purpose. Boring models are easy to explain, easy to diff between training iterations, and easy to pair with simple charts when I need to communicate results to someone outside machine learning circles.&lt;/p&gt;
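&lt;p&gt;In scikit-learn terms, the policy is only a few lines. The corpus below is a toy stand-in for the synthetic requests, and the pipeline shape is my sketch of the described design rather than the repository's exact code:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the synthetic 311-style requests and department labels.
texts = [
    "pothole on elm street near the school",
    "broken swing at riverside park",
    "water main leaking on fifth avenue",
    "loud music past midnight on my block",
]
labels = ["streets", "parks", "utilities", "noise"]

# TF-IDF features feeding a multinomial logistic regression, as described.
policy = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
policy.fit(texts, labels)
```

&lt;p&gt;Because the model is linear, the learned coefficient for every vocabulary term is directly inspectable, which is the property this article leans on.&lt;/p&gt;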

&lt;p&gt;The second thread is alignment in a practical, scaled-down sense. Large-scale preference optimization methods are fascinating, but they also come with engineering overhead that does not belong in every story. In my experiments, I approximate reviewer intent by augmenting the training set with duplicated examples that emphasize the chosen route and sparse contrastive hints that steer the model away from plausible wrong routes. The approximation is not a faithful implementation of direct preference optimization. It is a teaching device that still captures the intuition I care about: policies improve when human judgments are folded back into training data in a structured way.&lt;/p&gt;
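&lt;p&gt;My reading of that transformation fits in one function. A preference is a triple of text, chosen route, and rejected route; the chosen route is simply duplicated so the classifier's weights drift toward it. This is a sketch of the approximation I describe, not a faithful direct preference optimization implementation and not necessarily the repository's exact code:&lt;/p&gt;

```python
def apply_preferences(texts, labels, preferences, boost=3):
    """Fold reviewer preference pairs back into the training corpus.

    Each preference is (text, chosen_label, rejected_label). The chosen
    route is repeated `boost` times; the rejected label is ignored here,
    which is where this sketch falls short of true pairwise methods.
    """
    aug_texts, aug_labels = list(texts), list(labels)
    for text, chosen, _rejected in preferences:
        aug_texts.extend([text] * boost)
        aug_labels.extend([chosen] * boost)
    return aug_texts, aug_labels
```

&lt;p&gt;Refitting the same pipeline on the augmented corpus gives the second training phase; comparing the two fits is what the accuracy and macro F1 charts summarize.&lt;/p&gt;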

&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Python 3.10 or newer for broad compatibility with current scientific libraries.&lt;/li&gt;
&lt;li&gt;scikit-learn for TF-IDF vectorization and multinomial logistic regression.&lt;/li&gt;
&lt;li&gt;NumPy for lightweight numerical handling during evaluation.&lt;/li&gt;
&lt;li&gt;Matplotlib for offline charts that summarize accuracy and macro F1 movement between training phases.&lt;/li&gt;
&lt;li&gt;A small amount of standard library code for ASCII tables in the terminal, because I wanted the demo to feel credible when I record or share a terminal capture.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I deliberately avoided deep learning frameworks in this repository. That decision is philosophical as much as technical. In my opinion, a public write-up about routing should let a curious reader inspect coefficients and vocabulary without downloading CUDA drivers. If I later extend the project with embeddings or small transformer heads, that can be a separate milestone with separate disclosure about compute and data handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why read it?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;If you are evaluating how to stage agent development, you might appreciate a story that separates policy training from prompt drafting. I structured the code so the “agent” is really a policy object with predictable inputs and outputs.&lt;/li&gt;
&lt;li&gt;If you care about reproducibility, the synthetic generator and fixed seeds make runs comparable across machines, which is helpful when you are sanity-checking a pipeline before investing in data contracts.&lt;/li&gt;
&lt;li&gt;If you are interested in alignment discussions but want a concrete anchor, the preference augmentation section translates abstract pairwise feedback into a dataset transformation you can read line by line.&lt;/li&gt;
&lt;li&gt;If you want a reminder about ethics and privacy, the article ends with a candid discussion of why synthetic data is the responsible choice for a public artifact in this domain.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let us design
&lt;/h2&gt;

&lt;p&gt;Before typing code, I sketched the constraints I wanted the PoC to respect. First, the system should fail gracefully in the sense that every prediction returns a label with an interpretable basis in term frequency. Second, training should be fast enough that I can iterate during a single evening session. Third, the evaluation should include more than accuracy because class imbalance can hide weakness in rare departments.&lt;/p&gt;
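
&lt;p&gt;The accuracy-versus-macro-F1 point deserves a number. The toy example below, with invented labels, shows how a degenerate majority-class predictor can look strong on accuracy while macro F1 exposes the neglected rare class.&lt;/p&gt;

```python
# Toy illustration of why accuracy alone hides weakness on rare classes:
# a predictor that always answers the majority label looks accurate, but
# macro F1 averages per-class scores and punishes the ignored rare class.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["streets"] * 90 + ["noise"] * 10   # imbalanced: noise is rare
y_pred = ["streets"] * 100                   # degenerate majority predictor

print(accuracy_score(y_true, y_pred))  # 0.9
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
```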

&lt;p&gt;Architecture-wise, I imagined three cooperating layers: ingestion of normalized text, featurization, and a training loop that can run in two phases. The first phase is ordinary supervised training on labeled examples. The second phase reweights and augments the corpus using preference pairs that represent reviewer corrections. The diagram below highlights how I think about the information flow at a high level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo46y731vllw25w4k4no.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo46y731vllw25w4k4no.png" alt="Architecture diagram" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also wanted a sequence-oriented view because stakeholders often think in terms of tickets rather than matrices. The sequence chart is simplified, yet it captures the idea that routing is a service interface problem, not only a modeling problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmq82di7j8lrwum5ho1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmq82di7j8lrwum5ho1g.png" alt="Sequence diagram" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, I drew a training flowchart to keep myself honest about order of operations. When I build without a flowchart, I tend to mix evaluation leakage into augmentation steps by accident. The flowchart is a personal guardrail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0atx7p53uuvapq5lmfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0atx7p53uuvapq5lmfo.png" alt="Flow diagram" width="276" height="944"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us get cooking
&lt;/h2&gt;

&lt;p&gt;The heart of the project lives in a handful of modules under &lt;code&gt;src/civic_triage&lt;/code&gt;. Rather than dump the entire repository into this article, I am highlighting the pieces that taught me the most while implementing the experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Module: labels and synthetic corpus
&lt;/h3&gt;

&lt;p&gt;I started by fixing a small enumeration of departments. Keeping labels explicit avoids silent typos that destroy evaluation integrity. The synthetic generator fills templates with street names, cross streets, and park names drawn from small pools. That approach introduces variety without requiring personally identifiable information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;STREETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;PARKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;UTILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;CODE_ENFORCEMENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_enforcement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;NOISE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;noise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;OTHER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is almost too simple to discuss, but that is the point. By constraining labels to an enum, I make downstream encoding and reporting consistent. When I wrote this, I was thinking about future refactors: if a label changes, I want a single source of truth rather than string literals scattered across scripts.&lt;/p&gt;

&lt;p&gt;The synthetic data builder rotates through templates per department and adds occasional suffix phrases such as “Please route quickly.” Those suffixes inject light noise so the vectorizer cannot rely on a single memorized sentence. In my opinion, small perturbations matter when evaluating linear models because they reveal whether the model leans on a handful of accidental keywords.&lt;/p&gt;
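
&lt;p&gt;A stripped-down version of that template idea, using made-up name pools, might look like the following. It is a sketch of the approach, not the generator shipped in the repository.&lt;/p&gt;

```python
# Hypothetical mini-generator: fill a department template from small pools
# of street names, then occasionally append a light noise suffix so no
# single memorized sentence dominates the vectorizer.
import random

STREETS = ["Maple Ave", "Oak St", "3rd St"]
SUFFIXES = ["", " Please route quickly.", " Thank you."]

def make_streets_example(rng: random.Random) -> str:
    template = rng.choice([
        "Large pothole on {street} near the intersection.",
        "Faded lane markings along {street}.",
    ])
    return template.format(street=rng.choice(STREETS)) + rng.choice(SUFFIXES)

rng = random.Random(0)
print(make_streets_example(rng))
```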

&lt;h3&gt;
  
  
  Module: preference pairs
&lt;/h3&gt;

&lt;p&gt;Preference pairs are where I tried to echo alignment ideas without importing a full preference optimization stack. For a subset of training rows, I simulate a reviewer disagreeing with a plausible wrong route. The chosen label remains the ground-truth department, while the rejected label is sampled from a hand-authored confusion map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;iter_preference_pairs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mistake_rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;confusion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;noise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;noise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... additional mappings ...
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mistake_rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;wrong_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wrong_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;confusion&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;wrong_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wrong_b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;PreferencePair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chosen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rejected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rejected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote it this way so the mistake rate stays an explicit, tunable parameter. If the rate is too high, augmentation dominates and can distort the base distribution. If the rate is too low, the second training phase barely differs from the first. In my experiments, a mid-teens to low-twenties rate produced visible dataset growth without drowning the original signal.&lt;/p&gt;
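
&lt;p&gt;That trade-off is easy to sanity-check arithmetically. Using the default knobs shown in these snippets, each sampled pair contributes the oversampled duplicates plus an occasional hint row, so expected corpus growth is a one-line estimate.&lt;/p&gt;

```python
# Back-of-the-envelope check of how mistake_rate controls augmentation
# volume: each selected pair adds oversample_chosen duplicated rows plus,
# with probability hint_prob, one reviewer-note row.
n_train = 1000
mistake_rate = 0.22
oversample_chosen = 3
hint_prob = 0.15

expected_pairs = n_train * mistake_rate                       # ~220 pairs
expected_new_rows = expected_pairs * (oversample_chosen + hint_prob)  # ~693 rows
print(expected_pairs, expected_new_rows)
```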

&lt;h3&gt;
  
  
  Module: modeling and alignment helper
&lt;/h3&gt;

&lt;p&gt;The classifier pipeline combines TF-IDF with multinomial logistic regression. I kept regularization in a sensible default range and allowed sklearn to pick the multiclass strategy appropriate for the installed version. The alignment helper duplicates chosen-label rows multiple times and occasionally appends a short textual hint that reinforces the negation of a rejected route.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_preference_alignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oversample_chosen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chosen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rejected&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oversample_chosen&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chosen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; [reviewer_note: not &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chosen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I wrote this, I was thinking about how real reviewers often repeat themselves when they correct a mistake. Duplication is a crude stand-in for importance weighting, but it behaves well with linear models and keeps the code approachable for readers who are not ready to implement custom loss functions.&lt;/p&gt;
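
&lt;p&gt;For readers who do want the less crude option, scikit-learn exposes importance weighting directly through the sample_weight argument to fit. The sketch below, with invented rows, is the equivalent in spirit of duplicating a corrected example three times.&lt;/p&gt;

```python
# Sketch of the importance-weighting alternative: instead of appending a
# corrected row three times, pass a weight of 3.0 for it to fit().
# Data is invented for illustration, not the article's corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["pothole on elm", "loud party next door", "streetlight is out"]
labels = ["streets", "noise", "utilities"]
weights = [3.0, 1.0, 1.0]  # the reviewer emphasized the first correction

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels, sample_weight=weights)  # weighting instead of duplication
```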

&lt;h3&gt;
  
  
  Module: reporting
&lt;/h3&gt;

&lt;p&gt;I wanted terminal output that looks like a serious batch job. ASCII tables are not glamorous, yet they photograph well in articles and presentations. The reporting helper measures column widths and draws horizontal rules with plus signs, similar to old-school fixed-width reports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ascii_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;str_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;str_rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;widths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;str_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-+-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;widths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... render rows ...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This block taught me to separate presentation from computation. Metrics are computed once, then rendered. That separation makes it easier to swap the renderer later if I decide to integrate Rich or another library without touching training logic.&lt;/p&gt;
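
&lt;p&gt;The shape of that separation is worth a tiny sketch: metrics live in a plain dictionary, and the renderer is just a function over that dictionary, so swapping in Rich later touches one function. The helper names below are illustrative, not the repository internals.&lt;/p&gt;

```python
# Minimal illustration of the compute-then-render split. Any renderer
# (plain text today, Rich tomorrow) consumes the same metrics dict.
def compute_metrics(y_true, y_pred) -> dict[str, float]:
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {"accuracy": correct / len(y_true)}

def render_plain(metrics: dict[str, float]) -> str:
    return "\n".join(f"{k:>10}: {v:.4f}" for k, v in metrics.items())

m = compute_metrics(["streets", "parks"], ["streets", "noise"])
print(render_plain(m))  # "  accuracy: 0.5000"
```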

&lt;h3&gt;
  
  
  Entry point: orchestration
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;main.py&lt;/code&gt; script wires everything together: generate data, split, train the supervised model, evaluate, build preference pairs, augment, retrain, and write charts. I kept the CLI small so the experiment stays legible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_per_class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pref_mistake_rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_labeled_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_per_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_per_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;train_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;test_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fit_sft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sft_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;metrics_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;iter_preference_pairs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mistake_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pref_mistake_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;pair_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chosen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rejected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;aug_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aug_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_preference_alignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pair_tuples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oversample_chosen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;aligned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fit_sft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aug_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aug_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;aligned_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;metrics_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;plot_metric_bars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sft_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aligned_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ROOT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics_compare.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading this after a break, I still like the explicit ordering. Augmentation happens only after the base model exists, which prevents me from accidentally comparing two augmented variants without a common baseline story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us set up
&lt;/h2&gt;

&lt;p&gt;Step-by-step details can be found in the repository README, and the canonical clone URL is &lt;a href="https://github.com/aniket-work/CivicTriage-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/CivicTriage-AI&lt;/a&gt;. I recommend creating a virtual environment inside the project folder so dependencies remain isolated from other work on the same laptop. On my machine, the setup sequence looks like the following.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository to a working directory of your choice.&lt;/li&gt;
&lt;li&gt;Create a virtual environment with &lt;code&gt;python3 -m venv venv&lt;/code&gt; and activate it.&lt;/li&gt;
&lt;li&gt;Install dependencies with &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;python main.py&lt;/code&gt; with default flags to verify charts appear under &lt;code&gt;output/&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I also keep generated charts copied into &lt;code&gt;images/&lt;/code&gt; for documentation continuity. That step is not strictly required for execution, but it helps keep the README and article visuals aligned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us run
&lt;/h2&gt;

&lt;p&gt;When the pipeline runs successfully, the terminal prints ASCII tables comparing the supervised phase and the alignment-augmented phase. On my synthetic split, metrics often look strong because the dataset is separable by design. That outcome is useful for debugging plumbing, but it is not a claim about real civic text. Interpreting results responsibly matters more than chasing a flashy number.&lt;/p&gt;

&lt;p&gt;The charts compare accuracy and macro F1 between phases. Macro F1 is particularly important when class counts differ, because accuracy alone can hide poor performance on rare labels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkpfmhxe0vxupyma1gfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkpfmhxe0vxupyma1gfm.png" alt="Metrics comparison" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Label distribution visualization is another sanity check. If one department dominates unexpectedly, I know to revisit sampling before trusting any headline metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc62ynwf6yy9h0laopni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc62ynwf6yy9h0laopni.png" alt="Label distribution" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Theory interlude: why linear models still deserve respect
&lt;/h2&gt;

&lt;p&gt;It is tempting to assume that only large models deserve the label “agent.” In my opinion, that assumption mixes capability with agency. A small linear policy can still be embedded inside a broader agentic system that handles tool calls, retrieval, and escalation. The routing policy in this PoC is a single decision node, not the entire automation story. Thinking about nodes separately helps me reason about failure isolation. If routing fails, I can swap the node without rewriting unrelated orchestration code.&lt;/p&gt;

&lt;p&gt;From a mathematical angle, multinomial logistic regression estimates a convex problem under typical regularization assumptions. Convexity does not guarantee perfect generalization, but it does provide a stable training baseline when compared with some deep model training loops that require careful tuning of learning rates and batch sizes. Stability matters when you are iterating nightly on a side project without a cluster.&lt;/p&gt;
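
&lt;p&gt;To make the stability point concrete, here is a dependency-free sketch of logistic regression trained by gradient descent on the convex log-loss. It is binary for brevity where the repository uses the multinomial variant, and every name and constant here is my own illustration rather than the actual fit code.&lt;/p&gt;

```python
import math

# Dependency-free binary logistic regression trained by gradient descent.
# Illustrative sketch of the convexity point above (the repository uses
# multinomial logistic regression; names and constants are my own).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(xs)

def fit(xs, ys, lr=0.5, steps=200):
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        losses.append(log_loss(w, b, xs, ys))
        # average gradients of the convex log-loss
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
        gb = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / len(xs)
        w -= lr * gw
        b -= lr * gb
    return w, b, losses

# separable toy data: negative x maps to class 0, positive x to class 1
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b, losses = fit(xs, ys)
```

&lt;p&gt;Because the objective is convex, any reasonable step size drives the loss monotonically toward the same optimum, which is exactly the kind of boring, predictable training behavior I want on a laptop.&lt;/p&gt;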

&lt;h2&gt;
  
  
  Edge cases I worry about even in a toy setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Multi-issue messages that require splitting or multi-label prediction. This PoC uses single-label classification only.&lt;/li&gt;
&lt;li&gt;Language diversity and informal spelling. The synthetic generator uses English templates with light noise, not multilingual corpora.&lt;/li&gt;
&lt;li&gt;Seasonal effects such as leaf pickup schedules or snow removal windows that change routing rules over time.&lt;/li&gt;
&lt;li&gt;Equity concerns when certain neighborhoods file more tickets simply because access channels differ. A routing model can inherit those structural biases if trained blindly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of those issues disappears because the code is short. They are reminders that a serious path forward requires collaboration with domain experts and ongoing monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ethics and data handling
&lt;/h2&gt;

&lt;p&gt;I chose synthetic data because citizen text can include names, addresses, and medical references even when the intake form is labeled as non-emergency. Public repositories are the wrong place for that material unless there is a rigorous governance process. In my experiments, synthetic templates let me discuss routing ideas without crossing privacy boundaries. If I ever move toward real data, I would treat consent, retention limits, and redaction as prerequisites rather than afterthoughts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future roadmap for myself
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduce a calibration layer so probability outputs map more reliably to operational thresholds.&lt;/li&gt;
&lt;li&gt;Explore multi-label classification for compound requests, likely with a different architecture than plain multinomial logistic regression.&lt;/li&gt;
&lt;li&gt;Add a human review queue simulation that measures how often uncertain predictions would escalate.&lt;/li&gt;
&lt;li&gt;Experiment with character n-grams or lightweight embeddings while keeping the repository easy to run on CPU hardware.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Deeper notes on TF-IDF and why I still reach for it
&lt;/h2&gt;

&lt;p&gt;Term frequency-inverse document frequency is not new. In my opinion, that is a feature rather than a flaw when you are writing about routing policies that must be explained in a public meeting. TF-IDF highlights discriminative words relative to the corpus without requiring GPU memory. It pairs naturally with linear models, and linear models yield coefficients that can be inspected if someone asks why a particular ticket leaned toward utilities instead of streets.&lt;/p&gt;

&lt;p&gt;I also appreciate the control it gives me over n-gram breadth. In this PoC, I allowed unigrams and bigrams through the vectorizer configuration in the modeling module. Bigrams capture short phrases such as “dog off leash” that unigrams might fragment. The trade-off is a larger feature space and a higher chance of spurious bigrams if the dataset is tiny. Because I generated hundreds of rows per class, the bigram signal remained reasonably stable across random seeds in my local tests.&lt;/p&gt;
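
&lt;p&gt;A minimal pure-Python approximation of that vectorizer configuration looks like the following. The repository presumably relies on a library vectorizer with an n-gram range of one to two; this sketch only illustrates how bigrams such as "off leash" enter the feature space.&lt;/p&gt;

```python
import math
from collections import Counter

# Pure-Python TF-IDF over unigrams plus bigrams. A sketch of the idea,
# not the repository's actual vectorizer configuration.

def unigrams_and_bigrams(tokens):
    grams = list(tokens)
    for i in range(len(tokens) - 1):
        grams.append(tokens[i] + " " + tokens[i + 1])
    return grams

def tfidf(docs):
    tokenized = [unigrams_and_bigrams(d.lower().split()) for d in docs]
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # terms present in every document get idf = log(1) = 0
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = [
    "dog off leash in park",
    "water main leak on street",
    "dog barking off leash",
]
vecs = tfidf(docs)
```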

&lt;p&gt;There is an honest limitation: TF-IDF does not understand paraphrase. If a citizen writes “hydrant leaking” versus “fire plug dripping,” the model might treat those as unrelated unless the training distribution includes both phrasings. In a real program, I would expect continuous vocabulary drift and periodic retraining. For this experimental article, I accepted that limitation and focused on making the training loop legible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I mean by supervised fine-tuning in this context
&lt;/h2&gt;

&lt;p&gt;When people say “supervised fine-tuning” around large language models, they usually mean updating many parameters on instruction data. Here, the phrase is intentionally more literal: supervised training of a classifier head on labeled examples. I use the SFT language because the staged story mirrors how larger agent stacks are discussed, even though the parameter count is tiny.&lt;/p&gt;

&lt;p&gt;The staging matters psychologically. In my experience, separating a baseline fit from a later refinement step helps me debug where a regression was introduced. If the second stage suddenly collapses accuracy, I know to inspect augmentation or duplication rates rather than the raw tokenizer. That kind of isolation is harder when everything happens inside one opaque fine-tuning run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preference alignment without a full DPO implementation
&lt;/h2&gt;

&lt;p&gt;Direct preference optimization and related algorithms deserve their place in the research landscape. They also deserve a caution label for small solo projects that cannot afford extensive hyperparameter sweeps. I chose a transparent approximation: duplicate chosen labels, sprinkle in occasional negation hints, and refit the classifier. The goal is not to reproduce a paper result. The goal is to capture the intuition that pairwise feedback shifts the decision boundary.&lt;/p&gt;

&lt;p&gt;If I squint, the augmentation resembles importance sampling toward reviewer-approved actions. If I squint less generously, it is just oversampling. Both perspectives are useful. The first perspective keeps me aligned with how I talk about learning from feedback in agent design. The second perspective keeps me humble about claims I can make in a public write-up.&lt;/p&gt;
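
&lt;p&gt;The oversampling half of that recipe can be sketched in a few lines. The function name mirrors the pipeline's apply_preference_alignment, but the body is my guess at the described behavior, and the negation-hint step mentioned above is omitted for brevity.&lt;/p&gt;

```python
import random

# Transparent approximation of preference alignment: each (text, chosen,
# rejected) pair re-adds the chosen row several times, nudging the refit
# decision boundary toward reviewer-approved labels. Sketch only; the
# rejected label is unused here.

def apply_preference_alignment(texts, labels, pairs, oversample_chosen=3, seed=0):
    rng = random.Random(seed)
    aug_texts, aug_labels = list(texts), list(labels)
    for text, chosen, _rejected in pairs:
        for _ in range(oversample_chosen):
            aug_texts.append(text)
            aug_labels.append(chosen)
    order = list(range(len(aug_texts)))
    rng.shuffle(order)  # avoid a block of duplicates at the tail
    return [aug_texts[i] for i in order], [aug_labels[i] for i in order]

pairs = [("pothole on main st", "streets", "utilities")]
aug_t, aug_l = apply_preference_alignment(["pothole on main st"], ["streets"], pairs)
```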

&lt;h2&gt;
  
  
  Evaluation choices and why macro averaging matters
&lt;/h2&gt;

&lt;p&gt;Accuracy is a convenient headline number, but it can lie when classes are imbalanced. Macro-averaged F1 computes metrics per class and averages them, giving rare departments a louder voice in the aggregate. In civic routing, rare classes are often the ones with the highest operational risk if misrouted.&lt;/p&gt;

&lt;p&gt;I also log both phases on the same holdout split to avoid accidental optimism from resampling the test set. In my experiments, holding the test set fixed while changing training augmentation is the minimum bar for a fair comparison. I mention this because it is easy to cheat yourself with a sloppy split when synthetic data feels harmless.&lt;/p&gt;
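
&lt;p&gt;Computing macro F1 by hand makes the imbalance argument tangible; this is a plain-Python illustration, not the repository's metrics code.&lt;/p&gt;

```python
# Macro F1: per-class F1 averaged with equal weight, so a single
# misrouted rare class drags the aggregate down even while accuracy
# stays high. No library dependency.

def macro_f1(y_true, y_pred):
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# nine common "streets" tickets routed correctly, one rare "parks"
# ticket misrouted: accuracy is 0.9 but macro F1 is roughly 0.47
y_true = ["streets"] * 9 + ["parks"]
y_pred = ["streets"] * 10
```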

&lt;h2&gt;
  
  
  Failure modes that showed up when I stress-tested my assumptions
&lt;/h2&gt;

&lt;p&gt;Even with templated text, I found ways to break my own mental model. For example, if I lowered the number of examples per class too far, variance spiked and the confusion matrix became noisy rather than instructive. If I pushed the preference pair rate too high, training time grew and the model began to overemphasize duplicated rows unless I watched the oversampling multiplier.&lt;/p&gt;

&lt;p&gt;Another failure mode is more human: if I describe this PoC to someone as “AI that solves 311,” I am overselling it. Language shapes expectations. I prefer to describe it as a routing policy prototype with explicit training stages and visible metrics. That framing keeps the conversation grounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring and observability in a hypothetical next phase
&lt;/h2&gt;

&lt;p&gt;If I were evolving this into a supervised pilot rather than a local script, I would want basic monitoring hooks even before considering fancy agents. At minimum, I would track label distribution over time, confidence histograms per department, and a sample of low-confidence predictions for manual review. None of that requires deep learning. It requires discipline.&lt;/p&gt;

&lt;p&gt;I would also log model versions next to dataset hashes. In my opinion, reproducibility is part of safety. When someone asks why routing changed in March, I want to point to a dataset diff and a training configuration diff, not a shrug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security considerations for intake interfaces
&lt;/h2&gt;

&lt;p&gt;Routing is only one layer. Real systems must handle authentication for staff tools, rate limits for public endpoints, and prompt-injection-like attacks where a citizen pastes instructions meant to confuse downstream automation. This PoC does not implement those protections. I am mentioning them because a public article about civic automation should not pretend the model exists in a vacuum.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessibility and channel fairness
&lt;/h2&gt;

&lt;p&gt;Not everyone files a request through the same channel. Phone, web, and mobile apps produce different kinds of text and different kinds of errors. A model trained predominantly on web forms might underperform on transcribed phone calls. I did not simulate those channels separately in this repository. From my perspective, that is a future split worth modeling if the goal ever stops being educational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing this PoC to large-model fine-tuning stories
&lt;/h2&gt;

&lt;p&gt;The high-level arc rhymes with bigger systems: train a policy, incorporate human preference signals, evaluate. The differences are scale, compute, and representation. Large models can generalize across phrasing with fewer explicit templates. Small models can be audited with a spreadsheet mindset. I am not arguing one replaces the other. I am arguing that practicing the arc at small scale sharpens the questions I ask when I read about large-scale alignment work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would measure in a more realistic dataset pilot
&lt;/h2&gt;

&lt;p&gt;If I ever graduate beyond synthetic templates, I would start with offline metrics on redacted logs, then move to shadow mode where predictions are logged but not acted upon, and only then consider limited automation with human escalation paths. Each gate exists to reduce the risk of silent harm. The sequencing is more important than any particular classifier architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal reflection on solo experimentation
&lt;/h2&gt;

&lt;p&gt;Working alone on this kind of spike has advantages and drawbacks. The advantage is speed. I can rename a module on a whim without coordinating across roles. The drawback is blind spots. I compensate by writing diagrams, running the same script under multiple seeds, and documenting limitations aggressively. That discipline does not eliminate bias, but it reduces the chance that I mistake a tidy synthetic world for the messiness of real operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional notes on plotting and communication
&lt;/h2&gt;

&lt;p&gt;Charts are not ornamentation here. They are a contract with the reader. When I compare two training phases side by side, I force myself to confront whether the second phase genuinely moved the metrics I claim matter. If the plot looks flat, I do not hand-wave. I explain why flatness might be acceptable, such as a separable dataset where both models saturate, or I revisit the augmentation recipe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducibility checklist I used while preparing this article
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Pin random seeds in data generation and model fitting.&lt;/li&gt;
&lt;li&gt;Keep the evaluation split stable across training variants.&lt;/li&gt;
&lt;li&gt;Store charts as files so visual results are reviewable without rerunning.&lt;/li&gt;
&lt;li&gt;Avoid nondeterministic operations where possible, and accept that some BLAS operations may still introduce tiny drift.&lt;/li&gt;
&lt;/ol&gt;
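
&lt;p&gt;Items 1 and 2 of that checklist reduce to a split that is a pure function of its inputs; this is a sketch with an illustrative function name, not the repository's splitter.&lt;/p&gt;

```python
import random

# A pinned seed and a local RNG make the train/test partition identical
# on every run, with no dependence on global random state.

def seeded_split(rows, test_frac=0.2, seed=42):
    rng = random.Random(seed)  # local RNG, no global state touched
    order = list(range(len(rows)))
    rng.shuffle(order)
    n_test = int(len(rows) * test_frac)
    train = [rows[i] for i in order[n_test:]]
    test = [rows[i] for i in order[:n_test]]
    return train, test

rows = [f"ticket {i}" for i in range(100)]
a = seeded_split(rows)
b = seeded_split(rows)  # identical partition on every run
```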

&lt;h2&gt;
  
  
  How I think about versioning datasets and models together
&lt;/h2&gt;

&lt;p&gt;One habit I picked up from earlier experiments is to treat datasets like code. If I change a template string in the synthetic generator, I should think of that as a dataset version bump even when the Git commit message talks about “just a wording tweak.” Small wording changes can shift term frequencies enough to alter which n-grams dominate. In a toy project, the stakes are low. In a pilot, the stakes are higher because departments may rely on trend lines that assume comparable distributions over time.&lt;/p&gt;

&lt;p&gt;When I snapshot metrics, I try to record not only accuracy and macro F1 but also the training row counts before and after augmentation. Row counts tell a story about how aggressively preference duplication reshaped the effective loss landscape. If the augmented dataset balloons by an order of magnitude, I expect different regularization needs even within linear models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calibration and confidence: what I wish I added sooner
&lt;/h2&gt;

&lt;p&gt;Probabilities from logistic regression can be overconfident, especially when features separate cleanly. I did not implement Platt scaling or isotonic regression in this repository because I wanted to keep the first iteration narrow. Looking back, a calibration section would make the PoC more instructive for readers who want to map scores to “send to human review if below threshold” workflows. That mapping is where many real systems spend their engineering time.&lt;/p&gt;

&lt;p&gt;If I add calibration later, I would hold out a separate calibration split to avoid information leakage from evaluation metrics. The distinction sounds pedantic until you realize how easy it is to accidentally tune thresholds on the same rows you report as performance.&lt;/p&gt;
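
&lt;p&gt;A hypothetical three-way split captures that discipline: thresholds get tuned on a calibration slice that never overlaps the reported test set. The fractions here are arbitrary choices for the sketch.&lt;/p&gt;

```python
import random

# Train / calibration / test split so threshold tuning never touches
# the rows used for reported metrics. Fractions are assumptions.

def three_way_split(rows, cal_frac=0.15, test_frac=0.15, seed=7):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_cal = int(len(rows) * cal_frac)
    n_test = int(len(rows) * test_frac)
    test = rows[:n_test]
    cal = rows[n_test:n_test + n_cal]
    train = rows[n_test + n_cal:]
    return train, cal, test

train, cal, test = three_way_split(range(200))
```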

&lt;h2&gt;
  
  
  Narrative lessons from building the terminal output
&lt;/h2&gt;

&lt;p&gt;The ASCII table formatting took longer than expected relative to its mathematical complexity. That is common when polish matters. I wanted the output to resemble a batch report because, in my experience, stakeholders trust artifacts that look like logs they already read. A wall of unstructured print statements signals hobby project. A bordered table signals intentionality.&lt;/p&gt;

&lt;p&gt;The same principle applies to README quality. A repository with crisp diagrams and a clear run command earns attention in a way that scattered scripts do not. I am not claiming aesthetics replace correctness. I am claiming clarity reduces friction when someone else tries to reproduce your work months later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would test if I add unit tests in a follow-up commit
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Label integrity: every generated label must be a member of the known department enumeration.&lt;/li&gt;
&lt;li&gt;Deterministic splits: with a fixed seed, the train and test partitions should be identical across runs.&lt;/li&gt;
&lt;li&gt;Metric sanity: accuracy should fall between zero and one, and macro F1 should not exceed one.&lt;/li&gt;
&lt;li&gt;Augmentation invariants: preference augmentation should never drop rows below the base training size unless explicitly intended.&lt;/li&gt;
&lt;/ol&gt;
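
&lt;p&gt;The four checks above translate into plain assertions, ready to drop into a pytest file. DEPARTMENTS and every signature are stand-ins for the repository's real names.&lt;/p&gt;

```python
# The four invariants as plain assertions. All names are stand-ins.

DEPARTMENTS = {"streets", "utilities", "parks", "animal_control"}

def check_label_integrity(labels):
    assert all(l in DEPARTMENTS for l in labels), "unknown department"

def check_deterministic_split(split_fn, rows):
    assert split_fn(rows) == split_fn(rows), "split is not reproducible"

def check_metric_sanity(value):
    # a metric is in range exactly when it equals its clamp to [0, 1]
    assert value == max(0.0, min(1.0, value)), "metric out of range"

def check_augmentation_invariant(base_rows, augmented_rows):
    # augmentation may add rows but must never shrink the training set
    assert len(augmented_rows) == max(len(base_rows), len(augmented_rows))
```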

&lt;p&gt;Tests like these are small, but they catch embarrassing regressions when refactoring. They also document assumptions for future me, who will not remember why a function behaved a certain way on a late-night edit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communication boundaries when writing about civic technology
&lt;/h2&gt;

&lt;p&gt;Public sector language is sensitive. I avoided describing any real jurisdiction, and I avoided implying that a municipality endorsed this work. I also avoided framing the PoC as a replacement for human intake workers. In my opinion, the best technical articles in this space acknowledge labor realities. Routing assistance should reduce repetitive triage, not erase human judgment from escalations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I kept the stack small even when larger libraries are available
&lt;/h2&gt;

&lt;p&gt;There is a cultural pull toward using the newest toolkit on every project. I understand the impulse. I also know that dependency weight matters for readers who clone a repository on a work laptop with limited install privileges. scikit-learn and Matplotlib are widely approved in enterprise environments compared with some deep learning stacks. That practical fact influenced my choices as much as modeling purity did.&lt;/p&gt;

&lt;h2&gt;
  
  
  A longer note on class balance and synthetic generation
&lt;/h2&gt;

&lt;p&gt;Balanced classes per department make classroom demonstrations easier, but they can mislead you about deployment conditions where some routes are rare. I balanced classes here because I wanted clean learning curves while iterating on augmentation logic. If I simulate imbalance later, I would adjust metrics accordingly and probably introduce class weights or resampling strategies. The point is not to chase one recipe forever. The point is to match the evaluation setup to the question being asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I would document an escalation policy in a future iteration
&lt;/h2&gt;

&lt;p&gt;An escalation policy belongs in prose first, then in code. For example, if confidence is below a threshold, route to a human queue and attach the top three candidate departments with scores. If two departments are within a small margin, attach both and avoid pretending the model is decisive. Writing those rules down forces me to confront ambiguity instead of hiding behind a single argmax label.&lt;/p&gt;
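
&lt;p&gt;Turning that prose into code might look like the following. The threshold, margin, and return shape are assumptions for illustration, not values from the repository.&lt;/p&gt;

```python
# Escalation rule: low confidence or a near-tie routes to a human queue
# with candidate departments attached. Constants are assumptions.

def route(probs, threshold=0.6, margin=0.05):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_dept, top_p = ranked[0]
    runner_up_gap = top_p - ranked[1][1]
    # true exactly when top_p falls below the confidence threshold
    if max(threshold, top_p) != top_p:
        return {"action": "escalate", "candidates": ranked[:3]}
    # true exactly when the runner-up sits inside the tie margin
    if max(margin, runner_up_gap) != runner_up_gap:
        return {"action": "escalate", "candidates": ranked[:2]}
    return {"action": "auto_route", "department": top_dept}

decision = route({"streets": 0.86, "utilities": 0.09, "parks": 0.05})
```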

&lt;h2&gt;
  
  
  Reflection on reading research papers versus shipping small prototypes
&lt;/h2&gt;

&lt;p&gt;Reading about alignment methods is different from wiring even a simplified version into a repository. The distance between the two activities used to frustrate me. Over time, I reframed it. A simplified implementation is not a shallow imitation if the goal is to build intuition. The CivicTriage-AI PoC is my attempt to keep the wiring honest while staying within evenings-and-weekends effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I track mentally when comparing training phases
&lt;/h2&gt;

&lt;p&gt;Beyond headline metrics, I watch training row counts, augmentation counts, and whether the second model remains stable on obvious base cases. If the second model degrades on obvious cases, that is a sign that augmentation introduced conflicting signal or that duplication overwhelmed the original distribution. In my experiments, monitoring both phases on the same holdout made those conversations concrete rather than speculative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;This repository is a personal spike, not a recommendation for any city to adopt wholesale. I wrote it to practice structuring a training narrative that moves from supervised learning to preference-informed refinement without losing transparency. Along the way, I re-learned that the most persuasive demos are often the ones where the math is simple enough to inspect and the limitations are stated plainly.&lt;/p&gt;

&lt;p&gt;If you fork the code, treat it as a starting sketch. Replace the synthetic generator with data that matches your governance constraints, expand evaluation beyond accuracy, and connect predictions to real workflows only after you have a monitoring plan. From my perspective, that is the difference between an educational artifact and a responsible pilot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;All source code and visual assets for this experimental article are available at &lt;a href="https://github.com/aniket-work/CivicTriage-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/CivicTriage-AI&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: python, machinelearning, agents, civtech&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>agents</category>
      <category>civtech</category>
    </item>
    <item>
      <title>Architecting Guardian-AI: Multi-Layered Content Integrity Filters for Autonomous Publishing</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Sun, 29 Mar 2026 05:27:18 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/architecting-guardian-ai-multi-layered-content-integrity-filters-for-autonomous-publishing-58fc</link>
      <guid>https://forem.com/exploredataaiml/architecting-guardian-ai-multi-layered-content-integrity-filters-for-autonomous-publishing-58fc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I Built a Defensive Content Pipeline to Safeguard AI-Generated Media Against Misinformation and Adversarial Injections&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1696qu19tnzth4slctom.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1696qu19tnzth4slctom.gif" alt="Title" width="1000" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;In my experiments with autonomous publishing, I discovered that LLMs, while powerful, are highly susceptible to adversarial injections and factual hallucinations. To solve this, I designed Guardian-AI—a multi-layered filter swarm that audits content through four distinct integrity layers: Injection Detection, Fact-Checking, Plagiarism Auditing, and Ethics Compliance. This experimental PoC demonstrates how a sequential defense-in-depth strategy can significantly harden AI-generated workflows against sophisticated attacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;From my experience, the transition from 'AI as a tool' to 'AI as an autonomous publisher' is fraught with hidden risks that most organizations aren't prepared for. I observed that simply asking an LLM to 'be safe' isn't enough; adaptive paraphrasing and adversarial prompt attacks can easily bypass single-layer system prompts. I wrote this article because I believe we need a more robust, architectural approach to content safety.&lt;/p&gt;

&lt;p&gt;The way I see it, content integrity is the new perimeter. In my opinion, as we move toward agents that generate and publish media without human-in-the-loop oversight, the responsibility for truth and safety shifts from the editor to the infrastructure. I spent weeks experimenting with various filtering strategies, and it taught me that the most effective defense is a multi-layered swarm where specialized agents audit one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article is a deep-dive into my personal experiments building Guardian-AI. I’ll walk you through the design decisions I made while creating a multi-layered defensive pipeline for media publishing. We will explore the technical implementation of four specific filter layers and how they work together to form a resilient 'integrity swarm.' &lt;/p&gt;

&lt;p&gt;From where I stand, the goal isn't just to stop 'bad' words, but to detect intent and verify truth. I put it this way because the threats we face today—like 'jailbreaking' LLMs to output misinformation—require more than just a list of banned keywords. This is an experimental PoC, and I'm sharing it to contribute to the discussion on building safer autonomous systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;Based on my testing, I chose a Python-heavy stack for its flexibility and rich ecosystem of NLP tools. Here is what I used for this experiment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt;: The backbone of the engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Regex &amp;amp; Heuristic Engines&lt;/strong&gt;: Used in the Injection Sentinel for low-latency pattern matching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulated Knowledge Bases&lt;/strong&gt;: To represent the 'Fact-Check' data layer without the complexity of a live API in this PoC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mermaid.js&lt;/strong&gt;: For architecting and visualizing the agent communication flows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pillow (PIL)&lt;/strong&gt;: For generating high-fidelity terminal animations that act as the technical documentation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you are someone who worries about the scalability of misinformation or the fragility of autonomous agents, this article is for you. I think you'll find the design patterns here useful for any pipeline that moves data from an LLM to a public-facing interface. &lt;/p&gt;

&lt;p&gt;I wrote this specifically for engineers who want to go beyond simple prompting. I put a lot of thought into how the layers interact, and I share those insights here. Whether you're building a news bot, a corporate comms agent, or just exploring the boundaries of AI safety, the lessons I learned in this experiment will help you build more defensible systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;p&gt;When I started designing the architecture, my first thought was: 'Sequence is security.' I decided that the filters should run in a specific order, moving from the most computationally cheap (regex-based injection checks) to the most complex (context-aware compliance).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Guardian-AI Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2cw4fwfc4ihsvazg5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2cw4fwfc4ihsvazg5i.png" alt="Architecture" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I structured the system as a 'Chain of Trust.' Each layer must emit an 'APPROVED' signal before the next layer even begins its analysis. This design decision serves two purposes. First, it saves compute costs—if an injection is detected at Layer 1, there's no reason to fact-check the rest of the garbage output. Second, it provides a clean audit trail.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Swarm Interaction
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsz3vjszq2i21puyej78b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsz3vjszq2i21puyej78b.png" alt="Sequence" width="669" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my view, the sequence diagram above highlights why this approach works. It isn't just a single check; it's a conversation between the content and multiple auditors. I implemented this as a swarm because I found that specialized agents are much better at their specific tasks than a single 'generalist' safety prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;Now, let's dive into the implementation. I'll share the most critical blocks of code that I wrote for this experiment and explain the rationale behind them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Integrity Engine
&lt;/h3&gt;

&lt;p&gt;This is the central nervous system of Guardian-AI. I wrote this to orchestrate the filters and handle the 'halting problem'—stopping the pipeline immediately on a critical failure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GuardianEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;InjectionSentinel&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="nc"&gt;FactCheckFilter&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="nc"&gt;PlagiarismAuditor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="nc"&gt;EthicsComplianceLayer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;audit_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# I designed this to be sequential and highly verbose
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filter_layer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filter_layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filter_layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROVED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What This Does:&lt;/strong&gt; It iterates through the list of filters and calls their &lt;code&gt;process&lt;/code&gt; method. &lt;br&gt;
&lt;strong&gt;Why I Structured It This Way:&lt;/strong&gt; I chose a sequential iteration because I wanted to ensure that the most basic safety checks (Injection Sentinel) were handled before anything else.&lt;br&gt;
&lt;strong&gt;What I Learned:&lt;/strong&gt; From my observation, error propagation is cleaner when you exit early. I discovered that trying to run these in parallel made it harder to provide a clear 'REJECTION' reason to the upstream caller.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Injection Sentinel
&lt;/h3&gt;

&lt;p&gt;This was the most challenging layer to design. I found that simple string matching wasn't enough on its own, so for this PoC I combined heuristic patterns with basic intent detection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InjectionSentinel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseFilter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore previous instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system bypass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reveal internal prompts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;content_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.98&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Clear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What This Does:&lt;/strong&gt; It scans the generated content for known adversarial patterns that indicate a successful 'jailbreak.'&lt;br&gt;
&lt;strong&gt;Why This Works:&lt;/strong&gt; In my opinion, even advanced LLMs often fall back to these specific phrases when compromised. By catching the 'output' of a bypass, we protect the 'input' of the publishing system.&lt;br&gt;
&lt;strong&gt;Personal Insight:&lt;/strong&gt; I put it this way because we often focus on input filtering, but I think 'output auditing' is the true safety net.&lt;/p&gt;


&lt;h2&gt;
  
  
  Deep Dive: The Philosophy of Multi-Layered Defense
&lt;/h2&gt;

&lt;p&gt;From my experience, the 'Swiss Cheese Model' of safety is perfectly applicable to AI systems. I observed that every layer of defense has holes, but when you stack them, the holes rarely align. I think this is the only way to build truly autonomous systems that we can trust with brand reputation.&lt;/p&gt;

&lt;p&gt;The first hole is the LLM itself. Even with 100 pages of system instructions, an LLM remains a probabilistic next-token generator. It doesn't 'know' it's being attacked; it just follows the most likely statistical path. I found that by adding an external auditor—the Injection Sentinel—we move the safety logic outside the 'statistical black box.'&lt;/p&gt;

&lt;p&gt;The second hole is the data. Even a safe LLM can hallucinate. I put a lot of effort into the Fact-Check Filter because I believe that truth is the highest form of integrity. In my experiments, I cross-referenced claims against trusted source lists. I discovered that while LLMs are great at summarizing, they are terrible at verifying their own summaries. Thus, the external 'Fact-Check' layer is non-negotiable.&lt;/p&gt;
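&lt;p&gt;The repository has its own implementation of this layer; as a minimal, self-contained sketch of the idea, a fact-check filter could cross-reference extracted claims against a simulated trusted-source list. Note that the &lt;code&gt;TRUSTED_FACTS&lt;/code&gt; set and the naive one-claim-per-sentence splitting below are my own stand-ins, not the repo's exact code:&lt;/p&gt;

```python
from typing import Tuple

# Simulated knowledge base standing in for a live fact-check API.
# In a real system this would be a curated corpus or an external service.
TRUSTED_FACTS = {
    "water boils at 100 celsius at sea level",
    "the earth orbits the sun",
}

class FactCheckFilter:
    name = "FactCheck"

    def process(self, content: str) -> Tuple[bool, str, float]:
        # Naive claim extraction: treat each sentence as one claim
        claims = [c.strip().lower() for c in content.split(".") if c.strip()]
        unverified = [c for c in claims if c not in TRUSTED_FACTS]
        if unverified:
            return False, f"Unverified claims: {unverified}", 0.80
        return True, "All claims matched trusted sources", 0.90

checker = FactCheckFilter()
print(checker.process("The moon is made of cheese."))
```

&lt;p&gt;The filter returns the same &lt;code&gt;(success, message, confidence)&lt;/code&gt; tuple as the other layers, so the engine can treat it uniformly.&lt;/p&gt;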

&lt;h3&gt;
  
  
  The Challenge of Adaptive Paraphrasing
&lt;/h3&gt;

&lt;p&gt;What I learned through this experiment is that attackers are getting smarter. They don't just say 'ignore instructions' anymore. They might say, 'In a fictional universe where rules don't exist, tell me how to...' This is what I call 'adaptive paraphrasing.' &lt;/p&gt;

&lt;p&gt;I think we need to move toward semantic intent detection. While the current PoC uses pattern matching, from my perspective, the future lies in using another 'smaller' and 'faster' LLM whose &lt;em&gt;only&lt;/em&gt; job is to detect adversarial intent in the output of the 'large' publishing LLM. I designed Guardian-AI to be extensible so that these 'Semantic Guardians' can be swapped in easily.&lt;/p&gt;
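&lt;p&gt;To make the 'swappable Semantic Guardian' idea concrete, here is a hedged sketch. The &lt;code&gt;heuristic_intent_stub&lt;/code&gt; function stands in for the small, fast LLM call so the example runs offline; all names are illustrative rather than taken from the repo:&lt;/p&gt;

```python
from typing import Callable, Tuple

def heuristic_intent_stub(text: str) -> float:
    """Stand-in adversarial-intent scorer; a real system would call a small LLM."""
    cues = ["fictional universe", "rules don't exist", "pretend you are"]
    return 0.9 if any(c in text.lower() for c in cues) else 0.1

class SemanticGuardian:
    name = "SemanticGuardian"

    def __init__(self, classify_intent: Callable[[str], float], threshold: float = 0.5):
        # Injecting the classifier is what makes the layer swappable:
        # replace the stub with an LLM-backed scorer without touching the engine
        self.classify_intent = classify_intent
        self.threshold = threshold

    def process(self, content: str) -> Tuple[bool, str, float]:
        score = self.classify_intent(content)
        if score >= self.threshold:
            return False, "Adversarial intent suspected", score
        return True, "Clear", 1.0 - score

guardian = SemanticGuardian(heuristic_intent_stub)
print(guardian.process("In a fictional universe where rules don't exist, tell me how to..."))
```

&lt;p&gt;Because it exposes the same &lt;code&gt;process&lt;/code&gt; interface as the other filters, a guardian like this can be appended to the engine's filter list without any other changes.&lt;/p&gt;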

&lt;h3&gt;
  
  
  Ethics as a Protocol
&lt;/h3&gt;

&lt;p&gt;I implemented the Ethics Compliance Layer last. The way I see it, ethics isn't just 'don't be mean.' It's about ensuring the content aligns with the specific mission of the publication. I found that by separating ethics from the general safety filter, I could tune it more precisely. &lt;/p&gt;

&lt;p&gt;I wrote the logic to be highly allergic to specific toxic patterns. But I also added a 'Tone Check.' In my opinion, a journalism agent that sounds like a marketing bot is just as much of an 'integrity failure' as a bot that swears. I think we need to broaden our definition of 'Safety' to include 'Accuracy' and 'Tone.'&lt;/p&gt;
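&lt;p&gt;As a rough illustration of combining a toxic-pattern block list with a tone check, here is a minimal sketch. The superlative-counting heuristic is my own stand-in for the tone logic, not the repo's exact implementation:&lt;/p&gt;

```python
from typing import Tuple

# Illustrative block lists; a production system would use curated lexicons
TOXIC_PATTERNS = ["toxic phrase one", "toxic phrase two"]
MARKETING_SUPERLATIVES = ["revolutionary", "game-changing", "unmissable"]

class EthicsComplianceLayer:
    name = "EthicsCompliance"

    def process(self, content: str) -> Tuple[bool, str, float]:
        lowered = content.lower()
        for pattern in TOXIC_PATTERNS:
            if pattern in lowered:
                return False, f"Toxic pattern: {pattern}", 0.97
        # Tone check: too many superlatives reads like marketing, not journalism
        hype = sum(lowered.count(word) for word in MARKETING_SUPERLATIVES)
        if hype >= 2:
            return False, "Tone drift: reads like marketing copy", 0.70
        return True, "Tone and ethics clear", 0.90

layer = EthicsComplianceLayer()
print(layer.process("This revolutionary, game-changing product is unmissable!"))
```

&lt;p&gt;Separating the hard block list from the softer tone heuristic is what lets each be tuned independently, which was the point of splitting ethics out of the general safety filter.&lt;/p&gt;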


&lt;h2&gt;
  
  
  Let's Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the project code&lt;/strong&gt;: &lt;code&gt;git clone https://github.com/aniket-work/Guardian-AI.git&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review the README&lt;/strong&gt;: I put instructions for the virtual environment there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the images&lt;/strong&gt;: The &lt;code&gt;images/&lt;/code&gt; directory contains all the diagrams I used in this article.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;Run the simulation with &lt;code&gt;python main.py&lt;/code&gt;. You'll see the Guardian swarm in action, rejecting adversarial attacks and approving safe content in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This experiment taught me that we are still in the early days of autonomous safety. I put this PoC together to prove that we can build robust systems with today's tools, provided we think architecturally. In my opinion, the future of AI isn't just better models, but better swarms.&lt;/p&gt;

&lt;p&gt;I hope you found this deep dive useful. From where I stand, the more we share these 'experimental articles,' the faster we collectively learn how to build a safe AI future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aniket-work/Guardian-AI" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: ai, python, cybersecurity, deeplearning&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>cybersecurity</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Building an Autonomous Data Pipeline Sentinel with Hierarchical Memory</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Tue, 24 Mar 2026 04:55:57 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/building-an-autonomous-data-pipeline-sentinel-with-hierarchical-memory-1p8d</link>
      <guid>https://forem.com/exploredataaiml/building-an-autonomous-data-pipeline-sentinel-with-hierarchical-memory-1p8d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyce20mjc4q8hgnp08yu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyce20mjc4q8hgnp08yu.gif" alt="Title Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Subtitle: How I Architected a Persistent PR Defense System Using FAISS, SQLite, and Automated Memory Consolidation&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;In my recent experiments, I built &lt;strong&gt;DataPipeline-Sentinel&lt;/strong&gt;, a persistent OS for autonomous data pipeline incident management.&lt;/li&gt;
&lt;li&gt;I utilized a 4-tier Hierarchical Memory System (Context, Semantic, Episodic, Declarative) to enable genuine machine learning from past incidents.&lt;/li&gt;
&lt;li&gt;By combining FAISS for vector retrieval and SQLite for immutable logging, the agent instantly recalls resolved pipeline errors.&lt;/li&gt;
&lt;li&gt;I created a nightly Memory Consolidation background job to distill hundreds of raw logs into hard-coded declarative rules.&lt;/li&gt;
&lt;li&gt;This architecture shifts AI agents from stateless script-kiddies into seasoned, senior-level operators. All code is available in my public repository &lt;a href="https://github.com/aniket-work/DataPipeline-Sentinel" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw1umohcgecxpl5jlfn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw1umohcgecxpl5jlfn7.png" alt="Architecture Overview" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I observed a recurring nightmare in modern data engineering: pipelines break, engineers diagnose the issue, they apply a fix (like tweaking a Spark schema inference), and then... everyone forgets. Three months later, the exact same upstream API changes its payload structure, a different pipeline shatters, and a new engineer wastes four hours diagnosing the exact same root cause. &lt;/p&gt;

&lt;p&gt;From my experience, stateless autonomous agents using standard RAG (Retrieval-Augmented Generation) aren't enough to solve this. If you just feed an LLM a static playbook, it never learns from the &lt;em&gt;nuances&lt;/em&gt; of daily operational chaos. I thought about how human senior engineers operate: they have an instinct derived from thousands of past, painful outages. They remember the &lt;em&gt;episodic&lt;/em&gt; pain of a MongoDB schema drift causing an Airflow DAG to hang.&lt;/p&gt;

&lt;p&gt;I wanted to replicate this. I put it this way because I realized we don't just need agents that can read docs; we need agents that can &lt;em&gt;remember experiences&lt;/em&gt;. Thus, the idea for the &lt;strong&gt;DataPipeline-Sentinel&lt;/strong&gt; was born—an experimental PoC of an EverMem-style persistent AI Agent OS that learns chronologically from production incidents and consolidates that knowledge into permanent operational wisdom.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article breaks down how I developed a Persistent Memory Operating System for an autonomous agent. I am not focusing on the specific LLM prompts. Instead, in my opinion, the fascinating part is the &lt;strong&gt;Memetic Architecture&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I will walk you through building a system that features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Short-Term Context Buffers&lt;/strong&gt;: For active incident triaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Memory (FAISS)&lt;/strong&gt;: To instantly find mathematically similar past outages using high-dimensional vector embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic Memory (SQLite)&lt;/strong&gt;: An immutable, append-only ledger of everything the agent and human engineers have ever done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declarative Memory (SQLite)&lt;/strong&gt;: Firm, hard-coded constraints logically deduced from episodic logs during an automated "sleep cycle."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't about generic coding. It's about designing a cognitive architecture that allows an AI operator to organically accumulate seniority over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;To keep this experimental PoC lean, I avoided heavy vector databases or complex graph tools. My setup is purposefully brutalist and highly effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.12&lt;/strong&gt;: The core orchestrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS (CPU)&lt;/strong&gt;: Facebook's incredibly fast library for similarity search and clustering of dense vectors. I use this exclusively for Semantic Memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt;: The unsung hero of persistent storage. I use SQLite to maintain both the Episodic event logs and the Declarative rule tables. It is lightweight, zero-configuration, and ACID compliant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich&lt;/strong&gt;: For hyper-readable, beautiful terminal output simulating the agent's internal monologue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pillow &amp;amp; Mermaid.js&lt;/strong&gt;: For all the visual diagramming and UI mockups.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;In my opinion, the AI industry is overly obsessed with context windows. "Just shove 1 million tokens into context and it will figure it out!" No. From my experience, shoving endless logs into a prompt is computationally wasteful and mathematically noisy. &lt;/p&gt;

&lt;p&gt;You should read this if you want to understand how to build &lt;em&gt;Systems of Record&lt;/em&gt; for autonomous agents. If you are trying to build an agent that handles customer support, financial auditing, or infrastructure monitoring, you will eventually hit the "amnesia wall." Your agent will solve a complex edge case on Tuesday and completely forget how to do it by Thursday. &lt;/p&gt;

&lt;p&gt;This article provides the exact architectural blueprint to break through that wall. By implementing an automated consolidation layer, I've proven (at least in this PoC) that we can programmatically convert chaotic daily experiences into rigid, institutional knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The 4-Tier Cognitive Hierarchy
&lt;/h3&gt;

&lt;p&gt;When I designed this architecture, I thought deeply about human memory psychology and applied it directly to Python objects. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lpqhu5s4kgdd45z0k65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lpqhu5s4kgdd45z0k65.png" alt="Sequence Diagram" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Working Context (RAM)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: What I'm thinking about right now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: A standard Python list queue (&lt;code&gt;self.short_term_buffer&lt;/code&gt;) capped at 10 items. It holds the active error stack trace and the active pipeline name. Once the issue is resolved, this buffer is cleared.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Memory (FAISS)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: My vague intuitive sense that "I've seen this error before."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: Every time a pipeline error occurs, it is embedded into a 1536-dimensional vector and stored in &lt;code&gt;faiss.IndexFlatL2&lt;/code&gt;. If a new error comes in, I run a nearest-neighbor search (&lt;code&gt;self.index.search()&lt;/code&gt;) to pull the top 3 most similar historical errors (note that &lt;code&gt;IndexFlatL2&lt;/code&gt; ranks by Euclidean distance, not cosine similarity). Over time, FAISS acts as the agent's intuition. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Episodic Memory (SQLite &lt;code&gt;episodic_memory&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: My chronological journal of every outage I've ever fought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: An append-only relational table. Columns include &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, and &lt;code&gt;consolidated&lt;/code&gt;. Crucially, this table stores the &lt;em&gt;Resolutions&lt;/em&gt;—what the human engineer ultimately did to fix the pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Declarative Memory (SQLite &lt;code&gt;declarative_memory&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: The hard-coded rules written in the employee handbook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: A curated table of strict facts (e.g., "Fact Type: pipeline_fix, Content: Use infer_schema=True for Mongo syncs"). The agent queries this table purely by SQL &lt;code&gt;WHERE&lt;/code&gt; clauses, entirely bypassing fuzzy vector math. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
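&lt;p&gt;To show the recall path end to end, here is a self-contained sketch. NumPy stands in for &lt;code&gt;faiss.IndexFlatL2&lt;/code&gt; (same L2 nearest-neighbor semantics) so it runs without FAISS installed, and &lt;code&gt;embed&lt;/code&gt; is a toy character-histogram embedder, not a real 1536-dimensional model:&lt;/p&gt;

```python
import numpy as np

VECTOR_DIM = 8  # toy dimension; the article uses 1536

def embed(text: str) -> np.ndarray:
    """Deterministic toy embedding: normalized character histogram."""
    vec = np.zeros(VECTOR_DIM, dtype="float32")
    for ch in text.lower():
        vec[ord(ch) % VECTOR_DIM] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

# Historical errors already "stored", keyed by their SQLite episodic ids
history = {
    101: "MongoDB schema drift broke Airflow DAG ingest",
    102: "Spark schema inference failed on nested JSON payload",
    103: "S3 credentials expired for nightly export",
}
matrix = np.stack([embed(t) for t in history.values()])
id_mapping = dict(enumerate(history.keys()))  # FAISS row -> SQLite id

def recall(query: str, k: int = 3):
    """Brute-force L2 search, mirroring what index.search() does in FAISS."""
    q = embed(query)
    dists = np.linalg.norm(matrix - q, axis=1)
    rows = np.argsort(dists)[:k]
    return [(id_mapping[int(r)], history[id_mapping[int(r)]]) for r in rows]

print(recall("Airflow DAG hung after MongoDB schema change", k=2))
```

&lt;p&gt;The returned SQLite ids are then used to fetch the full episodic rows, including the resolutions, from the relational store.&lt;/p&gt;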

&lt;h3&gt;
  
  
  The Engine of Evolution: Memory Consolidation
&lt;/h3&gt;

&lt;p&gt;This is the secret sauce. In my experiments, I realized Episodic Memory grows infinitely and becomes garbage. You don't want the agent reading 10,000 raw logs of humans fixing pipelines. &lt;/p&gt;

&lt;p&gt;I wrote a &lt;code&gt;consolidation.py&lt;/code&gt; script—a cron job simulating human sleep. It runs at midnight, performs a SQL query for all logs where &lt;code&gt;consolidated = 0&lt;/code&gt;, uses an LLM to extract a generalized rule from the specific incident, writes to Declarative memory, and updates the flag to &lt;code&gt;consolidated = 1&lt;/code&gt;. &lt;/p&gt;
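&lt;p&gt;The core of that sleep cycle can be sketched in a few lines. The table columns follow the schema described in this article, but &lt;code&gt;distill_rule&lt;/code&gt; stubs the LLM call and the rest is illustrative rather than the exact &lt;code&gt;consolidation.py&lt;/code&gt; code:&lt;/p&gt;

```python
import sqlite3

def distill_rule(incident: str) -> str:
    """Stub for the LLM step that generalizes a specific incident into a rule."""
    return f"RULE: avoid recurrence of '{incident}'"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE episodic_memory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp DATETIME, event_type TEXT, content TEXT,
        metadata TEXT, consolidated BOOLEAN DEFAULT 0);
    CREATE TABLE declarative_memory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        fact_type TEXT, content TEXT);
    INSERT INTO episodic_memory (event_type, content)
        VALUES ('resolution', 'Use infer_schema=True for Mongo syncs');
""")

def run_consolidation(conn: sqlite3.Connection) -> int:
    """Distill every unconsolidated episode into a declarative rule."""
    rows = conn.execute(
        "SELECT id, content FROM episodic_memory WHERE consolidated = 0").fetchall()
    for row_id, content in rows:
        conn.execute(
            "INSERT INTO declarative_memory (fact_type, content) VALUES (?, ?)",
            ("pipeline_fix", distill_rule(content)))
        conn.execute(
            "UPDATE episodic_memory SET consolidated = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

print(run_consolidation(conn))  # number of episodes consolidated this cycle
```

&lt;p&gt;Because the &lt;code&gt;consolidated&lt;/code&gt; flag is flipped inside the same loop, re-running the job is idempotent: already-distilled episodes are never processed twice.&lt;/p&gt;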

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jdxle3phcypa8ik37mf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jdxle3phcypa8ik37mf.png" alt="Flow Diagram" width="336" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;I structured the project strictly around separation of concerns. The OS handles persistence, the Agent handles logic, and the Consolidator handles background distillation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Establishing the Hierarchical Memory OS
&lt;/h3&gt;

&lt;p&gt;Let's look at how I implemented the core &lt;code&gt;HierarchicalMemoryOS&lt;/code&gt; class combining SQLite and FAISS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HierarchicalMemoryOS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_os.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_dim&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialize SQLite (Episodic &amp;amp; Declarative)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialize FAISS (Semantic)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Map FAISS vector IDs back to SQLite Episodic IDs
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_faiss_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I put it this way because managing two completely different storage paradigms (vectors in RAM/on disk and relational rows) requires a tight unifying class. The &lt;code&gt;id_mapping&lt;/code&gt; dict bridges the gap between FAISS's integer vector IDs and the SQLite primary keys.&lt;/em&gt;&lt;/p&gt;
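&lt;p&gt;&lt;em&gt;The &lt;code&gt;embed_and_store_semantic&lt;/code&gt; method that maintains this mapping is referenced later but not shown. Here is a minimal sketch of just the bookkeeping at its core; the class and method names are mine, and the actual &lt;code&gt;index.add(vector)&lt;/code&gt; call is reduced to a counter so the sketch runs without FAISS installed:&lt;/em&gt;&lt;/p&gt;

```python
class VectorIdBridge:
    """Pairs sequential FAISS vector IDs with SQLite primary keys.

    IndexFlatL2 assigns IDs 0, 1, 2, ... in insertion order, so a plain
    counter mirrors index.ntotal for a flat index.
    """

    def __init__(self):
        self.id_mapping = {}     # faiss_id -> SQLite episodic id
        self.next_faiss_id = 0   # mirrors index.ntotal

    def register(self, episodic_id):
        # In the real class, index.add(vector) would run here too.
        faiss_id = self.next_faiss_id
        self.id_mapping[faiss_id] = episodic_id
        self.next_faiss_id += 1
        return faiss_id

    def resolve(self, faiss_id):
        # Translate a search hit back into a SQLite primary key.
        return self.id_mapping.get(faiss_id)
```

&lt;p&gt;&lt;em&gt;The point is that the two ID spaces only stay in sync if every vector insert and every SQLite insert go through the same code path.&lt;/em&gt;&lt;/p&gt;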

&lt;h3&gt;
  
  
  The Episodic and Declarative Schema
&lt;/h3&gt;

&lt;p&gt;I designed the SQLite tables to be extremely barebones but highly relational to the agent's temporal experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Episodic Memory: Raw events/logs
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            CREATE TABLE IF NOT EXISTS episodic_memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                timestamp DATETIME,
                event_type TEXT,
                content TEXT,
                metadata TEXT,
                consolidated BOOLEAN DEFAULT 0
            )
        &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Declarative Memory: Concrete rules/facts derived from episodes
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            CREATE TABLE IF NOT EXISTS declarative_memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                fact_type TEXT,
                fact_content TEXT,
                confidence REAL
            )
        &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;From my experience, boolean flags like &lt;code&gt;consolidated&lt;/code&gt; are the safest way to implement async background processing. It allows the agent to constantly write to Episodic memory without locking out the background distillation job.&lt;/em&gt;&lt;/p&gt;
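&lt;p&gt;&lt;em&gt;The write side of that flag lives in &lt;code&gt;store_episodic&lt;/code&gt;, which is called later but never shown. A standalone sketch (the signature is assumed from its call site): new rows always land with &lt;code&gt;consolidated = 0&lt;/code&gt;, so the nightly job can find them without any coordination with the writer.&lt;/em&gt;&lt;/p&gt;

```python
import json
import sqlite3
from datetime import datetime, timezone

# Same schema as shown above, repeated so this sketch is self-contained.
SCHEMA = """CREATE TABLE IF NOT EXISTS episodic_memory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp DATETIME,
    event_type TEXT,
    content TEXT,
    metadata TEXT,
    consolidated BOOLEAN DEFAULT 0
)"""

def store_episodic(conn, event_type, content, metadata=None):
    # New events are always written unconsolidated; the background
    # distillation job flips the flag later.
    cur = conn.execute(
        "INSERT INTO episodic_memory (timestamp, event_type, content, metadata) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), event_type, content,
         json.dumps(metadata or {})),
    )
    conn.commit()
    return cur.lastrowid
```

&lt;p&gt;&lt;em&gt;Returning &lt;code&gt;lastrowid&lt;/code&gt; matters: it is the primary key that the FAISS ID mapping needs.&lt;/em&gt;&lt;/p&gt;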

&lt;h3&gt;
  
  
  Embedding Semantic Vectors
&lt;/h3&gt;

&lt;p&gt;This is where the magic of fuzzy recall happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;simulate_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Search FAISS for semantically similar past events.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ntotal&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;simulate_embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Pseudo-random reproducible vector for PoC
&lt;/span&gt;            &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Expected API integration here
&lt;/span&gt;            &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize_L2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_mapping&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sql_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT content FROM episodic_memory WHERE id = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_id&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
                &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I observed that simply returning vector distances isn't enough. We must do a secondary lookup into SQLite using &lt;code&gt;self.id_mapping&lt;/code&gt; to return the actual, human-readable log content that matches the vector. This is how the agent fundamentally "remembers" text based on semantic meaning.&lt;/em&gt;&lt;/p&gt;
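&lt;p&gt;&lt;em&gt;One caveat on the simulated embedding above: CPython salts &lt;code&gt;hash()&lt;/code&gt; for strings per process (unless &lt;code&gt;PYTHONHASHSEED&lt;/code&gt; is fixed), so the "reproducible" vectors only reproduce within a single run. A sketch of a cross-run-stable substitute, seeding from a SHA-256 digest instead (the function name is mine):&lt;/em&gt;&lt;/p&gt;

```python
import hashlib

import numpy as np

def simulate_embedding(text, dim=1536):
    # Seed a local Generator from a stable digest; unlike the built-in
    # hash(), SHA-256 produces the same seed in every process.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.random((1, dim), dtype=np.float32)
```

&lt;p&gt;&lt;em&gt;Using a local &lt;code&gt;Generator&lt;/code&gt; also avoids reseeding NumPy's global random state on every query, which can silently affect unrelated code.&lt;/em&gt;&lt;/p&gt;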

&lt;h3&gt;
  
  
  The Sentinel Agent Logic
&lt;/h3&gt;

&lt;p&gt;Here is the core orchestration loop that fires when a pipeline breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HierarchicalMemoryOS&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentinelAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HierarchicalMemoryOS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Update Short-Term Context
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_short_term&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Retrieve Semantic Context (Has this happened before?)
&lt;/span&gt;        &lt;span class="n"&gt;similar_past_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Retrieve Declarative Rules (Are there firm rules for this?)
&lt;/span&gt;        &lt;span class="n"&gt;firm_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_declarative_rules&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. Formulate Diagnosis &amp;amp; Fix (Simulated LLM Call)
&lt;/span&gt;        &lt;span class="n"&gt;diagnosis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_analyze_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similar_past_errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;firm_rules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 5. Store Incident in Episodic Memory
&lt;/span&gt;        &lt;span class="n"&gt;episodic_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_episodic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Diagnosis: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unresolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Embed for immediate searchability
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_and_store_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;episodic_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diagnosis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I wrote it this way to force the agent to query BOTH its intuition (FAISS Semantic) and its handbook (SQLite Declarative) before invoking the LLM synthesis logic. This drastically reduces hallucinations because the LLM prompt is saturated with historical ground truth.&lt;/em&gt;&lt;/p&gt;
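&lt;p&gt;&lt;em&gt;The &lt;code&gt;get_declarative_rules&lt;/code&gt; call in step 3 isn't defined in the article. A minimal sketch against the &lt;code&gt;declarative_memory&lt;/code&gt; schema shown earlier (the &lt;code&gt;min_confidence&lt;/code&gt; parameter is my addition, not part of the original signature):&lt;/em&gt;&lt;/p&gt;

```python
import sqlite3

def get_declarative_rules(conn, fact_type, min_confidence=0.0):
    # Deterministic lookup by exact fact type -- no similarity scoring
    # here, in contrast to the FAISS path.
    cur = conn.execute(
        "SELECT fact_content, confidence FROM declarative_memory "
        "WHERE fact_type = ? AND confidence >= ? "
        "ORDER BY confidence DESC",
        (fact_type, min_confidence),
    )
    return cur.fetchall()
```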

&lt;h3&gt;
  
  
  The Memory Consolidator
&lt;/h3&gt;

&lt;p&gt;The final piece of the puzzle. This runs completely out-of-band.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryConsolidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HierarchicalMemoryOS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_consolidation_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Scan unconsolidated episodic memory and distill to declarative rules.
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            SELECT id, content FROM episodic_memory 
            WHERE consolidated = 0 AND event_type = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resolution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;consolidated_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;

            &lt;span class="c1"&gt;# Simulated LLM Extraction: Extract a firm rule from the resolution
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infer_schema=True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Always use infer_schema=True when dealing with upstream MongoDB drift.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_declarative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Mark as consolidated
&lt;/span&gt;            &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE episodic_memory SET consolidated = 1 WHERE id = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
            &lt;span class="n"&gt;consolidated_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;consolidated_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;By pulling only unconsolidated resolution logs and then marking them as &lt;code&gt;consolidated = 1&lt;/code&gt;, we maintain a high signal-to-noise ratio in the declarative database while preserving the unstructured history forever.&lt;/em&gt;&lt;/p&gt;
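&lt;p&gt;&lt;em&gt;The &lt;code&gt;store_declarative&lt;/code&gt; call inside the consolidator is also unshown. One subtlety worth sketching: because the extraction runs every night, the same rule can be re-derived many times, so a duplicate check keeps the table clean (the dedup guard and the default confidence value are my additions):&lt;/em&gt;&lt;/p&gt;

```python
import sqlite3

def store_declarative(conn, fact_type, fact_content, confidence=0.9):
    # Skip exact duplicates so repeated nightly cycles don't re-insert
    # the same distilled rule.
    exists = conn.execute(
        "SELECT 1 FROM declarative_memory "
        "WHERE fact_type = ? AND fact_content = ?",
        (fact_type, fact_content),
    ).fetchone()
    if exists:
        return None
    cur = conn.execute(
        "INSERT INTO declarative_memory (fact_type, fact_content, confidence) "
        "VALUES (?, ?, ?)",
        (fact_type, fact_content, confidence),
    )
    conn.commit()
    return cur.lastrowid
```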

&lt;h2&gt;
  
  
  Let's Setup
&lt;/h2&gt;

&lt;p&gt;If you want to run this experimental environment on your own machine:&lt;/p&gt;

&lt;p&gt;Step by step details can be found at: &lt;a href="https://github.com/aniket-work/DataPipeline-Sentinel" rel="noopener noreferrer"&gt;DataPipeline-Sentinel GitHub Repository&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repo and install the light-weight dependencies (&lt;code&gt;faiss-cpu&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;python3 main.py&lt;/code&gt; to initiate the simulation.&lt;/li&gt;
&lt;li&gt;Observe how the agent handles a novel Day 1 incident, undergoes nightly consolidation, and resolves a recurring Day 2 incident without human intervention.&lt;/li&gt;
&lt;li&gt;You can explore the exact raw source code structure there and adapt it to your LLM API of choice.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When you execute the agent, the simulation makes the shift from stateless reaction to learned behavior visible.&lt;/p&gt;

&lt;p&gt;On Day 1, the agent encounters a &lt;code&gt;Schema mismatch on 'user_metadata' array&lt;/code&gt;. Semantic lookup returns 0 results. Declarative lookup returns 0 results. The agent escalates to a human engineer. The engineer manually deploys a fix (&lt;code&gt;infer_schema=True&lt;/code&gt;). The agent logs this.&lt;/p&gt;

&lt;p&gt;At Midnight, the &lt;code&gt;MemoryConsolidator&lt;/code&gt; process wakes up. It scans the episodic logs, notices the human resolution, and extracts a hard-coded constraint rule, storing it in SQLite.&lt;/p&gt;

&lt;p&gt;On Day 2, the agent encounters a very similar error on a &lt;em&gt;different&lt;/em&gt; pipeline: &lt;code&gt;Schema mismatch on 'transaction_data' array&lt;/code&gt;.&lt;br&gt;
Instantly, the system queries FAISS and recognizes semantic similarity. It queries SQLite and retrieves the newly consolidated rule. The agent &lt;em&gt;autonomously&lt;/em&gt; suggests the exact fix without escalating to the engineer. &lt;/p&gt;

&lt;p&gt;This demonstrates that continuous, persistent learning is possible when you decouple the storage topology from the stochastic LLM generation!&lt;/p&gt;
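&lt;p&gt;The Day 1 / Day 2 sequence above can be condensed into a toy, dependency-free simulation. This is a deliberately crude stand-in: substring overlap plays the role of FAISS, lists of dicts play the role of SQLite, and the "LLM extraction" is the same hard-coded &lt;code&gt;infer_schema=True&lt;/code&gt; trigger used in the PoC:&lt;/p&gt;

```python
# Toy stand-ins: one list for episodic memory, one for declarative rules.
episodic = []
declarative = []

def handle_incident(error_log):
    # "Semantic" recall: substring overlap plays the role of FAISS here.
    similar = [e for e in episodic
               if "Schema mismatch" in e["content"]
               and "Schema mismatch" in error_log]
    rules = [r for r in declarative if r["fact_type"] == "pipeline_fix"]
    if similar and rules:
        action = f"auto-fix: {rules[0]['fact_content']}"
    else:
        action = "escalate to human"
    episodic.append({"content": error_log, "consolidated": False})
    return action

def consolidate():
    # Nightly job: distill logged resolutions into declarative rules.
    for event in episodic:
        if not event["consolidated"] and "infer_schema=True" in event["content"]:
            declarative.append({"fact_type": "pipeline_fix",
                                "fact_content": "use infer_schema=True"})
        event["consolidated"] = True

# Day 1: novel incident, no memory to draw on.
day1 = handle_incident("Schema mismatch on 'user_metadata' array")
# A human resolves it; the resolution is logged as an episode.
episodic.append({"content": "resolved with infer_schema=True",
                 "consolidated": False})
consolidate()  # midnight
# Day 2: a similar incident on a different pipeline.
day2 = handle_incident("Schema mismatch on 'transaction_data' array")
```

&lt;p&gt;Day 1 evaluates to an escalation; Day 2 evaluates to an autonomous fix, because the consolidation cycle ran in between.&lt;/p&gt;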

&lt;h2&gt;
  
  
  Extensive Deep Dive on Architectural Trade-offs
&lt;/h2&gt;

&lt;p&gt;To round out the picture, let me expand on why I think this specific stack is a sweet spot for lightweight, edge-deployed AI agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not just use a massive Vector Database for everything?
&lt;/h3&gt;

&lt;p&gt;Ah, the trap of the modern AI hype cycle. If you store &lt;em&gt;everything&lt;/em&gt; in Pinecone or Milvus, you treat subjective opinions and objective, firm rules identically. A vector database retrieves data based on mathematically fuzzy distance. If a company policy states "Never restart a Production node during business hours," you do NOT want a fuzzy 0.82 cosine similarity match to decide if that rule applies. You want a deterministic SQL clause like &lt;code&gt;WHERE rules.type = 'security_constraint'&lt;/code&gt; to enforce it. &lt;br&gt;
By splitting the data, I guarantee that the agent has both creative intuition and strict boundary compliance.&lt;/p&gt;
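&lt;p&gt;A toy illustration of that difference (the schema and the keyword matching are invented for this example; a real gate would check structured fields, not substrings):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (type TEXT, content TEXT)")
conn.execute("INSERT INTO rules VALUES ('security_constraint', "
             "'Never restart a Production node during business hours')")

def gate(proposed_action):
    # A deterministic WHERE clause decides which rules apply -- there is
    # no similarity threshold that could quietly miss a constraint.
    rows = conn.execute(
        "SELECT content FROM rules WHERE type = 'security_constraint'"
    ).fetchall()
    violations = [c for (c,) in rows
                  if "restart" in c.lower() and "restart" in proposed_action.lower()]
    return ("blocked", violations) if violations else ("allowed", [])
```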

&lt;h3&gt;
  
  
  The Ethics of Autonomous Operational Agents
&lt;/h3&gt;

&lt;p&gt;When allowing agents to manage production data pipelines, an ethical engineering dilemma arises: accountability.&lt;br&gt;
Because everything the &lt;code&gt;DataPipeline-Sentinel&lt;/code&gt; does is logged append-only into the &lt;code&gt;episodic_memory&lt;/code&gt; SQLite table (only the &lt;code&gt;consolidated&lt;/code&gt; flag is ever updated), an audit team can trace exactly why the agent took a specific action. We can see the FAISS IDs retrieved, the declarative rules pulled, and the prompt fed to the LLM. &lt;br&gt;
In my opinion, any agent performing write operations on enterprise infrastructure MUST keep an append-only, SQLite-style episodic log. RAG without auditability is a liability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Roadmap
&lt;/h3&gt;

&lt;p&gt;While this PoC handles incident logging well, my future experiments will focus on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory Decay&lt;/strong&gt;: Periodically downgrading the &lt;code&gt;confidence&lt;/code&gt; score in the Declarative table over time if a rule isn't cited in X days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict Resolution&lt;/strong&gt;: What happens when Day 50 consolidation contradicts a rule learned on Day 10? The agent will need an active reasoning loop to determine truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Memory Sharing&lt;/strong&gt;: Having a Sentinel Agent share its FAISS semantic index with a completely different Security Agent over the network.&lt;/li&gt;
&lt;/ol&gt;
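&lt;p&gt;Roadmap item 1 is simple enough to sketch already. One possible shape against the &lt;code&gt;declarative_memory&lt;/code&gt; table (the decay factor, prune floor, and function name are all invented for illustration):&lt;/p&gt;

```python
import sqlite3

DECAY_FACTOR = 0.9   # illustrative, not tuned
PRUNE_FLOOR = 0.2    # rules below this confidence are forgotten entirely

def decay_uncited_rules(conn, cited_ids):
    # Shrink confidence for every rule NOT cited this period...
    if cited_ids:
        placeholders = ",".join("?" * len(cited_ids))
        conn.execute(
            f"UPDATE declarative_memory SET confidence = confidence * ? "
            f"WHERE id NOT IN ({placeholders})",
            [DECAY_FACTOR, *cited_ids],
        )
    else:
        conn.execute(
            "UPDATE declarative_memory SET confidence = confidence * ?",
            (DECAY_FACTOR,),
        )
    # ...then prune anything that has faded below the floor.
    conn.execute("DELETE FROM declarative_memory WHERE confidence < ?",
                 (PRUNE_FLOOR,))
    conn.commit()
```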

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building the DataPipeline-Sentinel experiment was a strong validation of cognitive software architecture. I realized that an agent's intelligence isn't bounded by its underlying model's parameter count; it's bounded by the architecture of its memory systems.&lt;/p&gt;

&lt;p&gt;A $10,000 foundation model with no persistence is a genius amnesiac. But a relatively cheap model wrapped in a well-orchestrated Hierarchical Memory OS becomes a domain expert. FAISS and SQLite proved to be an excellent lightweight pairing to achieve this.&lt;/p&gt;

&lt;p&gt;If we want autonomous agents to truly integrate into real-world business environments—whether it's monitoring infrastructure, handling corporate finance, or auditing compliance—we must give them the gift of permanent, structured memory.&lt;/p&gt;




&lt;p&gt;Disclaimer&lt;/p&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>architecture</category>
      <category>data</category>
    </item>
    <item>
      <title>Building an Autonomous Brand Crisis Management Agent with Hierarchical Memory</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Tue, 24 Mar 2026 04:55:36 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/building-an-autonomous-brand-crisis-management-agent-with-hierarchical-memory-31dn</link>
      <guid>https://forem.com/exploredataaiml/building-an-autonomous-brand-crisis-management-agent-with-hierarchical-memory-31dn</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyce20mjc4q8hgnp08yu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyce20mjc4q8hgnp08yu.gif" alt="Title Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Subtitle: How I Architected a Persistent PR Defense System Using FAISS, SQLite, and Automated Memory Consolidation&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;In my recent experiments, I built &lt;strong&gt;DataPipeline-Sentinel&lt;/strong&gt;, a persistent OS for autonomous data pipeline incident management.&lt;/li&gt;
&lt;li&gt;I utilized a 4-tier Hierarchical Memory System (Context, Semantic, Episodic, Declarative) to enable the agent to genuinely learn from past incidents.&lt;/li&gt;
&lt;li&gt;By combining FAISS for vector retrieval and SQLite for immutable logging, the agent instantly recalls resolved pipeline errors.&lt;/li&gt;
&lt;li&gt;I created a nightly Memory Consolidation background job to distill hundreds of raw logs into hard-coded declarative rules.&lt;/li&gt;
&lt;li&gt;This architecture turns AI agents from stateless script-kiddies into seasoned, senior-level operators. All code is available in my public repository &lt;a href="https://github.com/aniket-work/DataPipeline-Sentinel" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw1umohcgecxpl5jlfn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw1umohcgecxpl5jlfn7.png" alt="Architecture Overview" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I observed a recurring nightmare in modern data engineering: pipelines break, engineers diagnose the issue and apply a fix (like tweaking Spark schema inference), and then... everyone forgets. Three months later, the same upstream API changes its payload structure again, a different pipeline shatters, and a new engineer wastes four hours diagnosing the exact same root cause. &lt;/p&gt;

&lt;p&gt;From my experience, stateless autonomous agents using standard RAG (Retrieval-Augmented Generation) aren't enough to solve this. If you just feed an LLM a static playbook, it never learns from the &lt;em&gt;nuances&lt;/em&gt; of daily operational chaos. I thought about how human senior engineers operate: they have an instinct derived from thousands of past, painful outages. They remember the &lt;em&gt;episodic&lt;/em&gt; pain of a MongoDB schema drift causing an Airflow DAG to hang.&lt;/p&gt;

&lt;p&gt;I wanted to replicate this. I put it this way because I realized we don't just need agents that can read docs; we need agents that can &lt;em&gt;remember experiences&lt;/em&gt;. Thus, the idea for the &lt;strong&gt;DataPipeline-Sentinel&lt;/strong&gt; was born—an experimental PoC of an EverMem-style persistent AI Agent OS that learns chronologically from production incidents and consolidates that knowledge into permanent operational wisdom.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article breaks down how I developed a Persistent Memory Operating System for an autonomous agent. I am not focusing on the specific LLM prompts. Instead, in my opinion, the fascinating part is the &lt;strong&gt;Memory Architecture&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I will walk you through building a system that features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Short-Term Context Buffers&lt;/strong&gt;: For active incident triaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Memory (FAISS)&lt;/strong&gt;: To instantly find mathematically similar past outages using high-dimensional vector embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic Memory (SQLite)&lt;/strong&gt;: An immutable, append-only ledger of everything the agent and human engineers have ever done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declarative Memory (SQLite)&lt;/strong&gt;: Firm, hard-coded constraints logically deduced from episodic logs during an automated "sleep cycle."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't about generic coding. It's about designing a cognitive architecture that allows an AI operator to organically accumulate seniority over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;To keep this experimental PoC lean, I avoided heavy vector databases or complex graph tools. My setup is purposefully brutalist and highly effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.12&lt;/strong&gt;: The core orchestrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS (CPU)&lt;/strong&gt;: Facebook's incredibly fast library for similarity search and clustering of dense vectors. I use this exclusively for Semantic Memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt;: The unsung hero of persistent storage. I use SQLite to maintain both the Episodic event logs and the Declarative rule tables. It is lightweight, zero-configuration, and ACID compliant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich&lt;/strong&gt;: For hyper-readable, beautiful terminal output simulating the agent's internal monologue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pillow &amp;amp; Mermaid.js&lt;/strong&gt;: For all the visual diagramming and UI mockups.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;In my opinion, the AI industry is overly obsessed with context windows. "Just shove 1 million tokens into context and it will figure it out!" No. From my experience, shoving endless logs into a prompt is computationally wasteful and mathematically noisy. &lt;/p&gt;

&lt;p&gt;You should read this if you want to understand how to build &lt;em&gt;Systems of Record&lt;/em&gt; for autonomous agents. If you are trying to build an agent that handles customer support, financial auditing, or infrastructure monitoring, you will eventually hit the "amnesia wall." Your agent will solve a complex edge case on Tuesday and completely forget how to do it by Thursday. &lt;/p&gt;

&lt;p&gt;This article provides the exact architectural blueprint to break through that wall. By implementing an automated consolidation layer, I've proven (at least in this PoC) that we can programmatically convert chaotic daily experiences into rigid, institutional knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The 4-Tier Cognitive Hierarchy
&lt;/h3&gt;

&lt;p&gt;When I designed this architecture, I thought deeply about human memory psychology and applied it directly to Python objects. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lpqhu5s4kgdd45z0k65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lpqhu5s4kgdd45z0k65.png" alt="Sequence Diagram" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Working Context (RAM)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: What I'm thinking about right now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: A standard Python list queue (&lt;code&gt;self.short_term_buffer&lt;/code&gt;) capped at 10 items. It holds the active error stack trace and the active pipeline name. Once the issue is resolved, this buffer is cleared.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Memory (FAISS IndexFlatL2)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: My vague intuitive sense that "I've seen this error before."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: Every time a pipeline error occurs, it is embedded into a 1536-dimensional vector and stored in &lt;code&gt;faiss.IndexFlatL2&lt;/code&gt;. When a new error comes in, I L2-normalize its vector and run &lt;code&gt;self.index.search()&lt;/code&gt; to pull the top 3 most similar historical errors; on unit vectors, L2 distance ranks identically to cosine similarity. Over time, FAISS acts as the agent's intuition. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Episodic Memory (SQLite &lt;code&gt;episodic_memory&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: My chronological journal of every outage I've ever fought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: An append-only relational table. Columns include &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, and &lt;code&gt;consolidated&lt;/code&gt;. Crucially, this table stores the &lt;em&gt;Resolutions&lt;/em&gt;—what the human engineer ultimately did to fix the pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Declarative Memory (SQLite &lt;code&gt;declarative_memory&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analogy&lt;/strong&gt;: The hard-coded rules written in the employee handbook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Implementation&lt;/strong&gt;: A curated table of strict facts (e.g., "Fact Type: pipeline_fix, Content: Use infer_schema=True for Mongo syncs"). The agent queries this table purely by SQL &lt;code&gt;WHERE&lt;/code&gt; clauses, entirely bypassing fuzzy vector math. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
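&lt;p&gt;To make Tier 1 concrete, here is a minimal sketch of a capped working buffer. This is my assumption of the shape of &lt;code&gt;add_short_term&lt;/code&gt;; the &lt;code&gt;WorkingContext&lt;/code&gt; name is mine, not the repository's:&lt;/p&gt;

```python
from collections import deque

class WorkingContext:
    """Minimal sketch of the Tier-1 buffer: a FIFO capped at 10 items."""
    def __init__(self, max_items=10):
        # deque(maxlen=...) silently evicts the oldest entry on overflow
        self.buffer = deque(maxlen=max_items)

    def add(self, item):
        self.buffer.append(item)

    def clear(self):
        # called once the incident is resolved
        self.buffer.clear()

ctx = WorkingContext()
for i in range(15):
    ctx.add({"event": "incident", "step": i})

assert len(ctx.buffer) == 10       # capped at 10 items
assert ctx.buffer[0]["step"] == 5  # the oldest five were evicted
```

&lt;p&gt;A &lt;code&gt;deque&lt;/code&gt; with &lt;code&gt;maxlen&lt;/code&gt; evicts the oldest entry automatically, which is exactly the behavior you want for a small active-incident window.&lt;/p&gt;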

&lt;h3&gt;
  
  
  The Engine of Evolution: Memory Consolidation
&lt;/h3&gt;

&lt;p&gt;This is the secret sauce. In my experiments, I realized Episodic Memory grows without bound and degrades into noise. You don't want the agent reading 10,000 raw logs of humans fixing pipelines. &lt;/p&gt;

&lt;p&gt;I wrote a &lt;code&gt;consolidation.py&lt;/code&gt; script—a cron job simulating human sleep. It runs at midnight, performs a SQL query for all logs where &lt;code&gt;consolidated = 0&lt;/code&gt;, uses an LLM to extract a generalized rule from the specific incident, writes to Declarative memory, and updates the flag to &lt;code&gt;consolidated = 1&lt;/code&gt;. &lt;/p&gt;
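&lt;p&gt;The cycle above can be sketched as a single SQL pass. &lt;code&gt;distill_rule&lt;/code&gt; is a stand-in for the LLM call, and the table layouts mirror the schema shown earlier in the article; function names here are illustrative, not the repository's exact API:&lt;/p&gt;

```python
import sqlite3

def distill_rule(content):
    # Stand-in for the LLM step that generalizes an incident into a rule
    return "Rule derived from: " + content

def run_consolidation_cycle(conn):
    cur = conn.cursor()
    cur.execute("SELECT id, content FROM episodic_memory WHERE consolidated = 0")
    rows = cur.fetchall()
    for row_id, content in rows:
        cur.execute(
            "INSERT INTO declarative_memory (fact_type, fact_content, confidence) "
            "VALUES (?, ?, ?)",
            ("pipeline_fix", distill_rule(content), 0.9),
        )
        cur.execute("UPDATE episodic_memory SET consolidated = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)  # how many episodes were consolidated

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE episodic_memory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        content TEXT,
        consolidated BOOLEAN DEFAULT 0);
    CREATE TABLE declarative_memory (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        fact_type TEXT,
        fact_content TEXT,
        confidence REAL);
""")
conn.execute("INSERT INTO episodic_memory (content) VALUES ('mongo_sync schema drift')")
assert run_consolidation_cycle(conn) == 1
assert run_consolidation_cycle(conn) == 0   # nothing left to consolidate
```

&lt;p&gt;Because the flag flips inside the same transaction as the declarative insert, re-running the job is idempotent: already-distilled episodes are skipped.&lt;/p&gt;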

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jdxle3phcypa8ik37mf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jdxle3phcypa8ik37mf.png" alt="Flow Diagram" width="336" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;I structured the project strictly around separation of concerns. The OS handles persistence, the Agent handles logic, and the Consolidator handles background distillation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Establishing the Hierarchical Memory OS
&lt;/h3&gt;

&lt;p&gt;Let's look at how I implemented the core &lt;code&gt;HierarchicalMemoryOS&lt;/code&gt; class combining SQLite and FAISS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HierarchicalMemoryOS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_os.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_dim&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialize SQLite (Episodic &amp;amp; Declarative)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialize FAISS (Semantic)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Map FAISS vector IDs back to SQLite Episodic IDs
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_faiss_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I put it this way because managing two completely different storage paradigms (Vectors in RAM/Disk and Relational rows) requires a tight unifying class. The &lt;code&gt;id_mapping&lt;/code&gt; dict bridges the gap between the FAISS integer ID array and the SQLite Primary Keys.&lt;/em&gt;&lt;/p&gt;
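&lt;p&gt;The write path that maintains this mapping is not shown above, so here is a FAISS-free stand-in (plain NumPy brute-force L2 search) illustrating how a vector row index bridges to a SQLite primary key. &lt;code&gt;SemanticIndex&lt;/code&gt; is my illustrative name, not the repository's:&lt;/p&gt;

```python
import numpy as np

class SemanticIndex:
    """FAISS-free stand-in: brute-force L2 search plus the ID bridge."""
    def __init__(self, dim=1536):
        self.vectors = np.empty((0, dim), dtype="float32")
        self.id_mapping = {}  # vector row number to episodic_memory primary key
        self.next_id = 0

    def add(self, vector, sql_id):
        v = (vector / np.linalg.norm(vector)).astype("float32")  # L2-normalize
        self.vectors = np.vstack([self.vectors, v])
        self.id_mapping[self.next_id] = sql_id
        self.next_id += 1

    def search(self, vector, top_k=3):
        v = (vector / np.linalg.norm(vector)).astype("float32")
        dists = np.linalg.norm(self.vectors - v, axis=1)  # brute-force L2
        order = np.argsort(dists)[:top_k]
        return [self.id_mapping[int(i)] for i in order]

idx = SemanticIndex(dim=4)
idx.add(np.array([1.0, 0.0, 0.0, 0.0]), sql_id=101)
idx.add(np.array([0.0, 1.0, 0.0, 0.0]), sql_id=102)
assert idx.search(np.array([0.9, 0.1, 0.0, 0.0]), top_k=1) == [101]
```

&lt;p&gt;The returned primary keys can then feed the &lt;code&gt;SELECT content FROM episodic_memory WHERE id = ?&lt;/code&gt; lookup, exactly as the OS class does.&lt;/p&gt;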

&lt;h3&gt;
  
  
  The Episodic and Declarative Schema
&lt;/h3&gt;

&lt;p&gt;I designed the SQLite tables to be extremely barebones but highly relational to the agent's temporal experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Episodic Memory: Raw events/logs
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            CREATE TABLE IF NOT EXISTS episodic_memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                timestamp DATETIME,
                event_type TEXT,
                content TEXT,
                metadata TEXT,
                consolidated BOOLEAN DEFAULT 0
            )
        &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Declarative Memory: Concrete rules/facts derived from episodes
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            CREATE TABLE IF NOT EXISTS declarative_memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                fact_type TEXT,
                fact_content TEXT,
                confidence REAL
            )
        &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;From my experience, boolean flags like &lt;code&gt;consolidated&lt;/code&gt; are the safest way to implement async background processing. It allows the agent to constantly write to Episodic memory without locking out the background distillation job.&lt;/em&gt;&lt;/p&gt;
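&lt;p&gt;One practical caveat, and this part is my assumption since the PoC code uses SQLite defaults: enabling WAL mode lets the agent keep appending episodes while the nightly job reads unconsolidated rows over a second connection:&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# WAL mode (an assumption, not shown in the article) allows one writer and
# concurrent readers on the same database file.
path = os.path.join(tempfile.mkdtemp(), "memory_os.db")

writer = sqlite3.connect(path)             # the agent's connection
writer.execute("PRAGMA journal_mode=WAL")
writer.execute(
    "CREATE TABLE episodic_memory ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, content TEXT, "
    "consolidated BOOLEAN DEFAULT 0)"
)
writer.commit()

reader = sqlite3.connect(path)             # the consolidation job's connection
writer.execute("INSERT INTO episodic_memory (content) VALUES ('incident A')")
writer.commit()

# The reader sees the committed row without blocking the writer
rows = reader.execute(
    "SELECT id FROM episodic_memory WHERE consolidated = 0"
).fetchall()
assert len(rows) == 1
```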

&lt;h3&gt;
  
  
  Embedding Semantic Vectors
&lt;/h3&gt;

&lt;p&gt;This is where the magic of fuzzy recall happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;simulate_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Search FAISS for semantically similar past events.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ntotal&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;simulate_embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Pseudo-random reproducible vector for PoC
&lt;/span&gt;            &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Expected API integration here
&lt;/span&gt;            &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize_L2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_mapping&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sql_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT content FROM episodic_memory WHERE id = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_id&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
                &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I observed that simply returning vector distances isn't enough. We must do a secondary lookup into SQLite using &lt;code&gt;self.id_mapping&lt;/code&gt; to return the actual, human-readable log content that matches the vector. This is how the agent fundamentally "remembers" text based on semantic meaning.&lt;/em&gt;&lt;/p&gt;
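&lt;p&gt;A quick check of why the normalization trick works: for unit vectors, squared L2 distance equals 2 minus twice the cosine similarity, so ranking by &lt;code&gt;IndexFlatL2&lt;/code&gt; after &lt;code&gt;normalize_L2&lt;/code&gt; is equivalent to ranking by cosine similarity:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.random(1536)
b = rng.random(1536)
a = a / np.linalg.norm(a)   # unit vectors, as produced by faiss.normalize_L2
b = b / np.linalg.norm(b)

cos_sim = float(a @ b)
sq_l2 = float(np.sum((a - b) ** 2))

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity,
# so the nearest neighbor by L2 is also the nearest by cosine.
assert np.isclose(sq_l2, 2 - 2 * cos_sim)
```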

&lt;h3&gt;
  
  
  The Sentinel Agent Logic
&lt;/h3&gt;

&lt;p&gt;Here is the core orchestration loop that fires when a pipeline breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HierarchicalMemoryOS&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentinelAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HierarchicalMemoryOS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Update Short-Term Context
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_short_term&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Retrieve Semantic Context (Has this happened before?)
&lt;/span&gt;        &lt;span class="n"&gt;similar_past_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Retrieve Declarative Rules (Are there firm rules for this?)
&lt;/span&gt;        &lt;span class="n"&gt;firm_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_declarative_rules&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. Formulate Diagnosis &amp;amp; Fix (Simulated LLM Call)
&lt;/span&gt;        &lt;span class="n"&gt;diagnosis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_analyze_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similar_past_errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;firm_rules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 5. Store Incident in Episodic Memory
&lt;/span&gt;        &lt;span class="n"&gt;episodic_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_episodic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Diagnosis: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unresolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Embed for immediate searchability
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_and_store_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;episodic_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diagnosis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I wrote it this way to force the agent to query BOTH its intuition (FAISS Semantic) and its handbook (SQLite Declarative) before invoking the LLM synthesis logic. This drastically reduces hallucinations because the LLM prompt is heavily saturated with historical ground-truth.&lt;/em&gt;&lt;/p&gt;
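&lt;p&gt;&lt;em&gt;As a rough illustration of that dual lookup, here is a minimal sketch of how the two retrieval results might be fused into one grounded prompt. The function name and prompt layout are my own assumptions for illustration, not the PoC's actual code.&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical sketch: fuse fuzzy FAISS matches ("intuition") with firm
# SQLite rules ("handbook") before any LLM call. Names are illustrative.
def build_grounded_prompt(error_log, semantic_hits, rules):
    evidence = "\n".join(f"- past incident: {hit}" for hit in semantic_hits)
    constraints = "\n".join(f"- rule: {r}" for r in rules)
    return (
        f"New failure:\n{error_log}\n\n"
        f"Similar past incidents:\n{evidence or '- none found'}\n\n"
        f"Firm rules (must be obeyed):\n{constraints or '- none'}\n\n"
        "Propose a diagnosis consistent with the rules above."
    )

prompt = build_grounded_prompt(
    "Schema mismatch on 'user_metadata' array",
    ["Schema drift resolved with infer_schema=True"],
    ["Always use infer_schema=True when dealing with upstream MongoDB drift."],
)
```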

&lt;h3&gt;
  
  
  The Memory Consolidator
&lt;/h3&gt;

&lt;p&gt;The final piece of the puzzle: the consolidator runs completely out-of-band, as a scheduled nightly job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryConsolidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HierarchicalMemoryOS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_os&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_consolidation_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Scan unconsolidated episodic memory and distill to declarative rules.
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            SELECT id, content FROM episodic_memory 
            WHERE consolidated = 0 AND event_type = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resolution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;consolidated_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;

            &lt;span class="c1"&gt;# Simulated LLM Extraction: Extract a firm rule from the resolution
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infer_schema=True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Always use infer_schema=True when dealing with upstream MongoDB drift.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_declarative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Mark as consolidated
&lt;/span&gt;            &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE episodic_memory SET consolidated = 1 WHERE id = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
            &lt;span class="n"&gt;consolidated_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;consolidated_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;By distilling resolution logs into firm rules and marking them as &lt;code&gt;consolidated = 1&lt;/code&gt;, we maintain a high signal-to-noise ratio in the declarative database while preserving the full unstructured history.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Set Up
&lt;/h2&gt;

&lt;p&gt;If you want to run this experimental environment on your own machine:&lt;/p&gt;

&lt;p&gt;Step by step details can be found at: &lt;a href="https://github.com/aniket-work/DataPipeline-Sentinel" rel="noopener noreferrer"&gt;DataPipeline-Sentinel GitHub Repository&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repo and install the light-weight dependencies (&lt;code&gt;faiss-cpu&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;python3 main.py&lt;/code&gt; to initiate the simulation.&lt;/li&gt;
&lt;li&gt;Observe how the agent handles a novel incident on Day 1, undergoes nightly consolidation, and resolves a recurrent Day 2 incident without human intervention.&lt;/li&gt;
&lt;li&gt;You can explore the exact raw source code structure there and adapt it to your LLM API of choice.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When you execute the agent, the simulation makes the shift from escalation to autonomy visible.&lt;/p&gt;

&lt;p&gt;On Day 1, the agent encounters a &lt;code&gt;Schema mismatch on 'user_metadata' array&lt;/code&gt;. Semantic lookup returns 0 results. Declarative lookup returns 0 results. The agent escalates to a human engineer. The engineer manually deploys a fix (&lt;code&gt;infer_schema=True&lt;/code&gt;). The agent logs this.&lt;/p&gt;

&lt;p&gt;At Midnight, the &lt;code&gt;MemoryConsolidator&lt;/code&gt; process wakes up. It scans the episodic logs, notices the human resolution, and extracts a hard-coded constraint rule, storing it in SQLite.&lt;/p&gt;

&lt;p&gt;On Day 2, the agent encounters a very similar error on a &lt;em&gt;different&lt;/em&gt; pipeline: &lt;code&gt;Schema mismatch on 'transaction_data' array&lt;/code&gt;.&lt;br&gt;
Instantly, the system queries FAISS and recognizes semantic similarity. It queries SQLite and retrieves the newly consolidated rule. The agent &lt;em&gt;autonomously&lt;/em&gt; suggests the exact fix without escalating to the engineer. &lt;/p&gt;

&lt;p&gt;This demonstrates that continuous, persistent learning is possible when you decouple the storage topology from stochastic LLM generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extensive Deep Dive on Architectural Trade-offs
&lt;/h2&gt;

&lt;p&gt;To round out the picture, it is worth expanding on why I think this specific stack is a sweet spot for edge AI agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not just use a massive Vector Database for everything?
&lt;/h3&gt;

&lt;p&gt;Ah, the trap of the modern AI hype cycle. If you store &lt;em&gt;everything&lt;/em&gt; in Pinecone or Milvus, you treat subjective opinions and objective firm-rules identically. A vector database retrieves data based on mathematically fuzzy distance. If a company policy states "Never restart a Production node during business hours," you do NOT want a fuzzy 0.82 cosine similarity match to decide if that rule applies. You want a deterministic SQL &lt;code&gt;WHERE rules.type = 'security_constraint'&lt;/code&gt; to enforce it. &lt;br&gt;
By splitting the data, I guarantee that the agent has both creative intuition and strict boundary compliance.&lt;/p&gt;
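&lt;p&gt;&lt;em&gt;To make that concrete, here is a tiny self-contained sketch of the deterministic side. The table layout is an assumption for illustration; the point is that constraint retrieval is an exact WHERE clause, never a similarity score.&lt;/em&gt;&lt;/p&gt;

```python
import sqlite3

# Illustrative schema: firm rules are fetched deterministically by type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (type TEXT, body TEXT)")
conn.execute("INSERT INTO rules VALUES ('security_constraint', "
             "'Never restart a Production node during business hours.')")
conn.execute("INSERT INTO rules VALUES ('style_hint', "
             "'Prefer snake_case pipeline names.')")

rows = conn.execute(
    "SELECT body FROM rules WHERE type = ?", ("security_constraint",)
).fetchall()
constraints = [body for (body,) in rows]  # exact match: all or nothing
```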

&lt;h3&gt;
  
  
  The Ethics of Autonomous Operational Agents
&lt;/h3&gt;

&lt;p&gt;When allowing agents to manage production data pipelines, an ethical engineering dilemma arises: accountability.&lt;br&gt;
Because everything the &lt;code&gt;DataPipeline-Sentinel&lt;/code&gt; does is logged immutably into &lt;code&gt;episodic_memory&lt;/code&gt; SQLite tables, an audit team can trace exactly why the agent executed a specific query. We can see the FAISS IDs retrieved, the Declarative Rules pulled, and the prompt fed to the LLM. &lt;br&gt;
In my opinion, any agent performing write-operations on enterprise infrastructure MUST have an immutable SQLite-style episodic log. RAG without auditability is a liability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Roadmap
&lt;/h3&gt;

&lt;p&gt;While this PoC handles incident logging well, my future experiments will focus on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory Decay&lt;/strong&gt;: Periodically downgrading the &lt;code&gt;confidence&lt;/code&gt; score in the Declarative table over time if a rule isn't cited in X days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict Resolution&lt;/strong&gt;: What happens when Day 50 consolidation contradicts a rule learned on Day 10? The agent will need an active reasoning loop to determine truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Memory Sharing&lt;/strong&gt;: Having a Sentinel Agent share its FAISS semantic index with a completely different Security Agent over the network.&lt;/li&gt;
&lt;/ol&gt;
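&lt;p&gt;&lt;em&gt;For roadmap item 1, a confidence-decay function could look something like the sketch below. The grace window, decay rate, and field names are all hypothetical design choices, not part of the current PoC.&lt;/em&gt;&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Speculative sketch of memory decay: a rule keeps full confidence inside a
# grace window, then loses a fixed fraction per additional window elapsed.
def decayed_confidence(confidence, last_cited, now, grace_days=30, rate=0.9):
    idle = (now - last_cited).days
    if idle <= grace_days:
        return confidence
    periods = (idle - grace_days) // grace_days
    return confidence * (rate ** max(1, periods))

now = datetime(2026, 4, 1)
fresh = decayed_confidence(1.0, now - timedelta(days=10), now)   # unchanged
stale = decayed_confidence(1.0, now - timedelta(days=100), now)  # decayed
```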

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building the DataPipeline-Sentinel experiment was a strong validation of cognitive software architecture. I realized that the intelligence of an agent isn't limited by its underlying model's parameter count; it's limited by the architecture of its memory systems. &lt;/p&gt;

&lt;p&gt;A $10,000 foundation model with no persistence is a genius amnesiac, but a relatively cheap model wrapped in a well-orchestrated Hierarchical Memory OS becomes a domain expert. FAISS and SQLite proved to be a lightweight, effective pairing for achieving this.&lt;/p&gt;

&lt;p&gt;If we want autonomous agents to truly integrate into real-world business environments—whether it's monitoring infrastructure, handling corporate finance, or auditing compliance—we must give them the gift of permanent, structured memory.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix A: The Mathematical Nuance of FAISS HNSW
&lt;/h2&gt;

&lt;p&gt;When I chose FAISS, I specifically considered the HNSW (Hierarchical Navigable Small World) graph topology. HNSW creates a multi-layered structure of links. At the top layers, you have long-distance semantic jumps. As you traverse lower, you find tightly clustered, hyper-specific nuances. &lt;br&gt;
From my experience, when embedding data pipeline error logs, the vectors tend to cluster rapidly around string constants (like "java.lang.NullPointerException"). This can blind the agent to the actual business logic failure (e.g., "Customer ID missing"). &lt;br&gt;
To counteract this, I ensure the Episodic Memory combines the raw log WITH human metadata before vectorization.&lt;/p&gt;
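&lt;p&gt;&lt;em&gt;The mitigation is simple to sketch: concatenate the raw log with human-supplied metadata before calling the embedder, so the vector carries business context and not just stack-trace constants. The helper below is illustrative; plug in whatever embedding call you actually use.&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative enrichment step: the string fed to the embedder combines the
# raw error log with sorted human metadata, so semantically similar business
# failures land near each other even when the exception class differs.
def enrich_for_embedding(raw_log, metadata):
    context = " | ".join(f"{k}: {v}" for k, v in sorted(metadata.items()))
    return f"{raw_log} || {context}"

text = enrich_for_embedding(
    "java.lang.NullPointerException in TransformStage",
    {"business_impact": "Customer ID missing", "pipeline": "orders_etl"},
)
```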

&lt;h2&gt;
  
  
  Appendix B: The Case Against Ephemeral Prompts
&lt;/h2&gt;

&lt;p&gt;I wrote this architecture because I am fundamentally opposed to the current industry trend of stuffing 10,000-line JSON files into an LLM prompt and calling it "Context." &lt;br&gt;
In my opinion, passing stateless context is like forcing a surgeon to re-read every medical textbook before every single incision: a staggering waste of compute, latency, and energy.&lt;br&gt;
By using the FAISS/SQLite memory OS, the prompt contains only the exact 3 vector matches and 2 firm rules needed. In my runs, token usage dropped by roughly 98%, and retrieval latency fell to milliseconds. &lt;/p&gt;


</description>
      <category>python</category>
      <category>ai</category>
      <category>architecture</category>
      <category>data</category>
    </item>
    <item>
      <title>Streaming Intelligence: Orchestrating Autonomous Wildfire Response with Agents</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Mon, 23 Mar 2026 01:34:17 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/streaming-intelligence-orchestrating-autonomous-wildfire-response-with-agents-15a0</link>
      <guid>https://forem.com/exploredataaiml/streaming-intelligence-orchestrating-autonomous-wildfire-response-with-agents-15a0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8usfspye4640rav59ey.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8usfspye4640rav59ey.gif" alt="Title Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Autonomous Wildfire Response Coordinator: How I Built a Self-Healing Emergency Response System Using LangGraph and Online Replanning&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I've spent the last few weeks experimenting with how AI agents handle hyper-dynamic environments. The result is &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;, a proof-of-concept autonomous coordinator that doesn't just "plan," but "replans" in real time as a simulated fire spreads. Using LangGraph's state-machine architecture, I built a system that ingests continuous sensor streams, detects containment breaches, and dynamically re-routes assets like aerial tankers and ground crews.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve always been fascinated by fast-moving, high-stakes environments. There’s something visceral about a situation where a plan made five minutes ago is already obsolete. Recently, I was observing how traditional dispatcher systems struggle with wildfires: situations where wind shifts or fuel changes can turn a controlled burn into a catastrophe in seconds.&lt;/p&gt;

&lt;p&gt;In my opinion, the bottleneck isn't the data; we have satellites, IoT sensors, and drones. The bottleneck is the &lt;strong&gt;latency of the decision-making loop&lt;/strong&gt;. I wanted to see if I could build a coordinator that acts as a "living" strategy. This isn't a production-grade safety tool—far from it. It's one of my personal experiments, a PoC to explore "Streaming Decision Agents." I wrote this implementation to test a specific hypothesis: that agentic workflows can provide the "self-healing" logic needed for emergency response.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article is a deep dive into the architecture of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;. I’ll walk you through how I designed a stochastic wildfire environment (stochastic because chaos is the point), and how I used &lt;strong&gt;LangGraph&lt;/strong&gt; to build a multi-agent system that processes data as a stream. &lt;/p&gt;

&lt;p&gt;We’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Dynamic Simulation&lt;/strong&gt;: Building a 2D grid-world where fire spreads based on wind and fuel.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Sense-Think-Act" Stream&lt;/strong&gt;: How agents receive updates and decide when to trigger a "Strategic Reset."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Online Replanning&lt;/strong&gt;: The logic behind discarding a plan mid-execution and shifting resources.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tech Stack Nuances&lt;/strong&gt;: Why I chose specific tools for state management and visualization.&lt;/li&gt;
&lt;/ol&gt;
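&lt;p&gt;&lt;em&gt;As a taste of point 1, here is a toy version of a stochastic spread step on a 2D grid: each burning cell may ignite its downwind neighbor with a probability scaled by that cell's fuel. This mirrors the idea only; the actual WildfireWorld engine is more involved.&lt;/em&gt;&lt;/p&gt;

```python
import random

# Toy stochastic fire-spread step. grid: 0 = unburned, 1 = burning.
# wind is a (row, col) offset, e.g. (0, 1) means the wind blows east.
def step(grid, fuel, wind, p_base=0.7, seed=42):
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    dr, dc = wind
    nxt = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                    if rng.random() < p_base * fuel[nr][nc]:
                        nxt[nr][nc] = 1
    return nxt

grid = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
fuel = [[1.0] * 3 for _ in range(3)]
after = step(grid, fuel, wind=(0, 1))
```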

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;From my experience, if you're building something that needs to maintain state across complex branching paths, you need a robust framework. Here’s what I used for this experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;LangGraph&lt;/strong&gt;: This was my choice for the core agentic workflow. I think its ability to represent cycles and maintain persistent state is unmatched for "replanning" scenarios.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pydantic&lt;/strong&gt;: I used this for structured event definitions. In my experience, strict type safety for agent communication prevents a lot of "hallucination-style" logic errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Python 3.10&lt;/strong&gt;: The backbone of the project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PIL (Pillow)&lt;/strong&gt;: For generating my technical visual assets (including that optimized GIF you see at the top).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;WildfireWorld (Custom)&lt;/strong&gt;: A Python-based simulation engine I wrote to provide the "input stream."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you’re interested in &lt;strong&gt;Agentic AI&lt;/strong&gt;, &lt;strong&gt;Autonomous Systems&lt;/strong&gt;, or simply how to build resilient software in chaotic domains, there’s something here for you. As per my experience, the next wave of AI isn't just about "chatting"; it's about "operating." This article shows you one way to bridge that gap—by treating AI as a coordinator that manages a real-time feedback loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;p&gt;I put it this way because I think visualization is 50% of the engineering process. Before I wrote a single line of code, I mapped out how I wanted the decision loop to look. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Architecture
&lt;/h3&gt;

&lt;p&gt;In my opinion, a streaming agent needs to be decoupled from the raw data. The "Environment" (the fire) shouldn't care about the "Agent" (the coordinator). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxotn6bpkjit1oza4u6sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxotn6bpkjit1oza4u6sg.png" alt="Architecture" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I designed the graph with four primary nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Sensor Ingest&lt;/strong&gt;: The entry point. It receives "packets" of thermal data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Threat Analyzer&lt;/strong&gt;: This is the "brain stem." It doesn't plan; it just screams "FIRE!" when something breaks the perimeter.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Strategy Optimizer&lt;/strong&gt;: The "prefrontal cortex." It looks at the mess and decides on a new containment boundary.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dispatcher&lt;/strong&gt;: The "hands." It executes the tactical movements of tankers and ground crews.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Decision Flow
&lt;/h3&gt;

&lt;p&gt;I thought about how a human dispatcher works. They don't replan every time a leaf burns. They replan when the fire "jumps" a line. I implemented this via a "Criticality Score" in the state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfxv6ba8id17dolvjwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfxv6ba8id17dolvjwy.png" alt="Flow" width="336" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;Now, let's look at the implementation. I've broken this down into the core components that make the "Sense-Think-Act" loop work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Stochastic Environment: The Math of Chaos
&lt;/h3&gt;

&lt;p&gt;I wrote this environment to be the "source of truth." It's a 2D grid where every cell has a state. The fire spreads using a stochastic model—meaning there's randomness, but it's "biased" by the wind. &lt;/p&gt;

&lt;p&gt;In my opinion, the most interesting part was the &lt;code&gt;get_wind_bias&lt;/code&gt; function. I wrote it this way because I wanted to simulate the vector-based nature of fire spread. If the wind is blowing North-East, the probability of the cell to the North-East igniting is significantly higher than that of the cell to the South-West. From my experience, small mathematical biases like this are what make a simulation feel "alive" rather than just a random walk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_wind_bias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;
    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;
    &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt;

    &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I observed that setting the &lt;code&gt;wind_strength&lt;/code&gt; to 0.5 creates a noticeable but not overwhelming "drift." I chose this value because I wanted the agent to have to "guess" where the fire would jump next, while still giving it a fighting chance if it followed the wind patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Agent State: The Shared Consciousness
&lt;/h3&gt;

&lt;p&gt;In my experience, the &lt;code&gt;AgentState&lt;/code&gt; is the most important part of any LangGraph project. It’s the "shared memory" between nodes. When I first started with agents, I used to pass everything as function arguments. I quickly realized that's a nightmare for debugging. &lt;/p&gt;

&lt;p&gt;Using a &lt;code&gt;TypedDict&lt;/code&gt; for the state allowed me to keep a clean separation of concerns. The &lt;code&gt;heat_map&lt;/code&gt; is the agent's "perception" of the world. The &lt;code&gt;active_plan&lt;/code&gt; is its "intention." And the &lt;code&gt;logs&lt;/code&gt; are its "rationale." I think this tripartite structure is a solid pattern for any autonomous system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;heat_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;active_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;assets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;criticality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;replan_required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Analyzer Agent: The Reactive Engine
&lt;/h3&gt;

&lt;p&gt;This node is responsible for "Online Decision Making." It looks at the current &lt;code&gt;heat_map&lt;/code&gt; and compares it to the &lt;code&gt;active_plan&lt;/code&gt;. I observed that the key to a good "Streaming Agent" is knowing &lt;strong&gt;what to ignore&lt;/strong&gt;. If the fire spreads into an area we've already designated as a "controlled zone," we don't need to replan. We only replan when there's a "Breach."&lt;/p&gt;

&lt;p&gt;I defined a "Breach" as fire occurring in a cell that isn't currently targeted by our assets. This is where the "Online" part of the replanning happens. The &lt;code&gt;analyzer_node&lt;/code&gt; is essentially a filter that prevents the complex &lt;code&gt;Strategist&lt;/code&gt; from running on every single step. In my opinion, this "Gated Activation" is essential for scaling these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Strategist Agent: The Proactive Re-pivoting
&lt;/h3&gt;

&lt;p&gt;When the Analyzer sets &lt;code&gt;replan_required&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;, the Strategist kicks in. As per my experience, this is the most compute-intensive part. In this PoC, I implemented a simple perimeter-search, but I put a lot of thought into how it &lt;em&gt;would&lt;/em&gt; look in a real system. &lt;/p&gt;

&lt;p&gt;Imagine using a Diffusion Model or a Graph Neural Network to predict fire spread and then using a Reinforcement Learning agent to optimize the resource allocation. For this experiment, I stuck to a more deterministic approach, but I wrote the interface to be flexible. I think that's the beauty of LangGraph—you can swap out a simple Python function for a heavy-weight ML model without changing the graph structure.&lt;/p&gt;
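
&lt;p&gt;For reference, here is a deliberately naive sketch of the perimeter-search idea. The cell states ("burning", "fuel") and the function name are my own placeholders for illustration, not the repo's API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def strategist_node(state: dict) -&amp;gt; dict:
    # Naive perimeter search: target every fuel cell adjacent to a burning one.
    burning = {pos for pos, cell in state["heat_map"].items() if cell == "burning"}
    frontier = set()
    for (x, y) in burning:
        for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if state["heat_map"].get(neighbor) == "fuel":
                frontier.add(neighbor)
    return {
        "active_plan": sorted(frontier),
        "replan_required": False,
        "logs": [f"strategist: containment line rebuilt with {len(frontier)} cells"],
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the graph only cares about the node's signature (state in, partial update out), swapping this function for a learned model wouldn't touch the graph wiring at all.&lt;/p&gt;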

&lt;h3&gt;
  
  
  5. Orchestrating the Chaos
&lt;/h3&gt;

&lt;p&gt;Finally, I tied it all together using LangGraph's &lt;code&gt;StateGraph&lt;/code&gt;. I put it this way because I think of the graph as a "Playbook." Each step of the simulation is one execution of the playbook. &lt;/p&gt;

&lt;p&gt;I wrote the loop in &lt;code&gt;main.py&lt;/code&gt; to be the "Clock" of the system. It pulses every half-second, updating the environment and then asking the agent: "Given this new state, what do we do next?"&lt;/p&gt;
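
&lt;p&gt;To show the shape of that wiring, here is a self-contained toy graph with stub nodes and a reduced state. It assumes &lt;code&gt;langgraph&lt;/code&gt; is installed; the node bodies and state fields are placeholders, not the project's actual logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

from langgraph.graph import StateGraph, END

class MiniState(TypedDict):
    replan_required: bool
    handled_by: str

# Stub nodes standing in for the real Analyzer / Strategist / Dispatcher.
def analyzer(state: MiniState) -&amp;gt; dict:
    return {}  # the real node would inspect the heat map here

def strategist(state: MiniState) -&amp;gt; dict:
    return {"handled_by": "strategist", "replan_required": False}

def dispatcher(state: MiniState) -&amp;gt; dict:
    return {"handled_by": state["handled_by"] or "dispatcher"}

def route(state: MiniState) -&amp;gt; str:
    # Gated activation: only wake the Strategist on a breach.
    return "strategist" if state["replan_required"] else "dispatcher"

graph = StateGraph(MiniState)
graph.add_node("analyzer", analyzer)
graph.add_node("strategist", strategist)
graph.add_node("dispatcher", dispatcher)
graph.set_entry_point("analyzer")
graph.add_conditional_edges("analyzer", route,
                            {"strategist": "strategist", "dispatcher": "dispatcher"})
graph.add_edge("strategist", "dispatcher")
graph.add_edge("dispatcher", END)
app = graph.compile()

# One invoke() per simulation tick; the outer loop acts as the clock.
result = app.invoke({"replan_required": True, "handled_by": ""})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The idea is that the outer loop advances the environment on each tick and then calls &lt;code&gt;invoke&lt;/code&gt; once, so every "pulse" is one full pass through the playbook.&lt;/p&gt;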

&lt;h2&gt;
  
  
  The Deep Dive: Why Online Replanning Matters
&lt;/h2&gt;

&lt;p&gt;In my opinion, the "Real World" is a series of streaming events, not a single static request. Most AI tutorials focus on "Prompt -&amp;gt; Response." But as per my experience, the real meat of the problem is "Stream -&amp;gt; Continuous Adaptation." &lt;/p&gt;

&lt;p&gt;While running these experiments, I observed something fascinating. If I turned off the "Replanning" logic and just let the agent follow its initial plan, the fire would almost always bypass the containment lines within 10 steps. By contrast, with "Online Replanning" enabled, the agent was able to dynamically shift assets to the edges of the spread, effectively "bottling up" the disaster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;I wrote this code, then I thought: "Wait, how do I visualize this so that people can actually &lt;em&gt;see&lt;/em&gt; the logic?" This led me down a rabbit hole of GIF optimization. &lt;/p&gt;

&lt;p&gt;I learned the hard way that Dev.to and LinkedIn have very specific requirements for GIFs. Standard RGB GIFs often flicker or fail to upload. I had to implement a &lt;strong&gt;Global Palette&lt;/strong&gt; strategy using PIL. By generating a single 256-color palette from key frames and converting all frames to P-Mode (with no dithering), I was able to get a crystal-clear, 100fps terminal animation that looks as premium as the code it represents.&lt;/p&gt;
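
&lt;p&gt;Here's roughly what that looks like with recent Pillow. The helper name and frame rate are mine; the key moves are a single shared palette and &lt;code&gt;Dither.NONE&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image

def save_clean_gif(frames, path, fps=50):
    # Build one global 256-colour palette from a representative frame,
    # then map every frame onto it with dithering disabled.
    reference = frames[len(frames) // 2].convert("RGB")
    palette_img = reference.quantize(colors=256, dither=Image.Dither.NONE)
    converted = [f.convert("RGB").quantize(palette=palette_img,
                                           dither=Image.Dither.NONE)
                 for f in frames]
    converted[0].save(path, save_all=True, append_images=converted[1:],
                      duration=int(1000 / fps), loop=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One caveat: &lt;code&gt;duration&lt;/code&gt; is in milliseconds, but the GIF format stores delays in 10 ms units, so very high frame rates get rounded by viewers.&lt;/p&gt;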

&lt;p&gt;Another lesson: &lt;strong&gt;State Bloom&lt;/strong&gt;. I observed that if you aren't careful, your logs will grow exponentially. I had to implement logic to "condense" logs once they exceeded a certain length. In my opinion, "State Pruning" is as important as "State Management."&lt;/p&gt;
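
&lt;p&gt;A simple version of that pruning looks like this; the thresholds are arbitrary picks for illustration, not the values I shipped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def condense_logs(logs, max_entries=200, keep_head=20):
    # Keep the earliest entries (the original plan rationale) and the most
    # recent ones; collapse everything in between into a single marker line.
    if len(logs) &amp;lt;= max_entries:
        return logs
    keep_tail = max_entries - keep_head - 1
    dropped = len(logs) - keep_head - keep_tail
    return (logs[:keep_head]
            + [f"... [{dropped} entries condensed] ..."]
            + logs[-keep_tail:])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;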

&lt;h2&gt;
  
  
  Ethics and Future Roadmap: The "Human-in-the-Loop"
&lt;/h2&gt;

&lt;p&gt;As I see it, we are entering an era of "Autonomous Infrastructure." But who watches the watchmen? In my view, an autonomous wildfire coordinator should never be 100% autonomous. I think the next iteration of this project should include a "Human Approval Node." &lt;/p&gt;

&lt;p&gt;I’d like to see a version of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt; where the agent proposes a "Strategy Shift" and a human supervisor has 30 seconds to click "Approve" before the tankers are re-routed. I think this "Collaborative Agency" is the sweet spot for high-stakes AI.&lt;/p&gt;

&lt;p&gt;My future roadmap for this experiment includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-UAV Coordination&lt;/strong&gt;: Simulating multiple assets with independent batteries and refuel cycles.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Topographical Integration&lt;/strong&gt;: Using real-world GIS data to affect fire spread (e.g., fire spreads faster uphill).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LLM-based Post-Mortem&lt;/strong&gt;: An agent that analyzes the logs after the simulation to write a "What went wrong?" report.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Setup
&lt;/h2&gt;

&lt;p&gt;If you want to play with this experiment yourself, I've pushed the complete code to GitHub. I wrote the README to be very detailed, so you should be able to get it running in under 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step by step details can be found at:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/aniket-work/WildfireGuard-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/WildfireGuard-AI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When you run the project, the terminal output is designed to show the "streaming" nature of the decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I put it this way because I wanted the user to feel the "pulse" of the system. You’ll see the "Analyzer" detecting shifts in real-time and the "Strategist" frantically recalculating. It’s a chaotic, beautiful dance of logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This experiment was a huge learning curve for me. From my experience, the hardest part of "Agentic AI" isn't the model—it's the &lt;strong&gt;State Management&lt;/strong&gt;. How do you ensure the agent remembers the plan from Step 5 when it's now at Step 15? &lt;/p&gt;

&lt;p&gt;Through this PoC, I realized that "Streaming Decisions" are the future of industrial AI. Whether it's managing a power grid, a factory floor, or a wildfire, we need systems that can "Think while they Act." I hope this walkthrough gives you some ideas for your own autonomous projects. I put a lot of heart into this implementation, and I think it shows what's possible when we stop thinking of AI as a chatbot and start thinking of it as an operator.&lt;/p&gt;

&lt;p&gt;Stay curious, stay experimental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented. &lt;/p&gt;

&lt;p&gt;This article is an experimental PoC write-up. It is not production guidance. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Footnote&lt;/strong&gt;: This article was written as part of my experiments with Agentic AI. The code repository is static and intended for learning purposes only. I put it this way because I want to emphasize that while the math is real, the application is educational.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>langgraph</category>
      <category>automation</category>
    </item>
    <item>
      <title>Orchestrating Chaos: Autonomous Wildfire Response with Streaming Decision Agents</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Mon, 23 Mar 2026 01:33:58 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/orchestrating-chaos-autonomous-wildfire-response-with-streaming-decision-agents-17b2</link>
      <guid>https://forem.com/exploredataaiml/orchestrating-chaos-autonomous-wildfire-response-with-streaming-decision-agents-17b2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8usfspye4640rav59ey.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8usfspye4640rav59ey.gif" alt="Title Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Autonomous Wildfire Response Coordinator: How I Built a Self-Healing Emergency Response System Using LangGraph and Online Replanning&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I've spent the last few weeks experimenting with how AI agents handle hyper-dynamic environments. The result is &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;—a proof-of-concept autonomous coordinator that doesn't just "plan," but "replans" in real-time as a simulated fire spreads. Using LangGraph's state machine architecture, I built a system that ingests continuous sensor streams, detects containment breaches, and dynamically re-routes assets like aerial tankers and ground crews.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve always been fascinated by fast-moving, high-stakes environments. There’s something visceral about a situation where a plan made five minutes ago is already obsolete. Recently, I was observing how traditional dispatcher systems struggle with wildfires—situations where wind shifts or fuel changes can turn a controlled burn into a catastrophe in seconds.&lt;/p&gt;

&lt;p&gt;In my opinion, the bottleneck isn't the data; we have satellites, IoT sensors, and drones. The bottleneck is the &lt;strong&gt;latency of the decision-making loop&lt;/strong&gt;. I wanted to see if I could build a coordinator that acts as a "living" strategy. This isn't a production-grade safety tool—far from it. It's one of my personal experiments, a PoC to explore "Streaming Decision Agents." I wrote this implementation to test a specific hypothesis: that agentic workflows can provide the "self-healing" logic needed for emergency response.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article is a deep dive into the architecture of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;. I’ll walk you through how I designed a stochastic wildfire environment (stochastic because chaos is the point), and how I used &lt;strong&gt;LangGraph&lt;/strong&gt; to build a multi-agent system that processes data as a stream. &lt;/p&gt;

&lt;p&gt;We’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Dynamic Simulation&lt;/strong&gt;: Building a 2D grid-world where fire spreads based on wind and fuel.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Sense-Think-Act" Stream&lt;/strong&gt;: How agents receive updates and decide when to trigger a "Strategic Reset."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Online Replanning&lt;/strong&gt;: The logic behind discarding a plan mid-execution and shifting resources.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tech Stack Nuances&lt;/strong&gt;: Why I chose specific tools for state management and visualization.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;From my experience, if you're building something that needs to maintain state across complex branching paths, you need a robust framework. Here’s what I used for this experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;LangGraph&lt;/strong&gt;: This was my choice for the core agentic workflow. I think its ability to represent cycles and maintain persistent state is unmatched for "replanning" scenarios.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pydantic&lt;/strong&gt;: I used this for structured event definitions. In my experience, strict type safety for agent communication prevents a lot of "hallucination-style" logic errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Python 3.10&lt;/strong&gt;: The backbone of the project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PIL (Pillow)&lt;/strong&gt;: For generating my technical visual assets (including that optimized GIF you see at the top).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;WildfireWorld (Custom)&lt;/strong&gt;: A Python-based simulation engine I wrote to provide the "input stream."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you’re interested in &lt;strong&gt;Agentic AI&lt;/strong&gt;, &lt;strong&gt;Autonomous Systems&lt;/strong&gt;, or simply how to build resilient software in chaotic domains, there’s something here for you. As per my experience, the next wave of AI isn't just about "chatting"; it's about "operating." This article shows you one way to bridge that gap—by treating AI as a coordinator that manages a real-time feedback loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;p&gt;I put it this way because I think visualization is 50% of the engineering process. Before I wrote a single line of code, I mapped out how I wanted the decision loop to look. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Architecture
&lt;/h3&gt;

&lt;p&gt;In my opinion, a streaming agent needs to be decoupled from the raw data. The "Environment" (the fire) shouldn't care about the "Agent" (the coordinator). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxotn6bpkjit1oza4u6sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxotn6bpkjit1oza4u6sg.png" alt="Architecture" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I designed the graph with four primary nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Sensor Ingest&lt;/strong&gt;: The entry point. It receives "packets" of thermal data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Threat Analyzer&lt;/strong&gt;: This is the "brain stem." It doesn't plan; it just screams "FIRE!" when something breaks the perimeter.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Strategy Optimizer&lt;/strong&gt;: The "prefrontal cortex." It looks at the mess and decides on a new containment boundary.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dispatcher&lt;/strong&gt;: The "hands." It executes the tactical movements of tankers and ground crews.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Decision Flow
&lt;/h3&gt;

&lt;p&gt;I thought about how a human dispatcher works. They don't replan every time a leaf burns. They replan when the fire "jumps" a line. I implemented this via a "Criticality Score" in the state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfxv6ba8id17dolvjwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfxv6ba8id17dolvjwy.png" alt="Flow" width="336" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;Now, let's look at the implementation. I've broken this down into the core components that make the "Sense-Think-Act" loop work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Stochastic Environment: The Math of Chaos
&lt;/h3&gt;

&lt;p&gt;I wrote this environment to be the "source of truth." It's a 2D grid where every cell has a state. The fire spreads using a stochastic model—meaning there's randomness, but it's "biased" by the wind. &lt;/p&gt;

&lt;p&gt;In my opinion, the most interesting part was the &lt;code&gt;get_wind_bias&lt;/code&gt; function. I wrote it this way because I wanted to simulate the vector-based nature of fire spread. If the wind is blowing North-East, the probability of the cell to the North-East igniting is significantly higher than that of the cell to the South-West. From my experience, small mathematical biases like this are what make a simulation feel "alive" rather than just a random walk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_wind_bias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;
    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;
    &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt;

    &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I observed that setting &lt;code&gt;wind_strength&lt;/code&gt; to 0.5 creates a noticeable but not overwhelming "drift." I tuned it this way because I wanted the agent to have to "guess" where the fire would jump next, while still giving it a fighting chance if it followed the wind patterns.&lt;/p&gt;
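&lt;p&gt;To make that concrete, here is a dependency-free restatement of the bias logic above (the function name and call shape are mine, not the repo's exact API): with wind "NE" and strength 0.5, a neighbor that is downwind on both axes gets two boosts, so it is twice as likely to ignite as an upwind one. Note that in this grid, "north" means decreasing y.&lt;/p&gt;

```python
def wind_bias(wind_direction, wind_strength, dx, dy):
    """Multiplicative ignition bias for a neighbor at offset (dx, dy)."""
    bias = 1.0
    if "N" in wind_direction and dy < 0: bias += wind_strength
    if "S" in wind_direction and dy > 0: bias += wind_strength
    if "E" in wind_direction and dx > 0: bias += wind_strength
    if "W" in wind_direction and dx < 0: bias += wind_strength
    return bias

print(wind_bias("NE", 0.5, 1, -1))  # downwind neighbor: 2.0
print(wind_bias("NE", 0.5, -1, 1))  # upwind neighbor: 1.0
```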

&lt;h3&gt;
  
  
  2. The Agent State: The Shared Consciousness
&lt;/h3&gt;

&lt;p&gt;In my experience, the &lt;code&gt;AgentState&lt;/code&gt; is the most important part of any LangGraph project. It’s the "shared memory" between nodes. When I first started with agents, I used to pass everything as function arguments. I quickly realized that's a nightmare for debugging. &lt;/p&gt;

&lt;p&gt;Using a &lt;code&gt;TypedDict&lt;/code&gt; for the state allowed me to keep a clean separation of concerns. The &lt;code&gt;heat_map&lt;/code&gt; is the agent's "perception" of the world. The &lt;code&gt;active_plan&lt;/code&gt; is its "intention." And the &lt;code&gt;logs&lt;/code&gt; are its "rationale." I think this tripartite structure is a solid pattern for any autonomous system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;heat_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;active_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;assets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;criticality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;replan_required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Analyzer Agent: The Reactive Engine
&lt;/h3&gt;

&lt;p&gt;This node is responsible for "Online Decision Making." It looks at the current &lt;code&gt;heat_map&lt;/code&gt; and compares it to the &lt;code&gt;active_plan&lt;/code&gt;. I observed that the key to a good "Streaming Agent" is knowing &lt;strong&gt;what to ignore&lt;/strong&gt;. If the fire spreads into an area we've already designated as a "controlled zone," we don't need to replan. We only replan when there's a "Breach."&lt;/p&gt;

&lt;p&gt;I defined a "Breach" as fire occurring in a cell that isn't currently targeted by our assets. This is where the "Online" part of the replanning happens. The &lt;code&gt;analyzer_node&lt;/code&gt; is essentially a filter that prevents the complex &lt;code&gt;Strategist&lt;/code&gt; from running on every single step. In my opinion, this "Gated Activation" is essential for scaling these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Strategist Agent: The Proactive Re-pivoting
&lt;/h3&gt;

&lt;p&gt;When the Analyzer sets &lt;code&gt;replan_required&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;, the Strategist kicks in. In my experience, this is the most compute-intensive part. In this PoC, I implemented a simple perimeter search, but I put a lot of thought into how it &lt;em&gt;would&lt;/em&gt; look in a real system. &lt;/p&gt;

&lt;p&gt;Imagine using a Diffusion Model or a Graph Neural Network to predict fire spread and then using a Reinforcement Learning agent to optimize the resource allocation. For this experiment, I stuck to a more deterministic approach, but I wrote the interface to be flexible. I think that's the beauty of LangGraph—you can swap out a simple Python function for a heavy-weight ML model without changing the graph structure.&lt;/p&gt;
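&lt;p&gt;For a sense of what "simple perimeter search" can mean, here is my own minimal sketch (not the repo's implementation): collect every in-bounds, unburned neighbor of a burning cell as a candidate containment line.&lt;/p&gt;

```python
def perimeter(heat_map, width, height):
    """Return unburned cells adjacent to fire: a candidate containment line."""
    burning = [c for c, s in heat_map.items() if s == "burning"]
    line = set()
    for x, y in burning:
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                if heat_map.get((nx, ny), "fuel") != "burning":
                    line.add((nx, ny))
    return sorted(line)

heat_map = {(1, 1): "burning"}
print(perimeter(heat_map, 3, 3))  # the four neighbors of (1, 1)
```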

&lt;h3&gt;
  
  
  5. Orchestrating the Chaos
&lt;/h3&gt;

&lt;p&gt;Finally, I tied it all together using LangGraph's &lt;code&gt;StateGraph&lt;/code&gt;. I put it this way because I think of the graph as a "Playbook." Each step of the simulation is one execution of the playbook. &lt;/p&gt;

&lt;p&gt;I wrote the loop in &lt;code&gt;main.py&lt;/code&gt; to be the "Clock" of the system. It pulses every half-second, updating the environment and then asking the agent: "Given this new state, what do we do next?"&lt;/p&gt;
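&lt;p&gt;The repo wires this up with LangGraph's &lt;code&gt;StateGraph&lt;/code&gt;; to keep an example runnable without that dependency, here is the same control flow as plain Python (the node bodies are stand-ins I made up, and the half-second sleep is omitted):&lt;/p&gt;

```python
def run_playbook(state, nodes):
    """One pulse of the playbook: ingest -> analyze -> (maybe) strategize -> dispatch."""
    state = nodes["ingest"](state)
    state = nodes["analyze"](state)
    if state["replan_required"]:  # the conditional edge: gated activation
        state = nodes["strategize"](state)
    return nodes["dispatch"](state)

nodes = {
    "ingest": lambda s: {**s, "step": s["step"] + 1},
    "analyze": lambda s: {**s, "replan_required": s["step"] % 2 == 0},
    "strategize": lambda s: {**s, "active_plan": ["new-line"]},
    "dispatch": lambda s: s,
}

state = {"step": 0, "replan_required": False, "active_plan": []}
for _ in range(2):  # the "clock": one pulse per tick
    state = run_playbook(state, nodes)
print(state["step"], state["active_plan"])  # 2 ['new-line']
```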

&lt;h2&gt;
  
  
  The Deep Dive: Why Online Replanning Matters
&lt;/h2&gt;

&lt;p&gt;In my opinion, the "Real World" is a series of streaming events, not a single static request. Most AI tutorials focus on "Prompt -&amp;gt; Response." But as per my experience, the real meat of the problem is "Stream -&amp;gt; Continuous Adaptation." &lt;/p&gt;

&lt;p&gt;While running these experiments, I observed something fascinating. If I turned off the "Replanning" logic and just let the agent follow its initial plan, the fire would almost always bypass the containment lines within 10 steps. By contrast, with "Online Replanning" enabled, the agent was able to dynamically shift assets to the edges of the spread, effectively "bottling up" the disaster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;I wrote this code, then I thought: "Wait, how do I visualize this so that people can actually &lt;em&gt;see&lt;/em&gt; the logic?" This led me down a rabbit hole of GIF optimization. &lt;/p&gt;

&lt;p&gt;I learned the hard way that Dev.to and LinkedIn have very specific requirements for GIFs. Standard RGB GIFs often flicker or fail to upload. I had to implement a &lt;strong&gt;Global Palette&lt;/strong&gt; strategy using PIL. By generating a single 256-color palette from key frames and converting all frames to P-Mode (with no dithering), I was able to get a crystal-clear, 100fps terminal animation that looks as premium as the code it represents.&lt;/p&gt;

&lt;p&gt;Another lesson: &lt;strong&gt;State Bloom&lt;/strong&gt;. I observed that if you aren't careful, your logs will grow without bound. I had to implement logic to "condense" logs once they exceeded a certain length. In my opinion, "State Pruning" is as important as "State Management."&lt;/p&gt;
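&lt;p&gt;My condensing logic boiled down to something like this (a simplified sketch; the thresholds are illustrative): keep the oldest and newest entries and collapse the middle into a single summary line.&lt;/p&gt;

```python
def condense_logs(logs, max_len=6, keep=2):
    """State pruning: once logs exceed max_len, summarize the middle."""
    if len(logs) <= max_len:
        return logs
    dropped = len(logs) - 2 * keep
    return logs[:keep] + [f"... {dropped} entries condensed ..."] + logs[-keep:]

logs = [f"event {i}" for i in range(10)]
print(condense_logs(logs))
# ['event 0', 'event 1', '... 6 entries condensed ...', 'event 8', 'event 9']
```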

&lt;h2&gt;
  
  
  Ethics and Future Roadmap: The "Human-in-the-Loop"
&lt;/h2&gt;

&lt;p&gt;As I see it, we are entering an era of "Autonomous Infrastructure." But who watches the watchmen? In my view, an autonomous wildfire coordinator should never be 100% autonomous. I think the next iteration of this project should include a "Human Approval Node." &lt;/p&gt;

&lt;p&gt;I’d like to see a version of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt; where the agent proposes a "Strategy Shift" and a human supervisor has 30 seconds to click "Approve" before the tankers are re-routed. I think this "Collaborative Agency" is the sweet spot for high-stakes AI.&lt;/p&gt;
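&lt;p&gt;As a sketch of what that node could look like (entirely hypothetical; the callback and timeout handling are stand-ins): treat "no answer in time" the same as a rejection and fail safe by holding the current plan.&lt;/p&gt;

```python
def approval_gate(proposal, wait_for_human, timeout_s=30.0):
    """Apply a proposed strategy shift only if a human approves in time.

    wait_for_human(timeout_s) should return True (approve), False (reject),
    or None (no answer before the timeout).
    """
    decision = wait_for_human(timeout_s)
    if decision is True:
        return ("approved", proposal)
    # Rejection and timeout both fail safe: keep the current plan.
    return ("held", None)

print(approval_gate("reroute tankers", lambda t: True))  # ('approved', 'reroute tankers')
print(approval_gate("reroute tankers", lambda t: None))  # ('held', None)
```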

&lt;p&gt;My future roadmap for this experiment includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-UAV Coordination&lt;/strong&gt;: Simulating multiple assets with independent batteries and refuel cycles.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Topographical Integration&lt;/strong&gt;: Using real-world GIS data to affect fire spread (e.g., fire spreads faster uphill).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LLM-based Post-Mortem&lt;/strong&gt;: An agent that analyzes the logs after the simulation to write a "What went wrong?" report.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Setup
&lt;/h2&gt;

&lt;p&gt;If you want to play with this experiment yourself, I've pushed the complete code to GitHub. I wrote the README to be very detailed, so you should be able to get it running in under 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step by step details can be found at:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/aniket-work/WildfireGuard-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/WildfireGuard-AI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When you run the project, the terminal output is designed to show the "streaming" nature of the decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I designed the output this way because I wanted the user to feel the "pulse" of the system. You’ll see the "Analyzer" detecting shifts in real-time and the "Strategist" frantically recalculating. It’s a chaotic, beautiful dance of logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This experiment was a huge learning curve for me. From my experience, the hardest part of "Agentic AI" isn't the model—it's the &lt;strong&gt;State Management&lt;/strong&gt;. How do you ensure the agent remembers the plan from Step 5 when it's now at Step 15? &lt;/p&gt;

&lt;p&gt;Through this PoC, I realized that "Streaming Decisions" are the future of industrial AI. Whether it's managing a power grid, a factory floor, or a wildfire, we need systems that can "Think while they Act." I hope this walkthrough gives you some ideas for your own autonomous projects. I put a lot of heart into this implementation, and I think it shows what's possible when we stop thinking of AI as a chatbot and start thinking of it as an operator.&lt;/p&gt;

&lt;p&gt;Stay curious, stay experimental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented. &lt;/p&gt;

&lt;p&gt;This article is an experimental PoC write-up. It is not production guidance. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Footnote&lt;/strong&gt;: This article was written as part of my experiments with Agentic AI. The code repository is static and intended for learning purposes only. I put it this way because I want to emphasize that while the math is real, the application is educational.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>langgraph</category>
      <category>automation</category>
    </item>
    <item>
      <title>Orchestrating Chaos: Autonomous Wildfire Response with Streaming Decision Agents</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Mon, 23 Mar 2026 00:40:55 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/orchestrating-chaos-autonomous-wildfire-response-with-streaming-decision-agents-3hib</link>
      <guid>https://forem.com/exploredataaiml/orchestrating-chaos-autonomous-wildfire-response-with-streaming-decision-agents-3hib</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8usfspye4640rav59ey.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8usfspye4640rav59ey.gif" alt="Title Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Autonomous Wildfire Response Coordinator: How I Built a Self-Healing Emergency Response System Using LangGraph and Online Replanning&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I've spent the last few weeks experimenting with how AI agents handle hyper-dynamic environments. The result is &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;—a proof-of-concept autonomous coordinator that doesn't just "plan," but "replans" in real-time as a simulated fire spreads. Using LangGraph's state machine architecture, I built a system that ingests continuous sensor streams, detects containment breaches, and dynamically re-routes assets like aerial tankers and ground crews.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve always been fascinated by fast-moving, high-stakes environments. There’s something visceral about a situation where a plan made five minutes ago is already obsolete. Recently, I was observing how traditional dispatcher systems struggle with wildfires—situations where wind shifts or fuel changes can turn a controlled burn into a catastrophe in seconds.&lt;/p&gt;

&lt;p&gt;In my opinion, the bottleneck isn't the data; we have satellites, IoT sensors, and drones. The bottleneck is the &lt;strong&gt;latency of the decision-making loop&lt;/strong&gt;. I wanted to see if I could build a coordinator that acts as a "living" strategy. This isn't a production-grade safety tool—far from it. It's one of my personal experiments, a PoC to explore "Streaming Decision Agents." I wrote this implementation to test a specific hypothesis: that agentic workflows can provide the "self-healing" logic needed for emergency response.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article is a deep dive into the architecture of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;. I’ll walk you through how I designed a stochastic wildfire environment (stochastic because chaos is the point), and how I used &lt;strong&gt;LangGraph&lt;/strong&gt; to build a multi-agent system that processes data as a stream. &lt;/p&gt;

&lt;p&gt;We’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Dynamic Simulation&lt;/strong&gt;: Building a 2D grid-world where fire spreads based on wind and fuel.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Sense-Think-Act" Stream&lt;/strong&gt;: How agents receive updates and decide when to trigger a "Strategic Reset."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Online Replanning&lt;/strong&gt;: The logic behind discarding a plan mid-execution and shifting resources.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tech Stack Nuances&lt;/strong&gt;: Why I chose specific tools for state management and visualization.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;From my experience, if you're building something that needs to maintain state across complex branching paths, you need a robust framework. Here’s what I used for this experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;LangGraph&lt;/strong&gt;: This was my choice for the core agentic workflow. I think its ability to represent cycles and maintain persistent state is unmatched for "replanning" scenarios.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pydantic&lt;/strong&gt;: I used this for structured event definitions. In my experience, strict type safety for agent communication prevents a lot of "hallucination-style" logic errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Python 3.10&lt;/strong&gt;: The backbone of the project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PIL (Pillow)&lt;/strong&gt;: For generating my technical visual assets (including that optimized GIF you see at the top).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;WildfireWorld (Custom)&lt;/strong&gt;: A Python-based simulation engine I wrote to provide the "input stream."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you’re interested in &lt;strong&gt;Agentic AI&lt;/strong&gt;, &lt;strong&gt;Autonomous Systems&lt;/strong&gt;, or simply how to build resilient software in chaotic domains, there’s something here for you. In my experience, the next wave of AI isn't just about "chatting"; it's about "operating." This article shows you one way to bridge that gap—by treating AI as a coordinator that manages a real-time feedback loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;p&gt;I put it this way because I think visualization is 50% of the engineering process. Before I wrote a single line of code, I mapped out how I wanted the decision loop to look. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Architecture
&lt;/h3&gt;

&lt;p&gt;In my opinion, a streaming agent needs to be decoupled from the raw data. The "Environment" (the fire) shouldn't care about the "Agent" (the coordinator). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxotn6bpkjit1oza4u6sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxotn6bpkjit1oza4u6sg.png" alt="Architecture" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I designed the graph with four primary nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Sensor Ingest&lt;/strong&gt;: The entry point. It receives "packets" of thermal data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Threat Analyzer&lt;/strong&gt;: This is the "brain stem." It doesn't plan; it just screams "FIRE!" when something breaks the perimeter.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Strategy Optimizer&lt;/strong&gt;: The "prefrontal cortex." It looks at the mess and decides on a new containment boundary.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dispatcher&lt;/strong&gt;: The "hands." It executes the tactical movements of tankers and ground crews.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Decision Flow
&lt;/h3&gt;

&lt;p&gt;I thought about how a human dispatcher works. They don't replan every time a leaf burns. They replan when the fire "jumps" a line. I implemented this via a "Criticality Score" in the state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfxv6ba8id17dolvjwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfxv6ba8id17dolvjwy.png" alt="Flow" width="336" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;Now, let's look at the implementation. I've broken this down into the core components that make the "Sense-Think-Act" loop work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Stochastic Environment: The Math of Chaos
&lt;/h3&gt;

&lt;p&gt;I wrote this environment to be the "source of truth." It's a 2D grid where every cell has a state. The fire spreads using a stochastic model—meaning there's randomness, but it's "biased" by the wind. &lt;/p&gt;

&lt;p&gt;In my opinion, the most interesting part was the &lt;code&gt;get_wind_bias&lt;/code&gt; function. I wrote it this way because I wanted to simulate the vector-based nature of fire spread. If the wind is blowing North-East, the probability of the cell to the North-East igniting is significantly higher than the cell to the South-West. From my experience, small mathematical biases like this are what make a simulation feel "alive" rather than just a random walk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_wind_bias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;
    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;
    &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt;

    &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I observed that setting &lt;code&gt;wind_strength&lt;/code&gt; to 0.5 creates a noticeable but not overwhelming "drift." I tuned it this way because I wanted the agent to have to "guess" where the fire would jump next, while still giving it a fighting chance if it followed the wind patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Agent State: The Shared Consciousness
&lt;/h3&gt;

&lt;p&gt;In my experience, the &lt;code&gt;AgentState&lt;/code&gt; is the most important part of any LangGraph project. It’s the "shared memory" between nodes. When I first started with agents, I used to pass everything as function arguments. I quickly realized that's a nightmare for debugging. &lt;/p&gt;

&lt;p&gt;Using a &lt;code&gt;TypedDict&lt;/code&gt; for the state allowed me to keep a clean separation of concerns. The &lt;code&gt;heat_map&lt;/code&gt; is the agent's "perception" of the world. The &lt;code&gt;active_plan&lt;/code&gt; is its "intention." And the &lt;code&gt;logs&lt;/code&gt; are its "rationale." I think this tripartite structure is a solid pattern for any autonomous system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;heat_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;active_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;assets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;criticality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;replan_required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Analyzer Agent: The Reactive Engine
&lt;/h3&gt;

&lt;p&gt;This node is responsible for "Online Decision Making." It looks at the current &lt;code&gt;heat_map&lt;/code&gt; and compares it to the &lt;code&gt;active_plan&lt;/code&gt;. I observed that the key to a good "Streaming Agent" is knowing &lt;strong&gt;what to ignore&lt;/strong&gt;. If the fire spreads into an area we've already designated as a "controlled zone," we don't need to replan. We only replan when there's a "Breach."&lt;/p&gt;

&lt;p&gt;I defined a "Breach" as fire occurring in a cell that isn't currently targeted by our assets. This is where the "Online" part of the replanning happens. The &lt;code&gt;analyzer_node&lt;/code&gt; is essentially a filter that prevents the complex &lt;code&gt;Strategist&lt;/code&gt; from running on every single step. In my opinion, this "Gated Activation" is essential for scaling these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Strategist Agent: The Proactive Re-pivoting
&lt;/h3&gt;

&lt;p&gt;When the Analyzer sets &lt;code&gt;replan_required&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;, the Strategist kicks in. In my experience, this is the most compute-intensive part. In this PoC, I implemented a simple perimeter search, but I put a lot of thought into how it &lt;em&gt;would&lt;/em&gt; look in a real system. &lt;/p&gt;

&lt;p&gt;Imagine using a Diffusion Model or a Graph Neural Network to predict fire spread and then using a Reinforcement Learning agent to optimize the resource allocation. For this experiment, I stuck to a more deterministic approach, but I wrote the interface to be flexible. I think that's the beauty of LangGraph—you can swap out a simple Python function for a heavy-weight ML model without changing the graph structure.&lt;/p&gt;
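&lt;p&gt;One way to keep that interface flexible is to make the graph node depend only on a callable's signature, so a deterministic search and an ML-backed planner become drop-in replacements. This is my own sketch, not the repo's exact code:&lt;/p&gt;

```python
from typing import Callable, List, Tuple

Coord = Tuple[int, int]
# The node only depends on this signature, so swapping a deterministic search
# for a heavy ML model never touches the graph wiring.
Strategist = Callable[[dict], List[Coord]]

def deterministic_strategist(state: dict) -> List[Coord]:
    return sorted(state["breaches"])  # stand-in for the perimeter search

def make_strategist_node(strategist: Strategist):
    def node(state: dict) -> dict:
        return {**state, "active_plan": strategist(state), "replan_required": False}
    return node

node = make_strategist_node(deterministic_strategist)
out = node({"breaches": [(2, 1), (0, 3)], "replan_required": True})
print(out["active_plan"])  # [(0, 3), (2, 1)]
```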

&lt;h3&gt;
  
  
  5. Orchestrating the Chaos
&lt;/h3&gt;

&lt;p&gt;Finally, I tied it all together using LangGraph's &lt;code&gt;StateGraph&lt;/code&gt;. I put it this way because I think of the graph as a "Playbook." Each step of the simulation is one execution of the playbook. &lt;/p&gt;

&lt;p&gt;I wrote the loop in &lt;code&gt;main.py&lt;/code&gt; to be the "Clock" of the system. It pulses every half-second, updating the environment and then asking the agent: "Given this new state, what do we do next?"&lt;/p&gt;
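&lt;p&gt;The "Clock" itself can be as small as this (the step function, tick count, and period here are illustrative; the real loop pulses every half-second):&lt;/p&gt;

```python
import time

def run_clock(step_fn, ticks, period_s=0.5):
    """Pulse the system: advance the world, then let the agent react."""
    state = {"step": 0}
    for _ in range(ticks):
        start = time.monotonic()
        state = step_fn(state)
        # Sleep off the remainder of the period so pulses stay evenly spaced.
        time.sleep(max(0.0, period_s - (time.monotonic() - start)))
    return state

state = run_clock(lambda s: {"step": s["step"] + 1}, ticks=3, period_s=0.01)
print(state["step"])  # 3
```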

&lt;h2&gt;
  
  
  The Deep Dive: Why Online Replanning Matters
&lt;/h2&gt;

&lt;p&gt;In my opinion, the "Real World" is a series of streaming events, not a single static request. Most AI tutorials focus on "Prompt -&amp;gt; Response." But as per my experience, the real meat of the problem is "Stream -&amp;gt; Continuous Adaptation." &lt;/p&gt;

&lt;p&gt;While running these experiments, I observed something fascinating. If I turned off the "Replanning" logic and just let the agent follow its initial plan, the fire would almost always bypass the containment lines within 10 steps. By contrast, with "Online Replanning" enabled, the agent was able to dynamically shift assets to the edges of the spread, effectively "bottling up" the disaster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;I wrote this code, then I thought: "Wait, how do I visualize this so that people can actually &lt;em&gt;see&lt;/em&gt; the logic?" This led me down a rabbit hole of GIF optimization. &lt;/p&gt;

&lt;p&gt;I learned the hard way that Dev.to and LinkedIn have very specific requirements for GIFs. Standard RGB GIFs often flicker or fail to upload. I had to implement a &lt;strong&gt;Global Palette&lt;/strong&gt; strategy using PIL. By generating a single 256-color palette from key frames and converting all frames to P-Mode (with no dithering), I was able to get a crystal-clear, 100fps terminal animation that looks as premium as the code it represents.&lt;/p&gt;

&lt;p&gt;Another lesson: &lt;strong&gt;State Bloom&lt;/strong&gt;. I observed that if you aren't careful, your logs will grow without bound. I had to implement logic to "condense" logs once they exceeded a certain length. In my opinion, "State Pruning" is as important as "State Management."&lt;/p&gt;
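&lt;p&gt;One hedged sketch of such log condensing, folding old entries into a summary line and keeping a recent tail (the thresholds here are illustrative, not the project's):&lt;/p&gt;

```python
def prune_logs(logs, max_lines=50, keep_tail=20):
    """Condense the history once it exceeds max_lines: fold the old
    entries into a single summary line and keep the recent tail."""
    if len(logs) > max_lines:
        dropped = len(logs) - keep_tail
        return [f"[condensed {dropped} earlier entries]"] + logs[-keep_tail:]
    return logs
```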

&lt;h2&gt;
  
  
  Ethics and Future Roadmap: The "Human-in-the-Loop"
&lt;/h2&gt;

&lt;p&gt;As I see it, we are entering an era of "Autonomous Infrastructure." But who watches the watchmen? In my view, an autonomous wildfire coordinator should never be 100% autonomous. I think the next iteration of this project should include a "Human Approval Node." &lt;/p&gt;

&lt;p&gt;I’d like to see a version of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt; where the agent proposes a "Strategy Shift" and a human supervisor has 30 seconds to click "Approve" before the tankers are re-routed. I think this "Collaborative Agency" is the sweet spot for high-stakes AI.&lt;/p&gt;
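&lt;p&gt;A possible shape for that approval gate, with a fail-safe default of keeping the current plan on rejection or timeout. The &lt;code&gt;proposed_plan&lt;/code&gt; key and the &lt;code&gt;get_decision&lt;/code&gt; callback are hypothetical names for whatever the approval UI would poll:&lt;/p&gt;

```python
import time

def human_approval_node(state, get_decision, timeout_s=30.0):
    """Hold a proposed plan until a supervisor approves it; on reject
    or timeout, keep the current plan (fail-safe default, an assumption)."""
    deadline = time.monotonic() + timeout_s
    while deadline - time.monotonic() > 0:
        decision = get_decision()
        if decision == "approve":
            return {"active_plan": state["proposed_plan"],
                    "logs": ["supervisor approved strategy shift"]}
        if decision == "reject":
            break
        time.sleep(0.05)
    return {"logs": ["strategy shift rejected or timed out; plan unchanged"]}
```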

&lt;p&gt;My future roadmap for this experiment includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-UAV Coordination&lt;/strong&gt;: Simulating multiple assets with independent batteries and refuel cycles.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Topographical Integration&lt;/strong&gt;: Using real-world GIS data to affect fire spread (e.g., fire spreads faster uphill).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LLM-based Post-Mortem&lt;/strong&gt;: An agent that analyzes the logs after the simulation to write a "What went wrong?" report.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Setup
&lt;/h2&gt;

&lt;p&gt;If you want to play with this experiment yourself, I've pushed the complete code to GitHub. I wrote the README to be very detailed, so you should be able to get it running in under 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step by step details can be found at:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/aniket-work/WildfireGuard-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/WildfireGuard-AI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When you run the project, the terminal output is designed to show the "streaming" nature of the decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I put it this way because I wanted the user to feel the "pulse" of the system. You’ll see the "Analyzer" detecting shifts in real-time and the "Strategist" frantically recalculating. It’s a chaotic, beautiful dance of logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This experiment was a huge learning curve for me. From my experience, the hardest part of "Agentic AI" isn't the model—it's the &lt;strong&gt;State Management&lt;/strong&gt;. How do you ensure the agent remembers the plan from Step 5 when it's now at Step 15? &lt;/p&gt;

&lt;p&gt;Through this PoC, I realized that "Streaming Decisions" are the future of industrial AI. Whether it's managing a power grid, a factory floor, or a wildfire, we need systems that can "Think while they Act." I hope this walkthrough gives you some ideas for your own autonomous projects. I put a lot of heart into this implementation, and I think it shows what's possible when we stop thinking of AI as a chatbot and start thinking of it as an operator.&lt;/p&gt;

&lt;p&gt;Stay curious, stay experimental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented. &lt;/p&gt;

&lt;p&gt;This article is an experimental PoC write-up. It is not production guidance. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Footnote&lt;/strong&gt;: This article was written as part of my experiments with Agentic AI. The code repository is static and intended for learning purposes only. I put it this way because I want to emphasize that while the math is real, the application is educational.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>langgraph</category>
      <category>automation</category>
    </item>
    <item>
      <title>Autonomous Wildfire Response Coordinator: Orchestrating Chaos with Streaming Decision Agents</title>
      <dc:creator>Aniket Hingane</dc:creator>
      <pubDate>Mon, 23 Mar 2026 00:40:03 +0000</pubDate>
      <link>https://forem.com/exploredataaiml/autonomous-wildfire-response-coordinator-orchestrating-chaos-with-streaming-decision-agents-404o</link>
      <guid>https://forem.com/exploredataaiml/autonomous-wildfire-response-coordinator-orchestrating-chaos-with-streaming-decision-agents-404o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovu09rc1053uzt9e9yyk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovu09rc1053uzt9e9yyk.gif" alt="Title Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Autonomous Wildfire Response Coordinator: How I Built a Self-Healing Emergency Response System Using LangGraph and Online Replanning&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I've spent the last few weeks experimenting with how AI agents handle hyper-dynamic environments. The result is &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;—a proof-of-concept autonomous coordinator that doesn't just "plan," but "replans" in real-time as a simulated fire spreads. Using LangGraph's state machine architecture, I built a system that ingests continuous sensor streams, detects containment breaches, and dynamically re-routes assets like aerial tankers and ground crews.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve always been fascinated by fast-moving, high-stakes environments. There’s something visceral about a situation where a plan made five minutes ago is already obsolete. Recently, I was observing how traditional dispatcher systems struggle with wildfires—situations where wind shifts or fuel changes can turn a controlled burn into a catastrophe in seconds.&lt;/p&gt;

&lt;p&gt;In my opinion, the bottleneck isn't the data; we have satellites, IoT sensors, and drones. The bottleneck is the &lt;strong&gt;latency of the decision-making loop&lt;/strong&gt;. I wanted to see if I could build a coordinator that acts as a "living" strategy. This isn't a production-grade safety tool—far from it. It's one of my personal experiments, a PoC to explore "Streaming Decision Agents." I wrote this implementation to test a specific hypothesis: that agentic workflows can provide the "self-healing" logic needed for emergency response.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's This Article About?
&lt;/h2&gt;

&lt;p&gt;This article is a deep dive into the architecture of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt;. I’ll walk you through how I designed a stochastic wildfire environment (stochastic because chaos is the point), and how I used &lt;strong&gt;LangGraph&lt;/strong&gt; to build a multi-agent system that processes data as a stream. &lt;/p&gt;

&lt;p&gt;We’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Dynamic Simulation&lt;/strong&gt;: Building a 2D grid-world where fire spreads based on wind and fuel.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Sense-Think-Act" Stream&lt;/strong&gt;: How agents receive updates and decide when to trigger a "Strategic Reset."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Online Replanning&lt;/strong&gt;: The logic behind discarding a plan mid-execution and shifting resources.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tech Stack Nuances&lt;/strong&gt;: Why I chose specific tools for state management and visualization.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;From my experience, if you're building something that needs to maintain state across complex branching paths, you need a robust framework. Here’s what I used for this experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;LangGraph&lt;/strong&gt;: This was my choice for the core agentic workflow. I think its ability to represent cycles and maintain persistent state is unmatched for "replanning" scenarios.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pydantic&lt;/strong&gt;: I used this for structured event definitions. In my experience, strict type safety for agent communication prevents a lot of "hallucination-style" logic errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Python 3.10&lt;/strong&gt;: The backbone of the project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PIL (Pillow)&lt;/strong&gt;: For generating my technical visual assets (including that optimized GIF you see at the top).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;WildfireWorld (Custom)&lt;/strong&gt;: A Python-based simulation engine I wrote to provide the "input stream."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Read It?
&lt;/h2&gt;

&lt;p&gt;If you’re interested in &lt;strong&gt;Agentic AI&lt;/strong&gt;, &lt;strong&gt;Autonomous Systems&lt;/strong&gt;, or simply how to build resilient software in chaotic domains, there’s something here for you. In my experience, the next wave of AI isn't just about "chatting"; it's about "operating." This article shows you one way to bridge that gap—by treating AI as a coordinator that manages a real-time feedback loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Design
&lt;/h2&gt;

&lt;p&gt;I put it this way because I think visualization is 50% of the engineering process. Before I wrote a single line of code, I mapped out how I wanted the decision loop to look. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Architecture
&lt;/h3&gt;

&lt;p&gt;In my opinion, a streaming agent needs to be decoupled from the raw data. The "Environment" (the fire) shouldn't care about the "Agent" (the coordinator). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngnmajfu9rpr4k0yptpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngnmajfu9rpr4k0yptpn.png" alt="Architecture" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I designed the graph with four primary nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Sensor Ingest&lt;/strong&gt;: The entry point. It receives "packets" of thermal data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Threat Analyzer&lt;/strong&gt;: This is the "brain stem." It doesn't plan; it just screams "FIRE!" when something breaks the perimeter.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Strategy Optimizer&lt;/strong&gt;: The "prefrontal cortex." It looks at the mess and decides on a new containment boundary.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dispatcher&lt;/strong&gt;: The "hands." It executes the tactical movements of tankers and ground crews.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Decision Flow
&lt;/h3&gt;

&lt;p&gt;I thought about how a human dispatcher works. They don't replan every time a leaf burns. They replan when the fire "jumps" a line. I implemented this via a "Criticality Score" in the state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7ty6dgukyvrin7sxt19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7ty6dgukyvrin7sxt19.png" alt="Flow" width="336" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get Cooking
&lt;/h2&gt;

&lt;p&gt;Now, let's look at the implementation. I've broken this down into the core components that make the "Sense-Think-Act" loop work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Stochastic Environment: The Math of Chaos
&lt;/h3&gt;

&lt;p&gt;I wrote this environment to be the "source of truth." It's a 2D grid where every cell has a state. The fire spreads using a stochastic model—meaning there's randomness, but it's "biased" by the wind. &lt;/p&gt;

&lt;p&gt;In my opinion, the most interesting part was the &lt;code&gt;get_wind_bias&lt;/code&gt; function. I wrote it this way because I wanted to simulate the vector-based nature of fire spread. If the wind is blowing North-East, the probability of the cell to the North-East igniting is significantly higher than the cell to the South-West. From my experience, small mathematical biases like this are what make a simulation feel "alive" rather than just a random walk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_wind_bias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_pos&lt;/span&gt;
    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_pos&lt;/span&gt;
    &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt;

    &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_direction&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wind_strength&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I observed that setting the &lt;code&gt;wind_strength&lt;/code&gt; to 0.5 creates a noticeable but not overwhelming "drift." I put it this way because I wanted the agent to have to "guess" where the fire would jump next, but still give it a fighting chance if it followed the wind patterns.&lt;/p&gt;
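&lt;p&gt;To see the drift in numbers, here is a tiny stand-alone harness around the same logic (the convention that y decreases toward North is an assumption on my part). A downwind North-East neighbour gets double the baseline bias, while the upwind South-West neighbour stays at 1.0:&lt;/p&gt;

```python
class WindDemo:
    """Minimal stand-in mirroring the article's get_wind_bias, to show
    the North-East drift at wind_strength 0.5."""
    def __init__(self, wind_direction="NE", wind_strength=0.5):
        self.wind_direction = wind_direction
        self.wind_strength = wind_strength

    def get_wind_bias(self, from_pos, to_pos):
        fx, fy = from_pos
        tx, ty = to_pos
        dx, dy = tx - fx, ty - fy
        bias = 1.0
        # y decreases toward North, x increases toward East
        if "N" in self.wind_direction and 0 > dy: bias += self.wind_strength
        if "S" in self.wind_direction and dy > 0: bias += self.wind_strength
        if "E" in self.wind_direction and dx > 0: bias += self.wind_strength
        if "W" in self.wind_direction and 0 > dx: bias += self.wind_strength
        return bias

wind = WindDemo()
downwind = wind.get_wind_bias((5, 5), (6, 4))  # NE neighbour: 1.0 + 0.5 + 0.5 = 2.0
upwind = wind.get_wind_bias((5, 5), (4, 6))    # SW neighbour: no bonus, 1.0
```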

&lt;h3&gt;
  
  
  2. The Agent State: The Shared Consciousness
&lt;/h3&gt;

&lt;p&gt;In my experience, the &lt;code&gt;AgentState&lt;/code&gt; is the most important part of any LangGraph project. It’s the "shared memory" between nodes. When I first started with agents, I used to pass everything as function arguments. I quickly realized that's a nightmare for debugging. &lt;/p&gt;

&lt;p&gt;Using a &lt;code&gt;TypedDict&lt;/code&gt; for the state allowed me to keep a clean separation of concerns. The &lt;code&gt;heat_map&lt;/code&gt; is the agent's "perception" of the world. The &lt;code&gt;active_plan&lt;/code&gt; is its "intention." And the &lt;code&gt;logs&lt;/code&gt; are its "rationale." I think this tripartite structure is a solid pattern for any autonomous system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;heat_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;active_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Coord&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;assets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;criticality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;replan_required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Analyzer Agent: The Reactive Engine
&lt;/h3&gt;

&lt;p&gt;This node is responsible for "Online Decision Making." It looks at the current &lt;code&gt;heat_map&lt;/code&gt; and compares it to the &lt;code&gt;active_plan&lt;/code&gt;. I observed that the key to a good "Streaming Agent" is knowing &lt;strong&gt;what to ignore&lt;/strong&gt;. If the fire spreads into an area we've already designated as a "controlled zone," we don't need to replan. We only replan when there's a "Breach."&lt;/p&gt;

&lt;p&gt;I defined a "Breach" as fire occurring in a cell that isn't currently targeted by our assets. This is where the "Online" part of the replanning happens. The &lt;code&gt;analyzer_node&lt;/code&gt; is essentially a filter that prevents the complex &lt;code&gt;Strategist&lt;/code&gt; from running on every single step. In my opinion, this "Gated Activation" is essential for scaling these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Strategist Agent: The Proactive Re-pivoting
&lt;/h3&gt;

&lt;p&gt;When the Analyzer sets &lt;code&gt;replan_required&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;, the Strategist kicks in. In my experience, this is the most compute-intensive part. In this PoC, I implemented a simple perimeter search, but I put a lot of thought into how it &lt;em&gt;would&lt;/em&gt; look in a real system.&lt;/p&gt;

&lt;p&gt;Imagine using a Diffusion Model or a Graph Neural Network to predict fire spread and then using a Reinforcement Learning agent to optimize the resource allocation. For this experiment, I stuck to a more deterministic approach, but I wrote the interface to be flexible. I think that's the beauty of LangGraph—you can swap out a simple Python function for a heavy-weight ML model without changing the graph structure.&lt;/p&gt;
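&lt;p&gt;A perimeter search of the kind described can be sketched like this. It is a hand-rolled illustration, not the repo's exact implementation, and it assumes "burning"/"fuel" cell statuses:&lt;/p&gt;

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

def strategist_node(state: dict) -> dict:
    """Perimeter search: target every fuel cell adjacent to a burning
    cell, so assets form a containment ring around the fire."""
    heat_map: Dict[Coord, str] = state["heat_map"]
    burning = [pos for pos, s in heat_map.items() if s == "burning"]
    perimeter: List[Coord] = []
    for x, y in burning:
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if heat_map.get(nb) == "fuel" and nb not in perimeter:
                perimeter.append(nb)
    return {
        "active_plan": perimeter,
        "replan_required": False,  # the new plan resets the gate
        "logs": [f"replanned: {len(perimeter)} perimeter target(s)"],
    }
```

&lt;p&gt;Because the node only reads and returns state, swapping this function for a learned spread model later would not change the graph at all.&lt;/p&gt;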

&lt;h3&gt;
  
  
  5. Orchestrating the Chaos
&lt;/h3&gt;

&lt;p&gt;Finally, I tied it all together using LangGraph's &lt;code&gt;StateGraph&lt;/code&gt;. I put it this way because I think of the graph as a "Playbook." Each step of the simulation is one execution of the playbook. &lt;/p&gt;

&lt;p&gt;I wrote the loop in &lt;code&gt;main.py&lt;/code&gt; to be the "Clock" of the system. It pulses every half-second, updating the environment and then asking the agent: "Given this new state, what do we do next?"&lt;/p&gt;
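&lt;p&gt;The playbook-per-tick idea can be shown without any framework. This dependency-free sketch mirrors the control flow only; in the actual project, LangGraph's &lt;code&gt;StateGraph&lt;/code&gt; with a conditional edge after the analyzer plays this role:&lt;/p&gt;

```python
def run_step(state, sensor, analyzer, strategist, dispatcher):
    """One pulse of the playbook: sense, analyze, optionally strategize,
    then act. Mirrors a StateGraph whose conditional edge routes to the
    Strategist only when a replan is required."""
    for node in (sensor, analyzer):
        state = {**state, **node(state)}
    if state["replan_required"]:  # conditional edge: wake the Strategist
        state = {**state, **strategist(state)}
    return {**state, **dispatcher(state)}
```

&lt;p&gt;Each node takes the shared state and returns a partial update, which is exactly the contract LangGraph nodes follow.&lt;/p&gt;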

&lt;h2&gt;
  
  
  The Deep Dive: Why Online Replanning Matters
&lt;/h2&gt;

&lt;p&gt;In my opinion, the "Real World" is a series of streaming events, not a single static request. Most AI tutorials focus on "Prompt -&amp;gt; Response." But as per my experience, the real meat of the problem is "Stream -&amp;gt; Continuous Adaptation." &lt;/p&gt;

&lt;p&gt;While running these experiments, I observed something fascinating. If I turned off the "Replanning" logic and just let the agent follow its initial plan, the fire would almost always bypass the containment lines within 10 steps. By contrast, with "Online Replanning" enabled, the agent was able to dynamically shift assets to the edges of the spread, effectively "bottling up" the disaster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;I wrote this code, then I thought: "Wait, how do I visualize this so that people can actually &lt;em&gt;see&lt;/em&gt; the logic?" This led me down a rabbit hole of GIF optimization. &lt;/p&gt;

&lt;p&gt;I learned the hard way that Dev.to and LinkedIn have very specific requirements for GIFs. Standard RGB GIFs often flicker or fail to upload. I had to implement a &lt;strong&gt;Global Palette&lt;/strong&gt; strategy using PIL. By generating a single 256-color palette from key frames and converting all frames to P-Mode (with no dithering), I was able to get a crystal-clear, 100fps terminal animation that looks as premium as the code it represents.&lt;/p&gt;

&lt;p&gt;Another lesson: &lt;strong&gt;State Bloom&lt;/strong&gt;. I observed that if you aren't careful, your logs will grow without bound. I had to implement logic to "condense" logs once they exceeded a certain length. In my opinion, "State Pruning" is as important as "State Management."&lt;/p&gt;
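&lt;p&gt;One hedged sketch of such log condensing, folding old entries into a summary line and keeping a recent tail (the thresholds here are illustrative, not the project's):&lt;/p&gt;

```python
def prune_logs(logs, max_lines=50, keep_tail=20):
    """Condense the history once it exceeds max_lines: fold the old
    entries into a single summary line and keep the recent tail."""
    if len(logs) > max_lines:
        dropped = len(logs) - keep_tail
        return [f"[condensed {dropped} earlier entries]"] + logs[-keep_tail:]
    return logs
```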

&lt;h2&gt;
  
  
  Ethics and Future Roadmap: The "Human-in-the-Loop"
&lt;/h2&gt;

&lt;p&gt;As I see it, we are entering an era of "Autonomous Infrastructure." But who watches the watchmen? In my view, an autonomous wildfire coordinator should never be 100% autonomous. I think the next iteration of this project should include a "Human Approval Node." &lt;/p&gt;

&lt;p&gt;I’d like to see a version of &lt;strong&gt;WildfireGuard-AI&lt;/strong&gt; where the agent proposes a "Strategy Shift" and a human supervisor has 30 seconds to click "Approve" before the tankers are re-routed. I think this "Collaborative Agency" is the sweet spot for high-stakes AI.&lt;/p&gt;
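&lt;p&gt;A possible shape for that approval gate, with a fail-safe default of keeping the current plan on rejection or timeout. The &lt;code&gt;proposed_plan&lt;/code&gt; key and the &lt;code&gt;get_decision&lt;/code&gt; callback are hypothetical names for whatever the approval UI would poll:&lt;/p&gt;

```python
import time

def human_approval_node(state, get_decision, timeout_s=30.0):
    """Hold a proposed plan until a supervisor approves it; on reject
    or timeout, keep the current plan (fail-safe default, an assumption)."""
    deadline = time.monotonic() + timeout_s
    while deadline - time.monotonic() > 0:
        decision = get_decision()
        if decision == "approve":
            return {"active_plan": state["proposed_plan"],
                    "logs": ["supervisor approved strategy shift"]}
        if decision == "reject":
            break
        time.sleep(0.05)
    return {"logs": ["strategy shift rejected or timed out; plan unchanged"]}
```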

&lt;p&gt;My future roadmap for this experiment includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-UAV Coordination&lt;/strong&gt;: Simulating multiple assets with independent batteries and refuel cycles.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Topographical Integration&lt;/strong&gt;: Using real-world GIS data to affect fire spread (e.g., fire spreads faster uphill).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LLM-based Post-Mortem&lt;/strong&gt;: An agent that analyzes the logs after the simulation to write a "What went wrong?" report.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Let's Setup
&lt;/h2&gt;

&lt;p&gt;If you want to play with this experiment yourself, I've pushed the complete code to GitHub. I wrote the README to be very detailed, so you should be able to get it running in under 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step by step details can be found at:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/aniket-work/WildfireGuard-AI" rel="noopener noreferrer"&gt;https://github.com/aniket-work/WildfireGuard-AI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's Run
&lt;/h2&gt;

&lt;p&gt;When you run the project, the terminal output is designed to show the "streaming" nature of the decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I put it this way because I wanted the user to feel the "pulse" of the system. You’ll see the "Analyzer" detecting shifts in real-time and the "Strategist" frantically recalculating. It’s a chaotic, beautiful dance of logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This experiment was a huge learning curve for me. From my experience, the hardest part of "Agentic AI" isn't the model—it's the &lt;strong&gt;State Management&lt;/strong&gt;. How do you ensure the agent remembers the plan from Step 5 when it's now at Step 15? &lt;/p&gt;

&lt;p&gt;Through this PoC, I realized that "Streaming Decisions" are the future of industrial AI. Whether it's managing a power grid, a factory floor, or a wildfire, we need systems that can "Think while they Act." I hope this walkthrough gives you some ideas for your own autonomous projects. I put a lot of heart into this implementation, and I think it shows what's possible when we stop thinking of AI as a chatbot and start thinking of it as an operator.&lt;/p&gt;

&lt;p&gt;Stay curious, stay experimental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented. &lt;/p&gt;

&lt;p&gt;This article is an experimental PoC write-up. It is not production guidance. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Footnote&lt;/strong&gt;: This article was written as part of my experiments with Agentic AI. The code repository is static and intended for learning purposes only. I put it this way because I want to emphasize that while the math is real, the application is educational.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>langgraph</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
