Forem: Manjunath

What Enterprise RAG Is Ready For Today and What Production Deployment Actually Requires

Manjunath — Fri, 22 May 2026 19:31:00 +0000

Enterprise RAG — A practitioner's build log | Post 6 of 6

This series has documented a system built to a specific standard: one where access control is enforced before retrieval scoring, where every answer includes traceable citations, and where the evaluation set measures restricted document leakage rather than retrieval relevance alone.

This final post answers the question that matters most for teams considering this as a foundation: what works today, what needs to be in place before this handles real internal documents in a production environment, and what the gap between those two states actually looks like.

What is fully operational today

Every item below runs locally without external dependencies or provider credentials.

Document pipeline:

Markdown document ingestion with front-matter role metadata (POST /ingest)
SQLite metadata and chunk store with document and chunk count metrics
Lexical retrieval with token cosine similarity scoring

Query and access control:

Role-based candidate filter applied before retrieval scoring
Citation-backed answer generation in deterministic mock mode
RBAC_blocked_count logged per query — tracks how many chunks were filtered
Role derivation from X-API-Key header, preventing request-body role elevation

API and authentication:

FastAPI query API (POST /query) with health probes at GET /health
Local user registration with role assignment (POST /auth/register)
API key creation, listing, and revocation (POST /api-keys, GET /api-keys, POST /api-keys/{api_key_id}/revoke)
SHA-256 key hash storage — raw key never persisted after creation
Management endpoint protection via ADMIN_TOKEN

Evaluation:

Evaluation runner via POST /eval/run — calls live query pipeline, not a mocked path
Four metrics per run: pass rate, restricted leakage count, citation coverage, average latency
Per-case results with expected vs. retrieved document IDs and pass/fail indicators

Operational controls:

Audit log for all administrative actions (GET /audit-logs)
Query log with citations, role, latency, and RBAC metrics
Prometheus-style operational metrics endpoint
Security headers enabled by default, CORS explicitly configured
Structured JSON request logging (JSON_LOGS=true)
In-memory rate limiting per client (RATE_LIMIT_PER_MINUTE)

Infrastructure:

API-backed Streamlit dashboard — no direct database access from the UI
Docker files for containerized runtime validation
GitHub Actions CI workflow
Azure AI Search retrieval adapter implemented and configuration-selectable
OpenAI and Azure OpenAI generation adapters configuration-selectable

The Azure deployment path

The local runtime maps directly to an Azure deployment topology:

Employee → Microsoft Entra ID
↓
Azure Container Apps: API + Dashboard
↓
Azure AI Search (retrieval)
Azure OpenAI (answer generation)
Azure PostgreSQL or Cosmos DB (metadata + audit logs)
Azure Blob Storage (source documents)
↓
Azure Key Vault (secrets)
Application Insights (logs + metrics)

Switching from local to Azure requires environment variable changes only. No code changes. No schema migrations between SQLite and PostgreSQL — the SQLAlchemy layer handles both. Azure mode fails fast when required AZURE_* settings are missing rather than silently degrading to a local fallback.
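The fail-fast behavior described above can be sketched as a startup check. This is a minimal illustration, not the project's actual configuration code: the function name and the specific AZURE_* variable names are hypothetical placeholders, since the post does not list the real setting names.

```python
# Hypothetical setting names; the post only says "required AZURE_* settings".
REQUIRED_AZURE_SETTINGS = [
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
]


def validate_azure_settings(env, required=REQUIRED_AZURE_SETTINGS):
    """Raise at startup if any required setting is missing or empty."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Refuse to start instead of silently falling back to the local path.
        raise RuntimeError("Azure mode requires: " + ", ".join(missing))


def check(env):
    """Return the error message for this environment, or None when it is complete."""
    try:
        validate_azure_settings(env)
    except RuntimeError as exc:
        return str(exc)
    return None
```

Calling a check like this once during application startup (for example with os.environ) turns a missing setting into an immediate, named error rather than a degraded runtime.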

What production deployment requires beyond the current implementation

Entra ID or OIDC role derivation from identity claims. The local implementation derives role from API key registration. Production deployment should derive role from authenticated identity token claims — not from request parameters or static key registration. The AUTH_PROVIDER=entra configuration path is implemented. End-to-end validation requires a live Azure tenant.

Semantic or hybrid retrieval. The local lexical retriever is deterministic and validates access control correctly. It does not match the retrieval quality of embedding-based semantic search for queries without token overlap with document chunks. Azure AI Search vector and hybrid query modes are the planned production retrieval path.

Distributed rate limiting. The in-memory rate limiter does not share state across multiple API instances. Horizontal scaling requires Redis-backed or API gateway rate limiting.
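To make the scaling limit concrete, here is a minimal sketch of a per-client, in-memory sliding-window limiter. The class name and the injectable clock are assumptions for illustration; the project's actual limiter may be structured differently.

```python
import time
from collections import defaultdict, deque


class InMemoryRateLimiter:
    """Sliding 60-second window per client, held in this process only."""

    def __init__(self, limit_per_minute, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= 60:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Each instance keeps its own counters, so two API replicas each running a limiter like this would together admit up to twice the configured limit. That is the gap a Redis-backed or gateway-level limiter closes.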

PII classification and retention policies. The reference document corpus is synthetic. Before ingesting real internal documents — HR records, finance reports, incident logs — the ingestion pipeline should classify content for PII, apply sensitivity labels, and enforce explicit data retention policies for stored queries and generated answers.

Tenant isolation. The current implementation is single-organization. A deployment serving multiple business units with strict data isolation between them requires a tenant isolation layer at the data model and query pipeline level.

Broader evaluation set. The current evaluation set is calibrated for access-control validation across a small synthetic corpus. A production evaluation set requires human relevance labels, answer correctness checks, and a regression threshold integrated into the CI workflow.

An honest assessment

Enterprise RAG demonstrates the architecture that matters for internal knowledge systems: pre-retrieval access control, citation-backed answers, and an evaluation standard that measures restricted document leakage. The local implementation is complete, testable, and fully reproducible without provider credentials.

The gap to production is real and specific. Entra ID integration, semantic retrieval, distributed rate limiting, PII handling, and tenant isolation are well-understood engineering problems with clear solutions. None of them require rethinking the core pipeline — the access control order, the citation model, and the evaluation structure remain intact.

For a team building an internal document Q&A system: the architecture here is worth adopting. The hardening list above is the production backlog, not a reason to start from scratch.

What I would implement next

The highest-impact single item is Entra ID role derivation in production. The entire value of pre-retrieval access control depends on the role being trustworthy. In a local environment with API key role binding, that trust is reasonable. In a production environment with hundreds of employees, role must come from an authenticated identity provider — not from a manually registered key that may become stale when someone changes teams or leaves the organization.

The concrete step: configure AUTH_PROVIDER=entra, map Entra group claims to retrieval roles, and validate that the role filter receives the correct role from the token rather than from the request body. That single change makes the access control guarantee durable against organizational changes.
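The group-to-role mapping step might look like the sketch below. It assumes the bearer token has already been validated (signature, issuer, audience, expiry) and decoded into a claims dictionary. The group identifiers, the role names other than employee and finance, and the function name are hypothetical; the employee default follows the behavior described in the security post.

```python
# Hypothetical mapping from identity-provider group identifiers to retrieval roles.
GROUP_TO_ROLE = {
    "group-id-finance": "finance",
    "group-id-engineering": "engineer",
}

DEFAULT_ROLE = "employee"


def role_from_claims(claims):
    """Pick the retrieval role from validated token claims, never from the request body."""
    for group in claims.get("groups") or []:
        role = GROUP_TO_ROLE.get(group)
        if role is not None:
            return role  # first mapped group wins in this sketch
    return DEFAULT_ROLE
```

A user who belongs to several mapped groups needs an explicit precedence rule; taking the first match is a simplification made here for brevity.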

One question for you

When an employee changes roles or leaves your organization, how quickly does your internal knowledge system stop serving them documents from their previous role? Is that enforced at the identity provider level or at the document system level?

This concludes the Enterprise RAG build log series.

Full series index

The Access Control Gap That Makes Most Enterprise RAG Systems Dangerous
How Enterprise RAG Is Structured: Why Access Control Comes Before Retrieval Scoring
Three Design Decisions That Shaped the Enterprise RAG Retrieval Pipeline
Four Metrics That Actually Tell You Whether Your Enterprise RAG Is Working
Security Controls in Enterprise RAG: Keys, Audit Logs, and the Hierarchy That Prevents Role Elevation
What Enterprise RAG Is Ready For Today and What Production Deployment Actually Requires (this post)

Security Controls in Enterprise RAG: Keys, Audit Logs, and the Hierarchy That Prevents Role Elevation

Manjunath — Thu, 21 May 2026 21:46:00 +0000

Enterprise RAG — A practitioner's build log | Post 5 of 6

A knowledge search system for internal documents carries a specific security obligation: it must not make it easier to access restricted information than going to the document source directly. If an employee can ask a question and receive an answer that reflects finance data they are not authorized to see, the system has introduced a new attack surface that did not exist before.

The security design in Enterprise RAG addresses this through a hierarchy of controls — not a single mechanism, but a layered set that each address a distinct failure point. This post documents what each control does, the tradeoff it accepts, and what remains explicitly unimplemented.

Control 1: API key role binding — preventing request-body role elevation

The query endpoint accepts a user_role parameter in the request body. For unauthenticated local use, this is acceptable. In any shared or externally accessible environment, it is a security problem: a caller who knows the role parameter name can claim any role and retrieve documents outside their authorization.

The control is API key role binding. When a request includes an X-API-Key header, the role context for retrieval is derived from the key holder's registered role — not from the request body. The request body role is ignored entirely.

User registers → POST /auth/register (role assigned at registration)
Admin creates key → POST /api-keys (key is bound to user's role)
Query with key → POST /query + X-API-Key header
Role used for retrieval: key holder's registered role (request body role ignored)

This closes the role elevation path for authenticated callers. A key issued to a user with role: employee cannot retrieve finance documents even if the caller submits user_role: finance in the request body.
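The precedence rule can be stated in a few lines. This is an illustrative sketch with a hypothetical function name, not the project's code: the key holder's registered role always wins, and the body role is consulted only when no key was presented.

```python
def resolve_role(api_key_role, body_role):
    """Return the role used for retrieval.

    api_key_role: role registered for the key holder, or None when the
                  request carried no X-API-Key header.
    body_role:    the caller-supplied user_role from the request body.
    """
    if api_key_role is not None:
        return api_key_role  # authenticated caller: body role is ignored
    return body_role  # unauthenticated local use only
```

With this ordering, a request that presents an employee key and submits user_role: finance is still evaluated as employee.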

API keys are stored as SHA-256 hashes. The raw key is returned once at creation and never again. If a key is lost, it must be revoked and reissued — the stored hash cannot be reversed to recover the original value.
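A minimal sketch of hash-only key storage, using Python's standard library. The function names and the 32-byte key length are assumptions; the post states only that SHA-256 hashes are stored and the raw key is returned once.

```python
import hashlib
import secrets


def hash_key(raw_key):
    """SHA-256 hex digest of the raw key; this is the only value persisted."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def issue_key():
    """Generate a random key and return (raw_key, stored_hash).

    The raw key is shown to the caller once and then discarded by the server.
    """
    raw_key = secrets.token_urlsafe(32)
    return raw_key, hash_key(raw_key)


raw, stored = issue_key()
```

On each request the server hashes the presented key and compares it with the stored digest, which is why a lost key cannot be recovered and has to be reissued.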

Control 2: Key revocation — making access removal immediate

API key revocation (POST /api-keys/{api_key_id}/revoke) removes the stored hash. A revoked key is rejected by POST /query on the next request — there is no grace period, no cache to drain, no session to expire.

This is operationally important for two scenarios: an employee departure and a compromised credential. In both cases, the recovery action is immediate revocation rather than waiting for a session timeout or token expiry.

The revocation endpoint requires the ADMIN_TOKEN when management protection is enabled, which means the revocation action itself is authenticated. An unauthorized caller cannot revoke another user's key.
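Why revocation takes effect on the next request can be shown with a small in-memory store. This sketch is an assumption about the general shape (hash lookup on every request, no cache), with hypothetical class and method names; the actual system uses a database table.

```python
import hashlib
import hmac


def digest(raw_key):
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


class KeyStore:
    def __init__(self):
        self._records = {}  # api_key_id -> (sha256 hex digest, role)
        self._next_id = 1

    def add(self, raw_key, role):
        key_id = self._next_id
        self._next_id += 1
        self._records[key_id] = (digest(raw_key), role)
        return key_id

    def role_for(self, raw_key):
        """Look up the role on every request; returns None for unknown or revoked keys."""
        wanted = digest(raw_key)
        for stored, role in self._records.values():
            if hmac.compare_digest(stored, wanted):
                return role
        return None

    def revoke(self, key_id):
        """Delete the stored hash; returns True if the key existed."""
        return self._records.pop(key_id, None) is not None
```

Because nothing is cached between requests, removing the record is the whole revocation: the next lookup for that key finds no matching hash.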

Control 3: Management endpoint protection — separating operational from query access

A class of endpoints — ingestion, user registration, key creation, key listing, audit log access, and evaluation runs — are administrative by nature. In any shared or hosted environment, these endpoints must not be accessible without authentication.

When ADMIN_TOKEN is set, these endpoints require X-Admin-Token in the request header:

- `POST /ingest`
- `POST /auth/register`
- `POST /api-keys`, `GET /api-keys`, `POST /api-keys/{api_key_id}/revoke`
- `GET /audit-logs`
- `POST /eval/run`

The query endpoint (POST /query) is governed separately by API key authentication. Query access and management access use different credentials with different scopes. A leaked query key does not grant management access. A leaked admin token does not include the role context of any specific user.
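The management check itself reduces to one comparison. The sketch below uses a hypothetical function name and assumes that an unset ADMIN_TOKEN means protection is disabled, which is how the post describes the local default.

```python
import hmac


def management_allowed(configured_token, header_token):
    """Decide whether a management request may proceed.

    configured_token: value of ADMIN_TOKEN, or None/empty when protection is off.
    header_token:     value of the X-Admin-Token request header, or None if absent.
    """
    if not configured_token:
        return True  # protection disabled; acceptable only for local use
    if header_token is None:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(configured_token.encode("utf-8"), header_token.encode("utf-8"))
```

In a web framework this would typically sit in a dependency or middleware that returns 401 when the function returns False.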

Control 4: Audit logging — making every administrative action traceable

Every management action writes a record to audit_logs: which action was taken, when, and by which admin credential. The audit log is readable through GET /audit-logs with admin authentication.

The current scope of audit logging covers administrative actions. Query logs — which record the question asked, the role used, the citations returned, and the RBAC-blocked chunk count — are stored separately in the query log table and accessible through the dashboard.

Together, these two logs answer the questions a security review will ask: who ingested documents, when were keys created or revoked, what was queried by which role, and what was blocked.
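An audit record of this kind can be sketched with SQLite from the standard library. Only the table name audit_logs comes from the post; the column names, the action string, and the helper functions are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_audit_db():
    """In-memory database with an illustrative audit_logs schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_logs ("
        "id INTEGER PRIMARY KEY, action TEXT NOT NULL, "
        "actor TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    return conn


def record_action(conn, action, actor):
    """Append one row: what was done, by which credential, and when (UTC)."""
    conn.execute(
        "INSERT INTO audit_logs (action, actor, created_at) VALUES (?, ?, ?)",
        (action, actor, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def list_actions(conn):
    """Return (action, actor) pairs in insertion order."""
    return conn.execute("SELECT action, actor FROM audit_logs ORDER BY id").fetchall()
```

Writing the row in the same request handler that performs the action keeps the log and the action from drifting apart.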

Control 5: Security headers and CORS — default-on, not opt-in

Security headers are enabled by default (SECURITY_HEADERS_ENABLED=true in the base configuration). CORS origins are configured explicitly through CORS_ORIGINS — no wildcard default.

These are baseline controls that cost nothing and prevent a class of browser-based attacks. An API that stores internal document citations should not allow cross-origin requests from arbitrary origins.

For Azure deployment, the CORS origin list should enumerate only the dashboard Container App URL and any internal tools that call the query API directly.

What is not yet implemented

Entra ID and OIDC role derivation from token claims. The AUTH_PROVIDER=entra and AUTH_PROVIDER=oidc configuration paths are implemented and validate bearer JWTs against issuer, audience, expiration, and JWKS signing keys. Role mapping reads from roles, groups, or role token claims and defaults to employee when no role claim is present. End-to-end validation requires a live Azure tenant — it is not testable in the local environment.

Tenant isolation for multi-organization deployments. The current implementation assumes a single organization. Multi-tenant deployment — where organization A's documents are completely isolated from organization B — requires additional data model work and is a documented production consideration.

PII classification for ingested documents. The included reference documents are synthetic. Production ingestion should classify documents for PII content and apply explicit retention policies for prompts and generated answers before storing them.

Distributed rate limiting. The current in-memory rate limiter (RATE_LIMIT_PER_MINUTE) works correctly for single-instance deployments. Multi-instance production deployments require Redis-backed or API gateway rate limiting.

The security posture in plain terms

Enterprise RAG is designed for internal deployment by an engineering team with control over the document corpus, the user registry, and the infrastructure. The controls are appropriate for that context. The gaps (multi-tenant isolation, production-grade PII classification, distributed rate limiting) matter for a larger managed deployment and are documented rather than hidden.

Deploying this system to a shared or externally accessible environment without setting ADMIN_TOKEN is a configuration error, not an implementation gap. The controls are present. The operator must activate them.

Next engineering step

Enable ADMIN_TOKEN in your local .env, attempt to call POST /ingest without the token, and verify the endpoint returns a 401. Then call GET /audit-logs with the admin token and confirm the rejected attempt was logged. That sequence validates that management protection is enforced and that audit logging is capturing the right events.

One question for you

For internal tools that handle restricted documents, do you separate query credentials from management credentials? If a query key were compromised, could an attacker use it to ingest new documents or access the audit log?

Final post in this series: Deployment readiness — what is running locally, what the Azure path requires, and an honest list of what needs to be in place before this system handles real internal documents in production.

Four Metrics That Actually Tell You Whether Your Enterprise RAG Is Working

Manjunath — Thu, 21 May 2026 17:54:07 +0000

Enterprise RAG — A practitioner's build log | Post 4 of 6

RAG evaluation in most implementations stops at one question: did the system retrieve the document that was supposed to be retrieved?

That is the right question for a consumer search product. It is an incomplete question for an enterprise knowledge system where some documents are accessible to all employees and others are restricted by role. In that environment, retrieval correctness is necessary but not sufficient. You also need to know whether restricted documents leaked into answers they should not have influenced.

Enterprise RAG implements four metrics. Each measures a different failure mode. Together they constitute a validation standard that is appropriate for internal knowledge systems handling mixed-sensitivity documents.

The four metrics and what each catches

Pass rate — the percentage of evaluation cases where expected documents were retrieved and forbidden documents were not.

This is the composite metric. A case passes only when both conditions hold simultaneously: the expected documents appear in the citation set and no forbidden documents appear. A system that retrieves all expected documents but also leaks one forbidden document does not pass that case. Pass rate does not reward partial correctness.

Restricted leakage count — the number of evaluation cases that returned at least one forbidden document in the citation set.

This is the most operationally critical metric for an enterprise deployment. A restricted leakage count of zero means the role filter is working correctly across every test case in the evaluation set. Any non-zero value indicates a specific failure to investigate — which case, which document, which role, which query.

Citation coverage — the average number of citations returned per evaluation case.

Low citation coverage indicates that the retrieval system is returning answers without grounding them in source documents. In an enterprise context, an answer without citations cannot be verified, audited, or traced back to a source document. Citation coverage is a proxy for answer auditability.

Average latency (ms) — the mean query execution time across all evaluation cases.

Latency is not a quality metric. It is an operational baseline. If latency increases after a retrieval configuration change — switching from local lexical to Azure AI Search, adding a reranking step, increasing chunk count — the evaluation run captures it. Latency regression during evaluation is a signal worth investigating before deploying the change.
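The four definitions above translate directly into a small aggregation. This is a sketch under assumptions: the per-case field names (expected, forbidden, retrieved, latency_ms) are hypothetical, while the output keys match the response shown in this post.

```python
def summarize(cases):
    """Aggregate per-case results into the four run-level metrics."""
    total = len(cases)
    passed = leaks = citations = 0
    latency = 0.0
    for case in cases:
        retrieved = set(case["retrieved"])
        has_expected = set(case["expected"]) <= retrieved
        leaked = bool(set(case["forbidden"]) & retrieved)
        # A case passes only if every expected document is cited and none is forbidden.
        passed += int(has_expected and not leaked)
        leaks += int(leaked)
        citations += len(case["retrieved"])
        latency += case["latency_ms"]
    return {
        "pass_rate": passed / total if total else 0.0,
        "restricted_leak_count": leaks,
        "citation_coverage": citations / total if total else 0.0,
        "average_latency_ms": latency / total if total else 0.0,
    }
```

Note that a case which cites every expected document and one forbidden document counts toward the leak count and against the pass rate at the same time.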

What the evaluation output looks like

Running POST /eval/run returns all four metrics in a single response alongside the per-case results:

{
  "pass_rate": 1.0,
  "restricted_leak_count": 0,
  "citation_coverage": 2.4,
  "average_latency_ms": 38.2,
  "cases": [...]
}

Each case in the result array includes the question, the role used, the retrieved document IDs, and a pass/fail indicator. A failed case shows exactly which expected document was missing or which forbidden document leaked.

The evaluation runner calls POST /query internally for each case, which means it exercises the entire pipeline: authentication, role filter, retrieval, generation, and citation assembly. The metrics reflect actual system behavior, not a mocked retrieval path.

Why the evaluation set structure matters as much as the metrics

The evaluation set (demo/evaluation_set.json) defines each case with two fields alongside the question and role:

Expected document IDs — documents that must appear in the citation set for the case to pass
Forbidden document IDs — documents that must not appear in the citation set

Both lists are required for every case. An evaluation case without forbidden document IDs cannot measure restricted leakage. An evaluation set without any cases involving restricted documents cannot validate access control at all.
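A quick structural check can flag cases that cannot measure leakage. The field names below (question, role, expected_document_ids, forbidden_document_ids) and the sample document IDs are assumptions for illustration; the real schema of demo/evaluation_set.json may differ.

```python
def cases_without_forbidden(cases):
    """Return the questions of cases that list no forbidden document IDs."""
    return [case["question"] for case in cases if not case.get("forbidden_document_ids")]


# Illustrative cases only; not taken from the actual evaluation set.
SAMPLE_CASES = [
    {
        "question": "What is the leave policy?",
        "role": "employee",
        "expected_document_ids": ["hr-policy"],
        "forbidden_document_ids": ["finance-q3"],
    },
    {
        "question": "What is the Q3 forecast variance?",
        "role": "finance",
        "expected_document_ids": ["finance-q3"],
        "forbidden_document_ids": [],
    },
]

flagged = cases_without_forbidden(SAMPLE_CASES)
```

Any flagged case is a candidate for adding a restricted document that the role must not be able to retrieve.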

Most RAG golden sets I have reviewed define expected documents only. They measure retrieval recall but provide no signal on access control correctness. Adding forbidden document IDs to every test case involving a restricted document is the minimum viable evaluation standard for an enterprise knowledge system.

What the evaluation set does not yet cover

The current evaluation set is optimized for access-control validation and citation tracing. It does not yet cover:

Answer correctness — whether the generated answer is factually accurate relative to the cited documents. This requires human relevance labels or LLM-as-judge evaluation templates.
Semantic retrieval quality — the lexical retriever handles the current evaluation set well. A semantic retrieval configuration may return different ranked results that pass the access-control checks but rank differently by relevance.
Regression thresholds in CI — the evaluation runner is callable from CI. The current setup does not fail a CI run if pass rate drops below a threshold. Adding a threshold check (if restricted_leak_count > 0: fail) to the CI workflow is the practical next hardening step.

These are documented roadmap items, not silent gaps.
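The CI threshold check mentioned in the list could be as small as the sketch below. It assumes the metrics dictionary has the keys shown in the evaluation response; the function name and the default pass-rate threshold are hypothetical.

```python
def gate_failures(metrics, min_pass_rate=1.0):
    """Return a list of human-readable reasons the run should fail CI."""
    failures = []
    if metrics["restricted_leak_count"] > 0:
        failures.append(
            f"restricted_leak_count is {metrics['restricted_leak_count']}, expected 0"
        )
    if metrics["pass_rate"] < min_pass_rate:
        failures.append(
            f"pass_rate {metrics['pass_rate']} is below threshold {min_pass_rate}"
        )
    return failures
```

A CI step would fetch the evaluation response, call this function, print any failures, and exit with a non-zero status when the list is not empty.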

Current limits

Evaluation set size is small — calibrated for repeatable local validation, not production-scale coverage.
Answer correctness evaluation requires human labels or an LLM judge. Neither is implemented in the current evaluation runner.
Latency benchmarks reflect the local SQLite retrieval path. Azure AI Search latency profiles will differ.
The evaluation runner requires the ADMIN_TOKEN when management protection is enabled, which prevents accidental evaluation runs in shared environments.

Next engineering step

Review the evaluation cases in demo/evaluation_set.json and count how many include at least one forbidden document ID. If any cases have only expected documents, add a forbidden document ID for a restricted document that should not be retrievable by that role. Then re-run POST /eval/run and verify the leakage count remains zero.

One question for you

Does your current RAG evaluation set include forbidden document IDs alongside expected document IDs? If not, how do you validate that restricted documents are not influencing answers for unauthorized roles?

Next post: Security decisions in Enterprise RAG — how API keys, audit logs, and the order of role enforcement work together to make the system defensible.

Three Design Decisions That Shaped the Enterprise RAG Retrieval Pipeline

Manjunath — Thu, 21 May 2026 16:36:00 +0000

Enterprise RAG — A practitioner's build log | Post 3 of 6

A retrieval pipeline has more design surface than it appears. The technology choices — vector search, LLM provider, storage engine — get most of the attention. The structural choices — where filtering happens, how evaluation is wired, what the dashboard connects to — determine whether the system actually works correctly in a production environment.

This post documents three structural decisions I made in Enterprise RAG, the constraint that drove each one, and the cost I accepted.

Decision 1: Lexical retrieval before semantic — sequencing, not a permanent choice

The default retrieval implementation uses token cosine similarity against a local SQLite chunk store (RAG_RETRIEVAL_PROVIDER=local). Not vector embeddings. Not a managed search index. Lexical scoring.

This was a sequencing decision, not a technology preference.

The constraint: Access control validation requires a deterministic retrieval baseline. If retrieval results vary across runs — because embedding models update, because vector indices are rebuilt, because approximate nearest neighbor algorithms introduce non-determinism — the evaluation set becomes unreliable. A restricted_leak_count of zero means nothing if retrieval is non-deterministic and the same query might return different chunks tomorrow.

Lexical retrieval is fully deterministic. Given the same document corpus and the same query, it returns the same ranked chunk list every time. That makes the evaluation set a reliable regression test rather than a probabilistic snapshot.

The accepted cost: Lexical scoring does not capture semantic similarity. A question about "headcount reduction" will not retrieve a chunk that uses the phrase "workforce restructuring" unless there is token overlap. Semantic retrieval closes that gap — at the cost of determinism in the local validation environment.
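Token cosine similarity is simple enough to sketch in full. The tokenizer here (lowercased alphanumeric runs) and the function names are assumptions; the project's scorer may tokenize or weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_cosine(query, chunk):
    """Cosine similarity between the term-count vectors of query and chunk."""
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


no_overlap = token_cosine("headcount reduction", "workforce restructuring plan")
some_overlap = token_cosine("headcount reduction", "headcount reduction plan")
```

The score depends only on the two strings, so the same corpus and query always produce the same ranking. The first example scores exactly zero because the two phrases share no tokens, which is the semantic gap described above.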

The Azure AI Search adapter (RAG_RETRIEVAL_PROVIDER=azure_ai_search) is implemented for production use, where semantic and hybrid query modes are available. The retrieval provider is a configuration switch, not a code change. Switching from local to Azure AI Search does not alter the access control layer, the evaluation runner, or the API surface.

Decision 2: API-backed dashboard — not direct database access

The Streamlit dashboard (dashboard/app.py) connects to the FastAPI API layer, not the database directly. Every dashboard operation — querying documents, fetching metrics, running evaluations, reviewing the citation log — goes through an authenticated API call.

This was not a minor implementation choice. It was a deliberate architectural boundary.

The constraint: A dashboard that reads the database directly cannot be deployed in a containerized or cloud environment without granting the dashboard container database credentials. That creates a credential distribution problem: every new environment where the dashboard runs needs database access, which widens the credential surface.

An API-backed dashboard has a single credential requirement: the DASHBOARD_API_URL and optionally DASHBOARD_ADMIN_TOKEN. The dashboard container never holds database credentials. It holds only the API location and the management token. The API enforces authorization. The database credentials stay with the API container.
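From the dashboard's side, the only configuration it needs is the API location and, optionally, the management token. The sketch below builds (but does not send) such a request with the standard library. The helper name and the localhost URL are hypothetical; the environment variable names, the GET /audit-logs path, and the X-Admin-Token header come from this series.

```python
import urllib.request


def build_dashboard_request(env, path):
    """Build a GET request to the API using only dashboard-side settings."""
    base_url = env["DASHBOARD_API_URL"].rstrip("/")
    headers = {"Accept": "application/json"}
    token = env.get("DASHBOARD_ADMIN_TOKEN")
    if token:
        headers["X-Admin-Token"] = token  # management credential, not a database credential
    return urllib.request.Request(base_url + path, headers=headers, method="GET")


request = build_dashboard_request(
    {"DASHBOARD_API_URL": "http://localhost:8000/", "DASHBOARD_ADMIN_TOKEN": "example-token"},
    "/audit-logs",
)
```

Nothing in this configuration refers to a database host, user, or password, which is the point of the boundary.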

The accepted cost: Every dashboard operation adds one network hop compared to direct database access. For a local development setup this is negligible. For a cloud-deployed dashboard querying an API on the same virtual network, it is also negligible. The cost is only relevant if the dashboard is running in a significantly different network zone from the API — which would itself be an unusual deployment topology.

The secondary benefit: the API-backed dashboard tests the public API surface on every dashboard interaction. If the dashboard shows correct data, the API is returning correct data. That is a form of continuous integration that direct database access cannot provide.

Decision 3: Evaluation runner as a live API endpoint — not an offline script

The evaluation runner is exposed as POST /eval/run — a standard API endpoint that runs the evaluation set against the live query pipeline and returns metrics directly.

Most RAG evaluation setups I have seen are offline scripts: pull a golden set, run retrieval, compare results, write a report. The script does not call the production API. It calls the retrieval components directly, often with mocked or simplified versions of the access control layer.

The constraint: If the evaluation script bypasses the access control layer, it cannot detect access control failures. A restricted_leak_count computed by calling the retriever directly — without going through the role filter — will always be zero, regardless of whether the filter is actually working in production.

By routing evaluation through POST /eval/run, which calls POST /query internally, the evaluation runner tests the entire pipeline: authentication handling, role filter, retrieval, generation, and citation assembly. Every evaluation case exercises the same code path that a real user request exercises.

The accepted cost: Live evaluation runs against the production database. In a high-traffic environment, running a large evaluation set could add query load. The mitigation is to run evaluations at low-traffic windows or against a staging environment — not to move evaluation back to a disconnected script.

The current evaluation set is small and optimized for repeatable access-control checks. Extending it with larger golden sets, human relevance labels, and answer correctness checks is a documented roadmap item.

One decision I made explicitly not to make yet

Role metadata is currently embedded in document front matter: each markdown document has an allowed_roles field that specifies which roles can retrieve it. This is correct for a local deterministic environment where document metadata is under engineering control.

In production, role context should come from the identity provider — Entra ID claims or OIDC bearer token attributes — not from request body parameters or document-embedded metadata alone. I did not implement full Entra ID role claim integration because it requires a live Azure tenant to validate. The configuration path is documented and the AUTH_PROVIDER=entra setting is implemented. The end-to-end test of role-from-identity-claim requires a real identity provider.

That is a known gap. It is in the production considerations section of docs/security.md, not hidden in implementation comments.

Current limits

Lexical retrieval does not capture semantic similarity. Queries with no token overlap with document chunks will not retrieve relevant results even when the content is semantically related.
Evaluation set size is calibrated for local access-control validation. Answer quality evaluation — correctness labels, human relevance ratings — is a planned extension.
Entra ID role claim integration requires a live Azure tenant for end-to-end validation. The local implementation uses request-body role parameters, which must not be trusted in production without API key authentication.
The POST /eval/run endpoint requires the ADMIN_TOKEN when management protection is enabled. Evaluation runs in protected environments require the admin credential.

Next engineering step

Add one document to the corpus with allowed_roles: ["finance"], run POST /eval/run, and verify that the new document appears in the blocked count for non-finance evaluation cases. That single test confirms the role filter is reading document metadata correctly and applying it before scoring.

One question for you

Does your internal RAG evaluation pipeline call the same API endpoints that production queries use, or does it call retrieval components directly? If it bypasses the access control layer, does your restricted_leak_count metric actually measure anything?

Next post: The evaluation metrics that matter for enterprise RAG — and why pass rate alone is not enough to validate a system that handles restricted documents.

The Access Control Gap That Makes Most Enterprise RAG Systems Dangerous

Manjunath — Tue, 19 May 2026 20:19:00 +0000

Enterprise RAG — A practitioner's build log | Post 1 of 6

There is a retrieval failure mode that does not show up in accuracy benchmarks: a system that finds the right document but returns it to the wrong person.

Most RAG evaluation frameworks measure whether the retrieved chunks are relevant to the question. Few measure whether those chunks should have been retrievable at all given who asked. In an enterprise context — where the same knowledge base holds HR policy, engineering runbooks, finance forecasts, and security incident reports — that gap is not a minor edge case. It is a fundamental design flaw.

I built Enterprise RAG specifically to treat access control as a first-class retrieval requirement, not an afterthought applied after the answer is generated.

The problem with post-retrieval filtering

The naive approach to document access control in a RAG system is to retrieve first and filter second: score all candidate chunks by relevance, then strip out the ones the user is not allowed to see before generating the answer.

This approach fails in two ways.

It leaks into the answer. A generative model given 20 chunks — including 5 restricted ones — can synthesize information from all 20 even if the restricted chunks are stripped from the citation list before the response is returned. The model has already read the finance forecast before you decided not to show it.

It provides false assurance. Citation filtering gives the appearance of access control without enforcing it in the part of the pipeline that matters. An audit of the response shows no restricted citations. But the answer content may reflect them.

The correct architecture applies access control before retrieval scoring. Unauthorized chunks are excluded from the candidate set entirely. They are never ranked, never passed to the generator, and never cited.
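The ordering can be sketched in a few lines: filter by role first, then score only what remains. The data structure, the toy overlap scorer, and the sample documents are assumptions for illustration, not the project's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset  # roles permitted to retrieve this chunk


def filter_then_score(chunks, role, score):
    """Drop unauthorized chunks first, then rank only the remaining candidates."""
    allowed = [c for c in chunks if role in c.allowed_roles]
    blocked_count = len(chunks) - len(allowed)  # how many chunks the role filter removed
    ranked = sorted(allowed, key=score, reverse=True)
    return ranked, blocked_count


def overlap_score(query):
    """Toy scorer: number of distinct query tokens that appear in the chunk."""
    q = set(query.lower().split())
    return lambda chunk: len(q & set(chunk.text.lower().split()))


# Illustrative corpus only.
CORPUS = [
    Chunk("hr-policy", "leave policy for all staff", frozenset({"employee", "finance"})),
    Chunk("finance-q3", "q3 revenue variance forecast", frozenset({"finance"})),
]

scorer = overlap_score("q3 revenue variance")
employee_hits, employee_blocked = filter_then_score(CORPUS, "employee", scorer)
finance_hits, finance_blocked = filter_then_score(CORPUS, "finance", scorer)
```

For the employee role the finance chunk is never scored, so it cannot reach the generator regardless of how relevant it is to the question.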

What silently leaks in a typical internal knowledge base

Consider a company running a single internal knowledge base with documents across four categories:

HR and operations — visible to all employees
Engineering runbooks — visible to engineers and above
Finance forecasts and variance reports — visible to finance team and executives
Security incident reports — visible to engineers and security team

A standard question like "What was the revenue variance in Q3?" asked by an employee role against a system with post-retrieval filtering may return an answer that reflects finance data — even if the finance document does not appear in the citation list. The system retrieved it, scored it, passed it to the generator, and then quietly removed it from the citations.

That is not a hypothetical. It is the predictable behavior of any system where retrieval and access control are separate pipeline stages.

The validation test that most teams do not run

Before I built anything, I defined the evaluation test that the system had to pass:

Ask the same question as two different roles. The answer content and the citation list should differ based on role. If an employee and a finance manager ask "What is the Q3 forecast variance?" and receive answers that contain the same information — regardless of whether the citations differ — the access control is not working.

The evaluation set in Enterprise RAG includes explicit forbidden document IDs per test case. The restricted_leak_count metric counts how many evaluation cases returned at least one forbidden document. For a system with correct pre-retrieval access control, that count should be zero.
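As a rough illustration of that metric, the count can be computed from per-case results as below. The field names `retrieved_doc_ids` and `forbidden_doc_ids` are assumptions made for this sketch, not the project's actual schema.

```python
def restricted_leak_count(results):
    """Count evaluation cases whose retrieved documents include at least one forbidden ID."""
    return sum(
        1
        for case in results
        if set(case["retrieved_doc_ids"]) & set(case["forbidden_doc_ids"])
    )


# Hypothetical per-case results for the same question asked under two roles.
results = [
    {"role": "employee", "retrieved_doc_ids": ["hr-policy"], "forbidden_doc_ids": ["finance-q3"]},
    {"role": "finance", "retrieved_doc_ids": ["hr-policy", "finance-q3"], "forbidden_doc_ids": []},
]

leaked_case = {"role": "employee", "retrieved_doc_ids": ["finance-q3"], "forbidden_doc_ids": ["finance-q3"]}
```

A case counts once no matter how many forbidden documents it returned, which matches the "at least one" definition above.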

The screenshot above shows this test passing: the employee role receives an answer grounded in publicly accessible policy documents, while the finance role receives an answer that additionally cites the restricted finance document. Same question. Different retrieval sets. No leakage.

What this changes operationally

The operational implication is that RAG deployment in an enterprise knowledge base requires a different validation standard than consumer or internal-tooling RAG.

Retrieval relevance is not sufficient. You need:

A role model that maps document access to user identity
Pre-retrieval filtering enforced before scoring
An evaluation set that includes forbidden documents per role, not just expected documents
A restricted_leak_count metric tracked alongside pass rate and citation coverage

Without all four, you cannot know whether your system is leaking restricted content. You can only know whether it is retrieving relevant content — which is a different and less important question in an enterprise security context.

Current limits

The current implementation uses lexical retrieval with token cosine similarity scoring. Semantic or hybrid retrieval is a planned extension. Lexical retrieval is accurate enough for the validation workflow but does not match production semantic search quality.
Role metadata is embedded in document front matter. Production deployments should derive role context from Entra ID or an OIDC identity provider, not request body parameters.
The reference documents are synthetic. The evaluation set is calibrated for repeatable local validation, not a production-scale golden set.
Multi-tenant isolation is a documented production consideration. The current implementation is single-organization.

Next engineering step

Run POST /eval/run against the seeded demo data and check the restricted_leak_count. If it is zero, access control is enforced. Then modify the retrieval pipeline to apply scoring before filtering and observe what changes in the evaluation output.

One question for you

If you queried your internal knowledge base with a restricted finance document in the index today, would your evaluation set detect whether that document's content influenced the answer — or only whether it appeared in the citation list?

Next post: The architecture that puts access control before retrieval scoring, and why the order of operations is the entire design.

How Enterprise RAG Is Structured: Why Access Control Comes Before Retrieval Scoring

Manjunath — Tue, 19 May 2026 17:44:48 +0000

Enterprise RAG — A practitioner's build log | Post 2 of 6

The architecture of a RAG system is determined by one decision above all others: where in the pipeline does access control happen?

Get that order wrong and the entire system is structurally insecure regardless of how well the retrieval scores or how accurate the generated answers are. Get it right and the security guarantee holds even as the document corpus grows, roles change, and retrieval algorithms are swapped.

In Enterprise RAG, the order is fixed: role filtering runs before retrieval scoring. That single constraint drives every component boundary in the system.

Request flow: the order that matters

User → POST /query (question + user_role)
  ↓ Load all candidate document chunks
  ↓ Apply role filter (unauthorized chunks removed here)
  ↓ Score accessible chunks (token cosine similarity)
  ↓ Select top citations
  ↓ Generate answer from cited context only
  ↓ Persist query metrics and citation log
  ↓ Return: answer + citations + latency + retrieval metrics

The role filter sits between loading candidates and scoring them. The generator never receives an unauthorized chunk. The citation list cannot include what the generator never saw. The audit log records exactly which chunks were retrieved, which were filtered, and how many were blocked by role.

Component breakdown

FastAPI query API (enterprise_rag/api.py). Receives authenticated requests on POST /query. Derives the retrieval role from the X-API-Key header when present — key-holder role overrides any role supplied in the request body, preventing role elevation by callers. Falls back to request body role for unauthenticated queries.
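The role-resolution rule can be sketched independently of FastAPI. The in-memory `KEY_ROLES` table, the `resolve_role` name, and the choice to reject unknown keys are assumptions for illustration; only the precedence rule (key-holder role wins, body role is the unauthenticated fallback) and the SHA-256 key hashing come from this series.

```python
import hashlib

# Hypothetical key store: SHA-256 hex digest of the raw key -> role.
# Only the digest is stored, never the raw key.
KEY_ROLES = {hashlib.sha256(b"demo-finance-key").hexdigest(): "finance"}


def resolve_role(api_key, body_role):
    """Return the role used for retrieval filtering."""
    if api_key is not None:
        digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
        role = KEY_ROLES.get(digest)
        if role is None:
            # Assumption: an unknown or revoked key is rejected rather than
            # silently falling back to the request body role.
            raise PermissionError("unknown or revoked API key")
        return role  # the key-holder role overrides any role in the request body
    return body_role  # unauthenticated query: fall back to the body role
```

A caller holding a finance key who sends `"admin"` in the body still retrieves as `finance`.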

Role-based candidate filter. Loads document chunks from SQLite, then filters by the allowed_roles metadata field on each chunk. Accepted values include all, engineer, finance, and admin. A chunk with allowed_roles: ["finance", "admin"] is excluded from engineer and employee queries before a single relevance score is computed.

Lexical retriever. Scores the filtered candidate set using token cosine similarity. Because filtering happened upstream, the scorer operates only on chunks the requesting role is authorized to see. Retrieval quality metrics — retrieved chunk count, top retrieval score, RBAC-blocked chunk count — are captured per query.
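For readers unfamiliar with the scoring method, token cosine similarity can be sketched as follows. The tokenizer (lowercase alphanumeric runs) and the function names are assumptions; the project's retriever may tokenize or weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_cosine(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0  # empty input scores 0 instead of dividing by zero
```

Because the score depends only on shared tokens, a question phrased with synonyms scores zero against a relevant chunk. That is the relevance gap semantic retrieval is meant to close.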

Mock answer generator. Builds a deterministic answer from the top cited chunks. In LLM_PROVIDER=mock mode this runs without any provider key, making local validation fully reproducible. OpenAI and Azure OpenAI adapters are configuration-selectable for production use.

Query log and metrics store (SQLite). Every query persists: question, role, citations, latency, retrieved chunk count, and RBAC-blocked chunk count. This log is the audit record. It answers not just "what did the system return?" but "what was blocked and why?"
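A minimal sketch of such a log using the standard-library sqlite3 module is shown below. The table and column names are hypothetical and chosen to mirror the fields listed above; the project's actual schema may differ.

```python
import json
import sqlite3


def init_log(conn):
    """Create the query log table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS query_log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               question TEXT NOT NULL,
               role TEXT NOT NULL,
               citations TEXT NOT NULL,
               latency_ms REAL NOT NULL,
               retrieved_count INTEGER NOT NULL,
               rbac_blocked_count INTEGER NOT NULL
           )"""
    )


def log_query(conn, question, role, citations, latency_ms, retrieved_count, rbac_blocked_count):
    """Persist one query record; citations are stored as a JSON list of document IDs."""
    conn.execute(
        "INSERT INTO query_log "
        "(question, role, citations, latency_ms, retrieved_count, rbac_blocked_count) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (question, role, json.dumps(citations), latency_ms, retrieved_count, rbac_blocked_count),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_log(conn)
log_query(conn, "What was the Q3 variance?", "employee", ["hr-policy"], 12.5, 1, 1)
```

Storing the blocked count next to the citations is what lets an auditor see that the filter ran for a given query, not just what was returned.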

Evaluation runner (POST /eval/run). Runs the evaluation set against the live query pipeline. Reports pass rate, restricted leakage count, citation coverage, and average latency. Because the evaluation runner calls the same /query endpoint as a real user, it tests the entire pipeline end-to-end — not a mocked retrieval path.
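The four reported metrics could be aggregated from per-case results roughly as follows. This is a sketch under assumptions: the field names are hypothetical, and citation coverage is taken here to mean the share of cases that returned at least one citation, which may not match the project's exact definition.

```python
def summarize(results):
    """Aggregate per-case evaluation results into the four run-level metrics."""
    if not results:
        raise ValueError("evaluation produced no results")
    n = len(results)
    return {
        "pass_rate": sum(1 for r in results if r["passed"]) / n,
        "restricted_leak_count": sum(
            1 for r in results if set(r["retrieved"]) & set(r["forbidden"])
        ),
        # Assumed definition: fraction of cases with at least one citation.
        "citation_coverage": sum(1 for r in results if r["citations"]) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / n,
    }


results = [
    {"passed": True, "retrieved": ["hr-policy"], "forbidden": ["finance-q3"],
     "citations": ["hr-policy"], "latency_ms": 10.0},
    {"passed": False, "retrieved": [], "forbidden": [],
     "citations": [], "latency_ms": 20.0},
]
summary = summarize(results)
```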

API-backed Streamlit dashboard. The dashboard calls the FastAPI layer rather than reading the database directly. This is a deliberate design choice: the same API boundary used for the UI can be retained for containerized or Azure deployment without changes.

How Azure AI Search fits the same pipeline

The local retrieval implementation uses lexical scoring against SQLite chunks. The Azure AI Search adapter replaces the retriever component while keeping the same access control boundary:

Azure AI Search filter (before results are returned): allowed_roles/any(role: role eq 'all' or role eq '<user_role>')

The filter is applied server-side at the search index before results are returned to the application. The application layer role filter provides defense in depth, but the primary enforcement happens at the index level when Azure AI Search is the retrieval provider.
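Building that filter string from the resolved role might look like the sketch below. `build_role_filter` is a hypothetical helper name. Doubling single quotes is how OData escapes them in string literals, and it matters if role names can ever come from outside a fixed list.

```python
def build_role_filter(user_role: str) -> str:
    """Return an OData filter matching chunks visible to everyone or to user_role."""
    escaped = user_role.replace("'", "''")  # OData escapes a single quote by doubling it
    return f"allowed_roles/any(role: role eq 'all' or role eq '{escaped}')"
```

The resulting string would be passed as the `filter` argument of the search request, so unauthorized chunks are excluded by the service before ranking results reach the application.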

This is the correct architecture for a production deployment: access control enforced at two layers — index filter and application filter — so a misconfiguration at one layer does not compromise the other.

The local-to-Azure configuration switch

Every component in the local architecture has a direct Azure counterpart:

| Local | Azure |
|---|---|
| SQLite metadata and chunks | Azure PostgreSQL or Cosmos DB |
| Local markdown files | Azure Blob Storage |
| Lexical retriever | Azure AI Search |
| Mock answer generator | Azure OpenAI |
| Local API and dashboard | Azure Container Apps |
| Environment variables | Azure Key Vault |
| print / file logs | Application Insights |
| Local users and hashed keys | Microsoft Entra ID |

Switching between local and Azure requires only environment variable changes. No code path changes, no schema migrations between local and PostgreSQL — SQLAlchemy handles both.
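A configuration switch of this kind usually comes down to reading provider names from the environment with local defaults. The sketch below assumes hypothetical variable names (`RETRIEVER_PROVIDER`, `DATABASE_URL`) and default values; only `LLM_PROVIDER=mock` appears in this series.

```python
import os


def load_settings(env=None):
    """Read provider selection from environment variables, defaulting to the local stack."""
    env = os.environ if env is None else env
    return {
        "llm_provider": env.get("LLM_PROVIDER", "mock"),
        # Hypothetical variable names and defaults, for illustration only.
        "retriever_provider": env.get("RETRIEVER_PROVIDER", "lexical"),
        "database_url": env.get("DATABASE_URL", "sqlite:///enterprise_rag.db"),
    }


local = load_settings({})
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without mutating the process environment.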

Current limits

The local retriever uses lexical scoring. Semantic similarity and hybrid retrieval are planned Azure AI Search extensions. Lexical scoring is sufficient for deterministic local validation but will not match embedding-based relevance in production.
The dashboard is single-instance. Distributed session state and multi-instance deployments require additional coordination.
Rate limiting is in-memory per instance. Multi-instance production deployments require Redis-backed or API gateway rate limiting.
Tenant isolation for multi-organization deployments is a documented production consideration, not yet implemented.
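The rate-limiting limit is easiest to see in code. The sketch below is a hypothetical sliding-window limiter, not the project's implementation: each instance keeps its own counters in process memory, so two instances behind a load balancer each grant the full quota.

```python
import time
from collections import defaultdict, deque


class InMemoryRateLimiter:
    """Sliding-window limiter whose state lives only in this process."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent allowed requests

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop timestamps that have left the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Two application instances: their counters are not shared.
instance_a = InMemoryRateLimiter(limit=2, window_seconds=60)
instance_b = InMemoryRateLimiter(limit=2, window_seconds=60)
```

A client that has exhausted its quota on `instance_a` is still allowed on `instance_b`, which is why a shared store or a gateway-level limiter is needed once more than one instance is running.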

Next engineering step

Query the system as the employee role with a question that has a known restricted finance document in scope. Inspect the rbac_blocked_count field in the query log. Confirm that the blocked count is non-zero, which means the filter ran and excluded chunks before the answer was generated.

One question for you

In your current RAG architecture, at what stage does access control run — before chunk scoring, after chunk scoring, or only at the citation display layer? Do you have a metric that tracks how many chunks were filtered per query?

Next post: Three design decisions that shaped the retrieval pipeline — why lexical retrieval before semantic, why API-backed dashboard over direct database access, and why evaluation is built into the API rather than run as a separate offline script.

Building Production-Grade Human-in-the-Loop Workflow Automation with LangGraph

Manjunath — Mon, 18 May 2026 19:53:23 +0000

The Problem With Enterprise Approval Workflows

Most enterprise approval workflows are not systems. They are sequences of emails.

A compliance review is filed. Someone forwards it to a reviewer. The reviewer replies. A manager is CC'd. Someone updates a spreadsheet. Three days later, the spreadsheet has a new column that no one agreed to add.

When something goes wrong - a decision is disputed, an auditor asks questions, a regulator wants a decision log - the answer is in someone's inbox. If the reviewer has left the company, the answer may not be recoverable at all.

The pattern breaks down further when workflows cross systems. A procurement approval might require a vendor check, a budget validation, a legal review, and a final sign-off. Each step is handled by a different team, in a different system, with no shared state. When step three fails, starting over means starting from step one.

The technical problem is the absence of persistent, structured workflow state. A workflow that lives in email has no state. It can't be paused and resumed. It can't be audited. It can't be recovered if a step fails.

This post covers how I built a platform to solve this using LangGraph, FastAPI, and SQLite - with a production path to Azure.

Why LangGraph

The core requirement was a workflow engine that could pause at a human decision point and resume from that exact position - surviving server restarts between the pause and the resume.

LangGraph's StateGraph is well-suited to this because it separates the workflow structure from the workflow state. The graph is a set of nodes (agent functions) and edges (routing logic). The state is a typed dictionary that flows through the graph. Checkpointing saves the state at each transition.

Two specific LangGraph primitives made this practical:

interrupt_before: The graph can be compiled with a list of node names that should trigger an interrupt before execution. When the graph reaches one of those nodes, it halts, persists the current state to the checkpointer, and returns control to the caller. The graph resumes when explicitly invoked again with the same thread ID.
AsyncSqliteSaver: A persistent checkpoint backend that writes graph state to SQLite. Unlike the default MemorySaver, which is process-local, AsyncSqliteSaver persists across server restarts. The same checkpoint file is readable by any process with the correct connection string.

These two primitives are the foundation of the human-in-the-loop pattern described in the next section.

The Checkpoint Pattern

The most common mistake in stateful workflow systems is assuming process memory is durable.

If the workflow is running inside a long-lived process, and that process restarts, the workflow state is gone. In practice, this means every server restart, every deployment, and every crash silently kills every in-flight workflow.

The fix is to write state to a persistent store at every transition, not just at the end.

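# Requires the langgraph-checkpoint-sqlite package; older LangGraph releases
# exposed this class as langgraph.checkpoint.aiosqlite.AsyncSqliteSaver.
from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver

async with AsyncSqliteSaver.from_conn_string(CHECKPOINT_DB_URL) as checkpointer:
    graph = workflow_module.build_graph(checkpointer=checkpointer)
    # thread_id selects which workflow run's checkpoints are loaded and saved
    result = await graph.ainvoke(input_state, config={"configurable": {"thread_id": workflow_id}})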

Every call to ainvoke with the same thread_id resumes from the last persisted checkpoint. If the server restarts between the risk scoring step and the human review step, the next invocation picks up from risk scoring output - not from the beginning.
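The underlying idea does not depend on LangGraph. The following library-free sketch, with hypothetical step names and a hypothetical `checkpoints` table, writes state after every transition and resumes from the last saved position. It illustrates the pattern only; it is not how LangGraph's checkpointer is implemented.

```python
import json
import sqlite3

STEPS = ["intake", "risk_scoring", "human_review"]  # hypothetical workflow stages


def open_store(path=":memory:"):
    """Open the checkpoint store; use a file path to survive process restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    return conn


def run(conn, thread_id, stop_before=None):
    """Run the workflow from its last checkpoint, saving state after every step."""
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    next_step, state = (row[0], json.loads(row[1])) if row else (0, {"done": []})
    while next_step < len(STEPS):
        if STEPS[next_step] == stop_before:
            break  # pause here; the position is already persisted
        state["done"].append(STEPS[next_step])
        next_step += 1
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, next_step, json.dumps(state)),
        )
        conn.commit()
    return state


conn = open_store()
paused = run(conn, "wf-1", stop_before="human_review")
resumed = run(conn, "wf-1")
```

The second call does not repeat intake or risk scoring, because the stored position already points at the human review step.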
In production, CHECKPOINT_DB_URL is a Postgres connection string and the checkpointer is a Postgres-backed saver (LangGraph provides AsyncPostgresSaver in the langgraph-checkpoint-postgres package). The graph definition and node code do not change.

The Human Pause: Interrupt vs Polling

The conventional approach to human-in-the-loop is a polling loop: an agent writes a "pending review" flag to a database, and a background process polls until a human updates the flag.
This has two failure modes. First, the polling process itself is a point of failure - if it crashes, the workflow never resumes. Second, concurrent reviewers can both see "pending" and submit conflicting decisions before either decision is reflected.
The interrupt approach eliminates both.

graph = builder.compile(
 checkpointer=checkpointer,
 interrupt_before=["decision_agent"]
)

When the graph reaches decision_agent, it halts. The caller receives control. The workflow state is in the checkpoint store. No polling. No flags. No background process.
Resume happens via a single API call:

# Human submits decision via POST /api/workflows/{id}/decide
await graph.aupdate_state(
 config={"configurable": {"thread_id": workflow_id}},
 values={"human_decision": decision, "decision_notes": notes}
)
result = await graph.ainvoke(None, config={"configurable": {"thread_id": workflow_id}})

The graph loads the checkpoint, applies the updated state, and continues from decision_agent. The reviewer's decision, identity, and timestamp are written to the audit trail before the graph resumes.

Immutable Audit Trails

An audit trail that can be modified after the fact is not an audit trail.

Every event in this platform is appended to a log. No update operations. No delete operations. The audit logger exposes a single method:

await audit_logger.log(
 workflow_id=workflow_id,
 stage="risk_scoring",
 actor="SYSTEM",
 event_type="RISK_SCORE_COMPUTED",
 data={"risk_score": 74, "reasoning_summary": "Three rule failures in financial controls section"}
)

The data field is intentionally sanitized before logging. Document content - extracted text, raw field values, personal data - is never written to the audit trail. The log records what the system did (risk score computed, rule evaluated, human decision submitted) and the structured metadata that supports that record. Not the raw content that was processed.
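One way to enforce both properties is an allow-list on the data field combined with a logger that offers no update or delete method. The sketch below is synchronous and in-memory for brevity. `ALLOWED_DATA_KEYS`, `sanitize`, and `AppendOnlyAuditLog` are hypothetical names, and the platform's own logger is async and persistent.

```python
import copy
from datetime import datetime, timezone

# Hypothetical allow-list: only structured metadata keys may reach the audit trail.
ALLOWED_DATA_KEYS = {"risk_score", "reasoning_summary", "rule_id", "decision"}


def sanitize(data):
    """Drop every key that is not explicitly allowed, such as extracted document text."""
    return {k: v for k, v in data.items() if k in ALLOWED_DATA_KEYS}


class AppendOnlyAuditLog:
    """Audit log exposing a single write method: log()."""

    def __init__(self):
        self._events = []

    def log(self, workflow_id, stage, actor, event_type, data):
        self._events.append({
            "workflow_id": workflow_id,
            "stage": stage,
            "actor": actor,
            "event_type": event_type,
            "data": sanitize(data),
            "logged_at": datetime.now(timezone.utc).isoformat(),
        })

    def events(self):
        """Return copies so callers cannot mutate stored events."""
        return copy.deepcopy(self._events)


audit = AppendOnlyAuditLog()
audit.log("wf-1", "risk_scoring", "SYSTEM", "RISK_SCORE_COMPUTED",
          {"risk_score": 74, "extracted_text": "raw document content"})
```

An allow-list fails closed: a new field added upstream stays out of the log until someone decides it belongs there.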
This matters when the audit trail is itself subject to data retention requirements. A log that contains full document text is subject to the same retention and access controls as the document. A log that contains metadata is not.

Pluggable Workflow Registry

The architecture has a single orchestration engine and multiple workflow modules. Adding a new workflow requires one new folder in workflows/, implementing a standard interface:

from langgraph.graph import StateGraph


class WorkflowModule:
    name: str
    description: str

    def build_graph(self, checkpointer) -> StateGraph:
        ...

    def get_input_schema(self) -> dict:
        ...

The registry discovers and loads modules at startup. The API, the dashboard, and the audit trail require no changes when a new workflow is added.
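The discovery mechanism is not shown here, so the following is only one plausible shape for such a registry: a class decorator that indexes modules by name as they are imported. `WORKFLOW_REGISTRY`, `register_workflow`, and the example module are hypothetical.

```python
WORKFLOW_REGISTRY = {}  # workflow name -> module class


def register_workflow(cls):
    """Class decorator: index a workflow module by its declared name."""
    if cls.name in WORKFLOW_REGISTRY:
        raise ValueError(f"duplicate workflow name: {cls.name}")
    WORKFLOW_REGISTRY[cls.name] = cls
    return cls


@register_workflow
class ComplianceReview:
    name = "compliance_review"
    description = "Hypothetical example module"

    def get_input_schema(self):
        return {"type": "object", "properties": {"document_id": {"type": "string"}}}


def list_workflows():
    """What an API or dashboard could call to enumerate available workflows."""
    return sorted(WORKFLOW_REGISTRY)
```

With this shape, the API and dashboard only read the registry, which is why adding a module does not require touching them.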
The platform currently ships with two modules: compliance review and procurement. Both were added without modifying the orchestration engine. The third module - whatever it is - will be added the same way.

What This Enables

The compliance review workflow demonstrates the pattern at its most structured. Six automated stages produce a risk score and a rule evaluation before a human reviewer sees the workflow. The reviewer sees the complete automated analysis - not a summary, the full output - and submits a decision. The workflow generates a compliance certificate or a rejection report. The audit trail records every stage from document intake to certificate generation.
The same pattern applies to any workflow where:

Multiple sequential steps process the same input
A human decision is required at a defined checkpoint
The decision and its context must be traceable after the fact

Vendor onboarding, contract review, budget approval, incident escalation: all of these map cleanly to the same architecture. The platform is local-first, with a documented path to Azure: SQLite to Postgres, local file storage to Blob Storage, API keys to Key Vault, uvicorn to Container Apps. One environment variable change per component.

Conclusion

The technical foundation for reliable enterprise workflow automation is not complicated. Persistent state, genuine human-in-the-loop interrupts, and an immutable audit log cover the majority of the requirements in regulated industries.
The difficulty is in the details: checkpoints that survive restarts, interrupt/resume that doesn't require polling, audit logs that capture decisions without capturing personal data.
The full platform, including architecture diagrams, state machine documentation, a working demo, and 56 passing tests, is at:
https://github.com/manjunath-hanmantgad/multi-agent-orchestration

Built with LangGraph, FastAPI, SQLite (Postgres-ready), and Tailwind CSS.