Forem: SKasagar

Privacy-first RAG on Cloudflare's edge: here's everything I changed from the naïve baseline

SKasagar — Wed, 22 Apr 2026 19:54:22 +0000

Live app: localmind.caseonix.ca · originally posted as a 3-part series at caseonix.ca/notes

I built a privacy-first document intelligence platform called LocalMind. Documents are uploaded, classified, reviewed, and searched at the Cloudflare edge — content never leaves Cloudflare. It uses Vectorize for vector similarity, Workers AI for embeddings, and Google Gemma 4 (26B) for summaries, reviews, comparisons, and chat. Multi-tenant with namespace isolation per team.

This post stitches together my three lab notes on the RAG side of LocalMind:

The pipeline — how chunking → embedding → vector index → retrieval → generation is wired.
Quality improvements — eleven things I changed to move beyond the naïve "split, embed, top-K, stuff" baseline.
The NLP layer — the classical and LLM-based NLP that runs alongside RAG (PII, NER, table flattening, classification, structured analysis).

If you're shipping RAG on the edge, want to compare notes on contextual chunking, or are doing PIPEDA-aware document handling for a Canadian deployment — this is a long one but the diagrams should help.

Part 1 — The pipeline

How I wired retrieval-augmented generation end-to-end. Everything runs on the Cloudflare edge — Workers AI for inference, Vectorize for ANN, D1 for the canonical chunk text, R2 for the original blob. I use teamId as the Vectorize namespace on every read and write to keep tenants isolated.

Ingest pipeline (write path)

I put the whole ingest pipeline inside a per-team Durable Object so it can run async, hold rate-limit state, and use alarms to verify the vector index.

flowchart LR
    A[R2 file] --> B[parseDocument<br/>PDF/DOCX/XLSX/OCR]
    B --> C[flattenTables<br/>rejoin headers]
    C --> D[chunkText<br/>≤512 tok, 50 overlap]
    D --> E[redactPII<br/>regex + LLM]
    E --> F[(D1: document_chunks<br/>display copy)]
    E --> G[generateChunkContext<br/>Gemma — 1-2 sentence header]
    G --> H[generateEmbeddings<br/>BGE-Small, batch 32]
    H --> I[(Vectorize.upsert<br/>namespace=teamId)]
    I --> J[DO alarm — verify<br/>retry 5× / 90s]

Orchestration lives in a per-team Durable Object's processDocument. The main twist is a dual-text split: D1 stores the redacted-but-otherwise-original chunk that users see in citations, and Vectorize stores the embedding of a contextually enriched version where Gemma 4 has prepended a 1-2 sentence header that says where the chunk fits in the document. The header improves recall for chunks that would otherwise be too short, or too dependent on earlier text, to retrieve on their own, and it never shows up in the user-facing citation.

Chunking

The chunker is small and deterministic. It splits the text on paragraph breaks (\n{2,}) and packs whole paragraphs into a buffer until the buffer would go over MAX_CHUNK_TOKENS (512 tokens, approximated as 4 chars/token). When a paragraph won't fit, I flush the buffer and carry a 50-token tail forward as overlap so context isn't cut off in the middle of a thought. If a single paragraph is bigger than the cap on its own, I fall back to a sentence regex (/[^.!?]+[.!?]+\s*/g) and pack sentences instead. No tokenizer call, no LLM — paragraph structure is the main signal, and the token math is conservative on purpose.
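A minimal Python sketch of the packing logic described above, not LocalMind's actual code. The function names and the character-based overlap cut are assumptions; the constants and both regexes come from the text.

```python
import re

MAX_CHUNK_TOKENS = 512
CHUNK_OVERLAP_TOKENS = 50
CHARS_PER_TOKEN = 4  # rough approximation, no tokenizer call
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]+\s*")


def approx_tokens(text: str) -> int:
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def split_units(text: str) -> list[str]:
    """Paragraphs, with oversized paragraphs broken into sentences."""
    units: list[str] = []
    for para in re.split(r"\n{2,}", text):
        para = para.strip()
        if not para:
            continue
        if approx_tokens(para) <= MAX_CHUNK_TOKENS:
            units.append(para)
            continue
        last_end = 0
        for match in SENTENCE_RE.finditer(para):
            units.append(match.group().strip())
            last_end = match.end()
        rest = para[last_end:].strip()  # trailing text without end punctuation
        if rest:
            units.append(rest)
    return units


def chunk_text(text: str) -> list[str]:
    max_chars = MAX_CHUNK_TOKENS * CHARS_PER_TOKEN
    overlap_chars = CHUNK_OVERLAP_TOKENS * CHARS_PER_TOKEN
    chunks: list[str] = []
    buffer = ""
    for unit in split_units(text):
        candidate = f"{buffer}\n\n{unit}" if buffer else unit
        if buffer and len(candidate) > max_chars:
            chunks.append(buffer)
            # carry the tail of the flushed chunk forward as overlap
            buffer = f"{buffer[-overlap_chars:]}\n\n{unit}"
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks
```

In this sketch the overlap is cut by character count, so it can start mid-word, and a buffer that has just received overlap can run slightly over the cap. A real implementation would probably trim to a word or sentence boundary.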

Before chunks ever touch D1 or Vectorize, I run them through two PII passes (regex + LLM). The redaction is applied to both the display copy in D1 and the version that gets embedded — privacy was a hard requirement for me, so nothing identifiable leaves the chunk pipeline. (Detail in Part 3.)

Embedding

The embedding service is small on purpose — a single short module with two functions. It calls @cf/baai/bge-small-en-v1.5 via Workers AI for both the document side and the query side. On the query path I prepend BGE's asymmetric prefix "Represent this sentence for searching relevant passages: ". That's how BGE was trained on the query side, and using it raises cosine similarity for relevant chunks above the noise floor.
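The asymmetry is easy to get wrong, so here is a small Python sketch of it: only query text gets the prefix, document chunks are embedded as-is. The helper names are hypothetical; the prefix string is the one quoted above.

```python
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def embedding_input(text: str, *, is_query: bool) -> str:
    """Return the string to send to the embedder for a query or a document chunk."""
    return f"{BGE_QUERY_PREFIX}{text}" if is_query else text


def expanded_query_inputs(variants: list[str]) -> list[str]:
    """Apply the query prefix to every expanded query variant."""
    return [embedding_input(v, is_query=True) for v in variants]
```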

In the DO, I batch embeddings at 32, then call Vectorize.upsert in batches of 100. Vectorize is eventually consistent, so once the upsert returns I schedule a Durable Object alarm 90 seconds out. The alarm re-queries the first and last chunk IDs to check they're indexed, retries up to 5 times, and only marks the document ready for search once both ends are visible. If verification fails after the final retry, I flip the document to error so the user knows search won't work for it.
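To make the two batch sizes concrete, here is a Python sketch that only plans the calls. It doesn't touch Workers AI or Vectorize, and `plan_ingest` is a hypothetical helper, not part of any Cloudflare API.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
EMBED_BATCH = 32    # chunks per embedding call
UPSERT_BATCH = 100  # vectors per Vectorize upsert


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_ingest(chunk_ids: list[str]) -> dict:
    """Sizes of each embed/upsert call, plus the IDs the verify step samples."""
    return {
        "embed_calls": [len(b) for b in batched(chunk_ids, EMBED_BATCH)],
        "upsert_calls": [len(b) for b in batched(chunk_ids, UPSERT_BATCH)],
        "verify_ids": [chunk_ids[0], chunk_ids[-1]] if chunk_ids else [],
    }
```

For a 250-chunk document this works out to eight embedding calls and three upserts, with the first and last chunk IDs sampled for the visibility check.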

Retrieval & generation (read path)

flowchart TB
    Q[user query] --> E1[expandQuery<br/>Gemma → 2-3 variants]
    E1 --> E2[generateQueryEmbedding ×N<br/>BGE prefix applied]
    E2 --> V[Vectorize.query ×N<br/>namespace=teamId, topK=15]
    V --> M[merge + dedupe<br/>max score per chunk]
    M --> H[hydrate from D1<br/>drop sim < 0.3]
    H --> R[rerankChunks<br/>Gemma 1-5, keep ≥3, top 7]
    R --> C[buildContextFromChunks<br/>≤6000 tok, ≤2/doc team-wide]
    C --> S[synthesizeAnswer<br/>Gemma + history + cite]
    S --> A[answer + sources]

Step by step:

Query expansion. Gemma rewrites the user's query into 2-3 variants under a JSON schema. The original is always kept, capped at 3. If the LLM call fails I fall back to the original query only.
Multi-vector search. I embed each variant with the BGE query prefix and dispatch all of them in parallel to Vectorize.query, scoped to the team's namespace with topK = 15.
Merge and dedupe. I union results by chunk ID and keep the highest cosine score across variants.
Hydrate and threshold. I load chunk text from D1 in a single inArray query, join it to document names, and drop anything below MIN_SIMILARITY = 0.3. (BGE-Small puts truly relevant chunks in the 0.35-0.55 band on my corpus, so 0.3 is the tuned floor.)
LLM rerank. I send the top 15 candidates to Gemma with a strict 1-5 relevance rubric, again JSON-schema constrained. I cut anything below 3 and keep the top 7. If fewer than 2 chunks pass, I fall back to the vector ranking.
Context assembly. I concatenate chunks with [N] filename headers and --- separators, stopping at MAX_CONTEXT_TOKENS = 6000. For team-wide searches I cap each document at 2 chunks so one chatty document can't crowd out the others.
Synthesis. The system prompt binds Gemma to the supplied context, requires [Document: filename] citations, and forbids external knowledge. Conversation history is walked newest-first and trimmed at 1500 tokens.
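The merge, dedupe, and threshold steps can be sketched in a few lines of Python. This assumes each variant's result is a list of (chunk ID, cosine score) pairs, which is a simplification of what Vectorize returns, and it applies the floor right after the merge rather than after hydration.

```python
MIN_SIMILARITY = 0.3


def merge_matches(per_variant: list[list[tuple[str, float]]]) -> list[tuple[str, float]]:
    """Union results by chunk ID, keep the max score, drop anything under the floor."""
    best: dict[str, float] = {}
    for matches in per_variant:
        for chunk_id, score in matches:
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    kept = [(cid, s) for cid, s in best.items() if s >= MIN_SIMILARITY]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

A chunk found by two variants appears once, with its best score, so it gets credit for matching without being counted twice.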

Tunables — single source of truth

Knob	Value	What it controls
`MAX_CHUNK_TOKENS`	512	chunker upper bound
`CHUNK_OVERLAP_TOKENS`	50	tail carried between chunks
`TOP_K`	15	vectors retrieved per query variant
`MIN_SIMILARITY`	0.3	post-vector floor
`QUERY_EXPANSION_MAX`	3	LLM rewrites per query
`RERANK_TOP_K`	7	chunks kept after rerank
`RERANK_MIN_SCORE`	3	Gemma relevance cutoff (1-5)
`MAX_CONTEXT_TOKENS`	6000	budget for synthesis prompt
`MAX_CHUNKS_PER_DOC_TEAM_SEARCH`	2	per-doc cap in team-wide context
`MAX_HISTORY_TOKENS`	1500	conversation history kept for synthesis

Models

Embedder: @cf/baai/bge-small-en-v1.5 (BGE-Small, 384-dim) on Workers AI.
Generator / reranker / classifier: @cf/google/gemma-4-26b-a4b-it, configurable.
Vector index: Cloudflare Vectorize, partitioned by teamId namespace.
Chunk store: Cloudflare D1 (SQLite) via Drizzle.
Blob store: Cloudflare R2.

Part 2 — How I improved RAG quality

A naïve "split by N tokens, embed, top-K cosine, stuff into prompt" pipeline is the baseline most RAG tutorials give you. I moved away from that baseline at almost every layer of LocalMind because each step had a real failure mode in my testing.

Baseline vs. what I built

flowchart TB
    subgraph Baseline["Naïve baseline RAG"]
        B1[fixed-window<br/>chunking] --> B2[embed<br/>raw chunks]
        B2 --> B3[single-query<br/>top-K cosine]
        B3 --> B4[stuff into<br/>prompt]
    end
    subgraph LocalMind["LocalMind RAG"]
        L1[paragraph-aware chunk<br/>+ overlap<br/>+ table flatten<br/>+ PII redact] --> L2[contextual prepend<br/>+ BGE query prefix<br/>+ embed]
        L2 --> L3[query expansion ×N<br/>+ multi-vector merge<br/>+ similarity floor 0.3<br/>+ Gemma rerank 1-5]
        L3 --> L4[token-budgeted context<br/>+ per-doc fairness cap<br/>+ citation-bound synthesis<br/>+ history trim]
    end
    Baseline -.evolved into.-> LocalMind

Below, each change is shown as the problem it fixes.

Indexing-side improvements

flowchart LR
    R[raw text] --> C1[paragraph-aware<br/>chunking]
    C1 --> C2[table<br/>flattening]
    C2 --> C3[PII redaction<br/>before embed]
    C3 --> C4[contextual<br/>prepend]
    C4 --> C5[BGE query/doc<br/>asymmetry]
    C5 --> V[(Vectorize)]

Paragraph-aware chunking instead of fixed windows. Naïve chunkers slice every N characters, which often cuts across sentence and paragraph boundaries and embeds half-thoughts. My chunker packs whole paragraphs first, only falls back to sentence splitting when a single paragraph overflows, and carries a 50-token overlap on every flush so cross-boundary references survive. Embeddings stay tied to coherent semantic units, and overlap covers the cases where a discussion does cross paragraphs.

Table flattening before chunking. PDFs and DOCX exports lose tabular structure on extraction — column headers end up in row 1 and the data rows below them lose all meaning ("$4.2M" with no header is useless to a retriever). My table flattener walks the parsed text and re-attaches headers to each row so a query like "Q3 revenue" has something to match against in the embedding space. (Detail in Part 3.)

Contextual chunking (the biggest single win). I applied Anthropic's contextual-retrieval idea here. For each chunk, Gemma writes a 1-2 sentence header that says where the chunk fits in the broader document, for example "This chunk is from the risk-factors section of Acme's 2024 10-K and discusses supply-chain exposure to…". I prepend that header to the chunk before embedding, so the vector encodes both the local content and its global context. The display copy in D1 never gets the header, so users still see the clean (PII-redacted) chunk text in citations.

flowchart LR
    OC[original chunk] --> D1[(D1 — display copy)]
    OC --> CTX[Gemma writes<br/>1-2 sentence header]
    CTX --> ENR[contextualized chunk<br/>= header + original]
    ENR --> EMB[BGE embed]
    EMB --> VZ[(Vectorize — retrieval copy)]

This single change saves a lot of short chunks and back-referencing chunks ("the company also reported…" — which company?) that vanilla embeddings can't disambiguate.
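A small Python sketch of the dual-text split, assuming the header has already been generated. `ChunkTexts` and `build_chunk_texts` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkTexts:
    display_text: str    # stored in D1, shown in citations
    retrieval_text: str  # embedded, its vector goes to Vectorize


def build_chunk_texts(redacted_chunk: str, context_header: str) -> ChunkTexts:
    """Prepend the header for embedding only; the display copy never gets it."""
    header = context_header.strip()
    retrieval = f"{header}\n\n{redacted_chunk}" if header else redacted_chunk
    return ChunkTexts(display_text=redacted_chunk, retrieval_text=retrieval)
```

If header generation fails and the header comes back empty, this sketch simply embeds the chunk on its own. That fallback is an assumption rather than something the post specifies.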

BGE asymmetric query encoding. BGE-Small was trained with a specific instruction prefix on the query side. Most pipelines ignore it. I prepend "Represent this sentence for searching relevant passages: " to every query before embedding, which matches the model's training distribution and raises cosine similarity for relevant chunks above the noise level.

PII redaction before embedding. Privacy-first is the headline requirement of LocalMind, but it's also a retrieval lever: I redact chunks before they're embedded, so PII never gets encoded into the vector space. A side benefit is that chunks cluster by topic instead of by random identifiers like phone numbers.

Retrieval-side improvements

flowchart TB
    Q[user query] --> QE[query expansion<br/>2-3 variants]
    QE --> MV[multi-vector search<br/>parallel topK=15]
    MV --> MD[merge + dedupe<br/>max score per chunk]
    MD --> TH[similarity floor<br/>≥ 0.3]
    TH --> RR[LLM rerank<br/>Gemma 1-5]
    RR --> CT[token-budgeted context<br/>+ per-doc fairness cap]
    CT --> SY[citation-bound<br/>synthesis]

Query expansion + multi-vector search. A single query phrasing often misses chunks that use different vocabulary ("revenue" vs. "top line", "termination" vs. "wrongful dismissal"). I have Gemma rewrite the query into 2-3 alternative phrasings, embed each one with the BGE prefix, and run all variants in parallel against Vectorize. I merge results by chunk ID, keeping the max score across variants, so a chunk surfaced by multiple phrasings gets credit but isn't double-counted. Recall goes up without a precision hit.

Tuned similarity floor. A naïve top-K returns K chunks no matter what — including garbage when the corpus has nothing relevant. I profiled BGE-Small's score distribution on my actual corpus: relevant chunks land at 0.35-0.55, noise sits below 0.3. My MIN_SIMILARITY = 0.3 floor cuts off the noise band, and if zero chunks pass I short-circuit to "I couldn't find relevant information" instead of letting the model hallucinate from low-quality context.

Two-stage retrieval — vector then LLM rerank. Cosine similarity is fast but coarse. I retrieve a wide candidate set (top 15 per variant) then hand it to Gemma with a strict 1-5 relevance rubric and a JSON-schema output. Anything scoring below 3 is dropped; I keep the top 7. Reranking boosts chunks that answer the question over chunks that just share vocabulary with it, which is the failure mode pure-vector retrieval is most prone to. I added a fallback: if fewer than 2 chunks pass the rerank threshold I revert to vector ordering rather than ship a near-empty context.
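The selection rule after reranking, including the fallback, might look like the Python sketch below. It assumes the reranker returns a chunk-ID-to-score mapping (or nothing at all on failure) and that the fallback also keeps the top 7 by vector rank; the post doesn't say how many chunks the fallback keeps.

```python
from typing import Optional

RERANK_MIN_SCORE = 3
RERANK_TOP_K = 7
MIN_PASSING = 2


def select_after_rerank(candidates: list[str], scores: Optional[dict[str, int]]) -> list[str]:
    """candidates are chunk IDs in vector-rank order; scores are 1-5, None if the call failed."""
    if scores is not None:
        passing = [c for c in candidates if scores.get(c, 0) >= RERANK_MIN_SCORE]
        if len(passing) >= MIN_PASSING:
            # stable sort: equal scores keep their vector order
            passing.sort(key=lambda c: scores[c], reverse=True)
            return passing[:RERANK_TOP_K]
    return candidates[:RERANK_TOP_K]  # fall back to vector ranking
```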

Per-document fairness cap on team-wide search. Without a cap, one document with high lexical overlap can take over the top 7 and crowd out other documents with more diverse but still-relevant context. MAX_CHUNKS_PER_DOC_TEAM_SEARCH = 2 enforces breadth across the result set when the search spans multiple documents.

Token-budgeted context assembly with citations. The context builder enforces a 6000-token cap and writes each chunk with a [N] filename header and --- separator. The synthesis system prompt binds Gemma to the supplied context, forbids external knowledge, and requires [Document: filename] citations — so answers are grounded and checkable, not filled in from the model's training data.
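Here is a Python sketch of a context builder that combines the token budget with the per-document cap. It keys the cap on filename for brevity (a real implementation would more likely use a document ID) and reuses the 4 chars/token approximation from the chunker.

```python
from typing import Optional

MAX_CONTEXT_TOKENS = 6000
CHARS_PER_TOKEN = 4
SEPARATOR = "\n---\n"


def build_context(
    chunks: list[tuple[str, str]],
    max_tokens: int = MAX_CONTEXT_TOKENS,
    per_doc_cap: Optional[int] = None,
) -> str:
    """chunks are (filename, text) pairs in rank order."""
    budget = max_tokens * CHARS_PER_TOKEN
    parts: list[str] = []
    per_doc: dict[str, int] = {}
    used = 0
    for filename, text in chunks:
        if per_doc_cap is not None and per_doc.get(filename, 0) >= per_doc_cap:
            continue  # this document already has its share
        block = f"[{len(parts) + 1}] {filename}\n{text}"
        cost = len(block) + len(SEPARATOR)
        if used + cost > budget:
            break  # stop at the token budget
        parts.append(block)
        used += cost
        per_doc[filename] = per_doc.get(filename, 0) + 1
    return SEPARATOR.join(parts)
```

Passing `per_doc_cap=2` corresponds to the team-wide search case; leaving it as `None` corresponds to a single-document search.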

Conversation history trim. Long chat sessions can blow the context window. I walk the history newest-first and trim it at MAX_HISTORY_TOKENS = 1500, so older turns drop off but the recent thread stays intact.
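The newest-first walk is short enough to show in full. This Python sketch treats each turn as a plain string and uses the same 4 chars/token estimate as the chunker, both of which are assumptions.

```python
MAX_HISTORY_TOKENS = 1500
CHARS_PER_TOKEN = 4


def approx_tokens(text: str) -> int:
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def trim_history(turns: list[str], max_tokens: int = MAX_HISTORY_TOKENS) -> list[str]:
    """Keep the most recent turns that fit the budget, returned oldest-first."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # newest first
        cost = approx_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    return kept
```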

Reliability improvements

Eventual-consistency handling. Vectorize is eventually consistent — the upsert returning successfully doesn't mean the chunks are queryable yet. A document marked "ready" before its vectors are indexed produces empty searches and looks like a quality bug from the user's seat. I sample first/last chunk IDs after upsert via a Durable Object alarm, retry on failure (5 attempts × 90s), and only flip status to ready once both ends are visible — or to error if they never show up.
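The alarm loop boils down to a small decision function. This Python sketch models only the decision, not the Durable Object alarm API; the intermediate status name "verifying" is hypothetical.

```python
VERIFY_DELAY_SECONDS = 90
MAX_VERIFY_ATTEMPTS = 5


def next_verify_step(attempt: int, first_visible: bool, last_visible: bool) -> dict:
    """attempt is 1-based: the number of the check that just ran."""
    if first_visible and last_visible:
        return {"status": "ready", "retry_in": None}
    if attempt >= MAX_VERIFY_ATTEMPTS:
        return {"status": "error", "retry_in": None}
    # not visible yet: schedule another alarm
    return {"status": "verifying", "retry_in": VERIFY_DELAY_SECONDS}
```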

Fallback at every LLM call. Each LLM step (query expansion, reranking, synthesis) has an explicit fallback path. Expansion failure → original query only. Rerank failure or fewer than 2 passing chunks → vector ranking. Synthesis failure surfaces as a clear error rather than partial output. The pipeline never quietly produces lower-quality results without saying so in the logs.

Score-keeping

Failure mode I observed	What I changed
Half-sentence chunks	Paragraph-aware chunker + overlap
Detached table values	Table flattener re-attaches headers
Back-referencing / short chunks unfindable	Contextual prepend before embed
Weak query embeddings	BGE instruction prefix
Vocabulary-mismatch misses	Query expansion + multi-vector merge
Low-quality top-K	0.3 similarity floor
Vocab-match ≠ answer-match	Gemma rerank with 1-5 rubric
One doc dominates results	Per-document fairness cap
Hallucination	Citation-bound system prompt
Long chats blow context	History token cap
Vectors not yet indexed	DO alarm verification loop

Part 3 — The NLP layer alongside RAG

RAG is the headline feature, but a fair amount of classical and LLM-based NLP runs alongside it during ingest. Each pass produces structured fields I persist on the documents row and surface in the UI, and several of them feed back into retrieval quality.

NLP fan-out during ingest

flowchart TB
    P[parseDocument<br/>PDF/DOCX/XLSX/OCR] --> TF[tableFlattener<br/>re-attach headers]
    TF --> R1[regex PII redactor<br/>SSN/SIN/CC/addr/...]
    R1 --> R2[LLM PII detector<br/>medical/financial/...]
    R2 --> CL[redacted text]
    CL --> N1[regex NER<br/>email/phone/date/$/%/url]
    CL --> N2[document classifier<br/>contract/financial/policy/<br/>hr/invoice/general]
    CL --> N3[document analysis<br/>title + description + topics +<br/>key points + people + orgs +<br/>sentiment + risk flags]
    CL --> N4[document review agent<br/>per-type checklist + risks]
    N1 --> M[merge entities<br/>regex + LLM]
    N3 --> M
    M --> DB[(D1: documents row)]
    N2 --> DB
    N3 --> DB
    N4 --> DB

Layered PII redaction (regex + LLM)

Privacy is the headline requirement of LocalMind, so PII handling has two layers — deterministic patterns first, LLM second.

flowchart LR
    T[chunk text] --> R[regex pass<br/>12 categories]
    R --> P[LLM pass<br/>10 sensitive categories]
    P --> O[fully redacted text]
    R -.detected types.-> M[piiTypes set]
    P -.detected categories.-> M
    M --> DB[(documents.pii_types)]

Regex pass

I wrote a pattern bank for the 12 deterministic PII categories: SSN, SIN, credit card, IP address, DOB, Canadian postal code, US ZIP, street address, PO box, health card, passport number, driver's license. Three redaction modes: full ([SSN REDACTED]), partial (***-**-1234), and strip (delete the match).

Partial mode is the default. It keeps the last few digits or characters visible so a human reviewer can still cross-reference (****-****-****-9876) without exposing the full value. Each pattern has its own masking function so the partial form makes sense for the data type — for a postal code I keep the FSA, for an address I keep the street type word, for a phone-like field I keep the last 4.
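A Python sketch of partial mode for two of the categories, with one masking function per pattern. The regexes here are simplified stand-ins, not the actual pattern bank.

```python
import re

CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")
POSTAL_RE = re.compile(r"\b([A-Za-z]\d[A-Za-z])[ -]?(\d[A-Za-z]\d)\b")


def mask_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    return f"****-****-****-{digits[-4:]}"  # keep the last 4 digits


def mask_postal(match: re.Match) -> str:
    return f"{match.group(1).upper()} ***"  # keep the FSA, hide the rest


PARTIAL_MASKS = (
    ("credit_card", CARD_RE, mask_card),
    ("postal_code", POSTAL_RE, mask_postal),
)


def redact_partial(text: str) -> tuple[str, set[str]]:
    """Return the masked text and the set of PII types that were found."""
    found: set[str] = set()
    for label, pattern, mask in PARTIAL_MASKS:
        text, count = pattern.subn(mask, text)
        if count:
            found.add(label)
    return text, found
```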

LLM pass

Regex can't catch unstructured sensitive PII like "diagnosed with Type 2 diabetes" or "donated to the Liberal Party". For that I run Gemma over the regex-cleaned text with a JSON schema that enforces 10 sensitive categories: medical condition, medication, financial detail, ethnic/racial, religious, sexual orientation, criminal/legal, biometric, family relationship, political opinion.

The model returns exact text spans (not paraphrases), which I apply via replaceAll. Only the flagged spans change, so the rest of the chunk stays intact and the labels stay predictable ([MEDICAL CONDITION REDACTED], etc). I framed the categories against PIPEDA / PHIPA so the output maps to a real Canadian compliance posture.

Both passes union their detected types into the documents.pii_types array, which the UI shows as a chip strip.
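Applying the LLM's spans is a plain string replacement. In this Python sketch the shape of `findings` (a list of text/category objects) is an assumption about the schema; it also replaces longer spans first so a short span can't split a longer one.

```python
def apply_llm_redactions(text: str, findings: list[dict]) -> tuple[str, set[str]]:
    """findings look like [{"text": "<exact span>", "category": "medical condition"}, ...]."""
    detected: set[str] = set()
    for finding in sorted(findings, key=lambda f: len(f["text"]), reverse=True):
        span = finding["text"]
        if span and span in text:  # skip spans the model made up
            label = f"[{finding['category'].upper()} REDACTED]"
            text = text.replace(span, label)  # replaces every occurrence
            detected.add(finding["category"])
    return text, detected
```

Spans that don't occur verbatim in the text are skipped rather than fuzzy-matched, which is what keeps the replacement exact.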

Hybrid named-entity recognition

Two NER passes run on every document and their outputs are merged before persisting.

Regex NER

Six deterministic categories with a pattern bank:

Type	Pattern handles
`email`	standard RFC-ish addresses, lowercased on normalize
`phone`	NANP with optional country code, separators stripped on normalize
`date`	ISO `2024-03-15`, slashed `3/15/24`, written `March 15, 2024`, fiscal `Q3 2024`
`currency`	`$`, `€`, `£` with K/M/B suffixes (`$4.2M`)
`percentage`	`12.5%`
`url`	`http(s)://...`

After matching, each (type, normalized-value) pair is counted, and the result is sorted by frequency. The most-mentioned entities float to the top of the entities list — useful for "what is this document about" at a glance.
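Counting by (type, normalized value) and sorting by frequency maps directly onto `collections.Counter`. The Python sketch below covers three of the six categories with simplified patterns that are illustrative, not the real pattern bank.

```python
import re
from collections import Counter

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
    "currency": re.compile(r"[$€£]\d[\d,]*(?:\.\d+)?[KMB]?"),
}


def normalize(kind: str, value: str) -> str:
    return value.lower() if kind == "email" else value


def extract_entities(text: str) -> list[dict]:
    """Count each (type, normalized value) pair, most frequent first."""
    counts: Counter = Counter()
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            counts[(kind, normalize(kind, match))] += 1
    return [
        {"type": kind, "value": value, "count": n}
        for (kind, value), n in counts.most_common()
    ]
```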

LLM NER

The document analysis pass also returns people and organizations arrays — the fuzzy entities that regex can't reliably extract. These get merged with the regex entities and stored on documents.entities. So the entities list ends up being a hybrid: deterministic stuff with frequency counts plus LLM-extracted names.

Table structure recovery

This is the bit I'm proudest of, and it's a pure-NLP problem disguised as text processing.

Document parsers (PDF text extraction, XLSX) destroy table semantics: column headers end up on one line, data rows on subsequent lines, and after chunking the headers and values are completely disconnected. A row that originally read "Q3 2024 | Revenue | $4.2M | +18%" comes out of the parser as a strip of numbers with no header context, which is useless for both retrieval and reading.

The flattener detects two table shapes and re-emits each row as a Header: Value | Header: Value | … line.

flowchart TB
    L[parsed lines] --> D{table?}
    D -->|"CSV: 3+ comma fields, 2+ rows"| FC[flatten CSV]
    D -->|"whitespace: 3+ cols, 3+ rows"| FW[flatten whitespace]
    D -->|no| K[keep line]
    FW --> G["skip if numeric headers<br/>or wide financial"]
    FC --> O["Header1: Val1 #124; Header2: Val2 #124; ..."]
    FW --> O
    G --> K
    K --> R[output text]
    O --> R

The interesting parts:

CSV detection guards against $32,100 false positives: if any field has a run of 2+ whitespace characters inside it, the commas are probably thousands separators inside numbers, not delimiters, so the detector rejects the block as a CSV table.
Whitespace-aligned table detection finds gap regions (positions where 2+ consecutive characters are spaces across all rows) to infer column boundaries. It needs at least 3 rows so single-line false positives don't slip through.
Financial-statement guard. Wide whitespace tables where the "headers" are mostly numeric (column dates, $ totals) get skipped — flattening makes them harder to read, not easier. Narrow invoice/pricing tables (3-4 columns) still get flattened.
Sanity check. If the flattened output is more than 2× the input length, something went wrong with boundary detection — fall back to the original.

Each flattened row becomes a self-contained semantic unit that embeds well, so a query like "Q3 revenue" can match against the actual data.
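For the CSV shape, the core of the flattening can be sketched in Python as below. This is a simplified illustration under stated assumptions: it takes a block of lines already suspected to be a table, treats the first line as the header, and leaves out the whitespace-aligned shape, the financial-statement guard, and the length sanity check.

```python
import re


def flatten_csv_block(lines: list[str]) -> list[str]:
    """Re-emit each data row as 'Header: Value | Header: Value'; return lines unchanged if not a table."""
    if len(lines) < 2:
        return lines
    rows = [[cell.strip() for cell in line.split(",")] for line in lines]
    width = len(rows[0])
    if width < 3 or any(len(row) != width for row in rows):
        return lines
    # wide gaps inside a field suggest the commas belong to numbers like $32,100
    if any(re.search(r"\s{2,}", cell) for row in rows for cell in row):
        return lines
    header, data = rows[0], rows[1:]
    return [" | ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in data]
```

The second guard is the `$32,100` case: a whitespace-aligned line of dollar amounts splits into three "fields" on its commas, but the wide gaps inside those fields give it away.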

Document classification

A single Gemma call buckets each document into one of six types: contract, financial, policy, hr, invoice, general. Falls through to general if the model misbehaves.

The classification is the routing key for the downstream review agent — each document type gets a different review checklist. So classification isn't just a UI tag, it's a control-flow decision.
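The fall-through and the routing can be sketched together in Python. The six labels come from the text; the checklist naming scheme is hypothetical.

```python
from typing import Optional

DOCUMENT_TYPES = ("contract", "financial", "policy", "hr", "invoice", "general")


def parse_document_type(raw: Optional[str]) -> str:
    """Normalize the model's label; anything unexpected becomes 'general'."""
    label = (raw or "").strip().lower()
    return label if label in DOCUMENT_TYPES else "general"


def review_checklist_for(raw_label: Optional[str]) -> str:
    """Pick the review rubric from the classifier output (names are hypothetical)."""
    return f"{parse_document_type(raw_label)}_checklist"
```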

Document analysis (summary + topics + sentiment)

A single Gemma call (generateDocumentAnalysis) produces seven structured fields under one JSON schema:

Field	Type	What it is
`title`	string	descriptive title
`description`	string	1-2 sentence overview
`keyTopics`	string[]	3-5 short topic strings
`keyPoints`	string[]	3-6 bullet points of the most important facts
`people`	string[]	person names (LLM NER)
`organizations`	string[]	company/org names (LLM NER)
`sentiment`	object	structured sentiment, see below

Sentiment itself is structured:

{
  "label": "positive | negative | neutral | mixed",
  "confidence": 0.0,
  "riskFlags": ["legal risk", "compliance concern", "..."]
}

The free-form riskFlags array is the design choice I want to highlight. Instead of forcing the model to pick from a fixed risk taxonomy, I let it call out concerning content in its own words, then let the UI surface those flags. That way the model can flag a "supply-chain concentration risk" without needing me to predefine that bucket.

There's a plain-text fallback path: if the schema-constrained call fails, I try a simple "summarize this document in 2-3 sentences" prompt and persist that as the description. I'd rather degrade gracefully than leave a document with no summary at all.

Document review agent

Layered on top of classification, the review agent runs a per-document-type review (checklist items + extracted fields + risk brief), persisted as review_json. This is where domain knowledge gets injected — different rubrics per document type rather than a one-size-fits-all summary. A contract gets a contract checklist; an invoice gets an invoice checklist.

JSON-schema constraints everywhere

Every LLM call I wrote (PII detection, query expansion, reranking, document analysis, review) uses Workers AI's response_format: { type: 'json_schema', json_schema: ... } to force well-typed output.

This matters for three reasons:

No regex-parsing of model output. Either JSON.parse succeeds and the schema validates, or I fall back to a simpler path.
Failure modes are explicit. A schema validation failure is a clear log line, not a silent quality regression.
Prompts stay shorter. The schema documents the contract, so the system prompt doesn't need to repeat "respond with JSON in the format…".
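The "parse and validate, or fall back" pattern looks roughly like this in Python. It's a sketch with a hand-rolled type check standing in for real schema validation, and the helper name is hypothetical.

```python
import json
from typing import Optional


def parse_structured(raw: Optional[str], required: dict[str, type]) -> Optional[dict]:
    """Return the parsed object if it has every required key with the right type, else None."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # None input or malformed JSON
        return None
    if not isinstance(data, dict):
        return None
    for key, expected in required.items():
        if not isinstance(data.get(key), expected):
            return None
    return data
```

A `None` result is the signal to take the simpler path, such as the plain-text summary fallback or the original query only, and to log that the fallback was used.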

Where it all lands

Everything above writes to columns on the documents row:

Column	Source
`summary`	`generateDocumentAnalysis` (title + description + key topics + key points)
`pii_redacted`, `pii_types`, `pii_count`	regex + LLM PII passes
`entities`	regex NER + LLM NER, merged
`sentiment_label`, `sentiment_json`	`generateDocumentAnalysis`
`review_json`	`runDocumentReview` (classifier + per-type rubric)
`processing_stage`	DO updates as the pipeline progresses (`scanning_pii`, `embedding`, `analyzing`, `finalizing`)

The frontend reads these straight off the documents API and renders them as panels on the document detail page, so the NLP work is visible to users as soon as ingest finishes.

What's next

Hybrid retrieval (BM25 alongside vectors) is the obvious next move — at the 0.3 BGE floor I'm sometimes missing exact-keyword queries.
Caching contextual headers per (doc, paragraph hash) would cut ingest latency on large docs significantly.
Dedicated cross-encoder reranker (e.g. bge-reranker-base) would be cheaper and likely better-calibrated than reusing Gemma 4.
A retrieval eval harness — even a small hand-labeled set of 50 query/relevant-chunk pairs — would let me regress these knobs against a number instead of intuition.

If you're working on RAG, edge AI, or PIPEDA-aware document handling, I'd love to compare notes — drop a comment or open a thread on caseonix.ca.

Live app: localmind.caseonix.ca

Extracting T4 Data from PDFs in Python — A Canadian Developer's Guide

SKasagar — Sat, 11 Apr 2026 01:46:00 +0000

Cross-posted from caseonix.ca

Every Canadian fintech team eventually hits this problem. Users upload their T4 slips. Your backend gets a PDF. Somewhere between that PDF and your database you need to pull out box 14, box 22, the SIN, the employer name — correctly, reliably, across documents from dozens of different payroll software vendors.

The obvious tools get recommended: AWS Textract, LlamaParse, pdfplumber, PyMuPDF. They're good at what they do. But none of them know what a T4 is. They don't know that box 14 is employment income, that box 22 is income tax deducted, that a nine-digit formatted number is a Social Insurance Number, or that CRA publishes an XML specification for this document every year. They hand you text. The domain knowledge you write yourself.

That ends up being more work than people expect. I've seen it written three or four different ways at different companies, none with tests, none with audit trails, all slightly wrong at the edges. This is the guide I wish existed before I started.

What's a T4? For non-Canadian readers: a T4 (Statement of Remuneration Paid) is the Canadian equivalent of a US W-2. Every employer issues one annually to report employment income, CPP contributions, EI premiums, and income tax withheld. It's one of the most common documents in Canadian fintech, mortgage underwriting, and tax software.

Why Regex Isn't Enough

The first instinct is regex. T4s are standardized CRA forms — surely field positions are consistent?

import re
import pdfplumber

with pdfplumber.open("t4_2024.pdf") as pdf:
    text = pdf.pages[0].extract_text()

box_14 = re.search(r"14\s+[\$]?([\d,]+\.?\d*)", text)
if box_14:
    income = float(box_14.group(1).replace(",", ""))

This works on the T4 you tested it on. It breaks on the next one because a different payroll vendor laid the PDF out differently, the box number is on a different line than the value, or the document is a scanned image with no text layer.

Regex extraction of financial documents is essentially a parser that only works on documents you've already seen. Every new employer format becomes a special case. The maintenance cost compounds.
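To see the failure concretely, here is the same box 14 pattern run against three made-up text layouts (illustrative strings, not real T4 extractions):

```python
import re
from typing import Optional

BOX_14 = re.compile(r"14\s+[\$]?([\d,]+\.?\d*)")


def parse_box_14(text: str) -> Optional[float]:
    match = BOX_14.search(text)
    return float(match.group(1).replace(",", "")) if match else None


# Layout A: box number and value on the same line
print(parse_box_14("14 87,500.00\n22 21,340.00"))       # 87500.0
# Layout B: a label sits between the box number and the value
print(parse_box_14("14 Employment income\n87,500.00"))   # None
# Layout C: box numbers on one line, values on the next
print(parse_box_14("14 22\n87,500.00 21,340.00"))        # 22.0
```

Layout B at least fails loudly. Layout C is the dangerous one: the pattern picks up the neighbouring box number and reports 22.0 as employment income.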

Step 1 — Get Clean Text with Docling

Docling is IBM's open-source document intelligence toolkit. It handles PDF text extraction, layout analysis, table recognition, and OCR fallback. Runs entirely locally, no API keys, MIT licensed.

pip install docling

Then convert any PDF:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("t4_2024.pdf")

# Clean markdown-formatted text, reading order preserved
text = result.document.export_to_markdown()
print(text)

What comes out is structured text with layout preserved. Docling understands the difference between a table cell and a paragraph. It handles scanned documents through an OCR pipeline and correctly orders multi-column layouts. For T4 PDFs — which vary significantly between payroll vendors — the output quality is consistent in a way raw pdfplumber isn't.

First-run note: Docling downloads its layout models from HuggingFace on first run (~500MB). This is expected — models are cached locally after that. For production, pre-pull them during your Docker build step.

Docling gives you clean text. It still doesn't know what box 14 means. That's the next layer.

Step 2 — Extract Fields with pydantic-ai

Once you have clean text, you need to pull out specific typed fields reliably. The right tool for this today is an LLM with structured output — you give it a Pydantic model and it fills it in. pydantic-ai handles this cleanly and is model-agnostic: Claude, OpenAI, and local Ollama all work behind the same interface.

pip install pydantic-ai

Define your T4 model and agent:

from pydantic import BaseModel
from pydantic_ai import Agent

class T4Fields(BaseModel):
    employer_name: str
    tax_year: int
    box_14_employment_income: float
    box_22_income_tax_deducted: float
    box_16_cpp_contributions: float | None = None
    box_18_ei_premiums: float | None = None
    box_52_pension_adjustment: float | None = None
    province_of_employment: str | None = None

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=T4Fields,
    system_prompt="""
    You are extracting fields from a Canadian T4 Statement of Remuneration Paid.
    Return monetary values as plain floats (87500.0, not "$87,500.00").
    Return null for any field not present in the document.
    Province of employment should be a 2-letter code (ON, BC, QC, etc.).
    Do not hallucinate values — if a field is not visible, return null.
    """,
)

result = agent.run_sync(f"Extract T4 fields:\n\n{text}")
fields = result.output

print(fields.box_14_employment_income)   # → 87500.0
print(fields.box_22_income_tax_deducted) # → 21340.0
print(fields.province_of_employment)     # → "ON"

To run fully locally with no external API calls, swap the model string:

agent = Agent(
    "ollama:llama3.2",   # no ANTHROPIC_API_KEY needed
    output_type=T4Fields,
    system_prompt=...,
)

For environments with data residency requirements — most regulated Canadian financial services — that matters. The document never leaves your infrastructure.

Step 3 — The Part Most Implementations Skip

Docling plus pydantic-ai gets you surprisingly far. In testing on T4 PDFs from major Canadian payroll providers, field extraction accuracy sits above 90% on the primary income and tax boxes.

But two things are missing that matter for production use in regulated industries.

Confidence scoring and a review queue

The LLM will be more certain about box 14 (employment income, usually prominent and clearly labeled) than about box 52 (pension adjustment, often blank or formatted inconsistently). If you're pre-filling a tax form with extracted values, you need to know which fields are safe to pass through automatically and which ones need a human to confirm.

Without confidence scores, low-quality extractions silently enter production. That's how incorrect T4 data gets submitted to CRA.
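A review queue can start as a simple threshold split. The Python sketch below assumes each extracted field already arrives with a confidence between 0 and 1; how that score is produced isn't covered here, and the 0.9 threshold is a placeholder, not a recommendation.

```python
from typing import Optional

AUTO_ACCEPT_THRESHOLD = 0.9  # hypothetical; tune against labeled data


def route_fields(
    extracted: dict[str, tuple[Optional[float], float]],
    threshold: float = AUTO_ACCEPT_THRESHOLD,
) -> tuple[dict[str, Optional[float]], list[str]]:
    """Split fields into auto-accepted values and names that need human review."""
    accepted: dict[str, Optional[float]] = {}
    needs_review: list[str] = []
    for name, (value, confidence) in extracted.items():
        if confidence >= threshold:
            accepted[name] = value
        else:
            needs_review.append(name)
    return accepted, needs_review
```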

PII handling and an audit trail

A T4 contains a Social Insurance Number. Before you send that document text to any external API, you should know what PII is in it. Canada's PIPEDA requires that organizations limit the collection, use, and disclosure of personal information to what's necessary for the identified purpose — sending a full T4 text to a US-based cloud LLM for extraction is hard to defend under that standard unless you've taken steps to identify and handle the PII.

⚠️ The SIN problem: A Canadian SIN in the format XXX-XXX-XXX is sensitive personal information under PIPEDA. Every T4 contains one. If you're sending raw T4 text to a US-based cloud API without detecting and handling this, you're creating a compliance exposure that most legal teams would not be comfortable with.

Putting It Together: Docling + pydantic-ai + Presidio

Microsoft Presidio is an open-source PII detection and anonymization library. It supports custom recognizers — you can teach it what a Canadian SIN looks like, what a CRA Business Number looks like, and what Canadian postal codes look like. None of these ship in Presidio's defaults.

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Then add the Canadian recognizers and scan:

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()

# Add Canadian SIN recognizer — not in Presidio defaults
sin_recognizer = PatternRecognizer(
    supported_entity="CA_SIN",
    # SINs are written with hyphens, spaces, or no separator at all
    patterns=[Pattern("CA_SIN", r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b", score=0.9)],
    context=["sin", "social insurance"],
)
analyzer.registry.add_recognizer(sin_recognizer)

# Scan before sending to LLM
results = analyzer.analyze(text=document_text, language="en")
pii_found = [{"entity_type": r.entity_type, "score": r.score} for r in results]

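# Optionally redact before the LLM call. Entity types without an explicit
# operator fall back to Presidio's default replacement, e.g. <PERSON>.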
anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(
    text=document_text,
    analyzer_results=results,
    operators={"CA_SIN": OperatorConfig("replace", {"new_value": "***-***-***"})},
)
# Send redacted.text to the LLM instead

Now you know what PII was in the document before extraction ran, you have a record of it, and you can choose whether to redact before the LLM call.
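
A pattern match alone will flag any nine-digit group in that shape. Canadian SINs carry a Luhn check digit, so a checksum test is one way to cut false positives before you record a detection. The sketch below is standalone Python; `is_valid_sin` is a hypothetical helper rather than a Presidio API, and the two sample numbers are illustrative (the first is a commonly cited fictitious SIN that passes the checksum):

```python
import re


def is_valid_sin(candidate: str) -> bool:
    """Return True if the candidate has nine digits and passes the Luhn check."""
    stripped = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"\d{9}", stripped):
        return False
    total = 0
    for index, ch in enumerate(reversed(stripped)):
        digit = int(ch)
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(is_valid_sin("046 454 286"))  # → True
print(is_valid_sin("123-456-789"))  # → False
```

If you stay inside Presidio, its pattern recognizers offer a validation hook where a check like this could live; I'd confirm the exact method signature against the Presidio docs before relying on it.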

The Full Stack in One Place: FinLit

Wiring Docling, pydantic-ai, Presidio, confidence scoring, audit logging, and CRA-specific schemas together is the kind of plumbing every team building on Canadian documents ends up writing. I built FinLit — an open-source Python library that does exactly this, with pre-built YAML schemas for T4, T5, T4A, NR4, and Canadian bank statements.

pip install finlit
python -m spacy download en_core_web_lg

Then run the pipeline:

from finlit import DocumentPipeline, schemas

pipeline = DocumentPipeline(
    schema=schemas.CRA_T4,
    extractor="claude",      # or "openai" or "ollama"
    audit=True,
    pii_redact=False,        # set True to redact SINs in audit log output
    review_threshold=0.85,
)

result = pipeline.run("john_doe_t4_2024.pdf")

The result object has everything:

# Typed, validated fields — monetary values are always float
result.fields["box_14_employment_income"]      # → 87500.0
result.fields["box_22_income_tax_deducted"]    # → 21340.0
result.fields["province_of_employment"]        # → "ON"

# Per-field confidence — box 52 came back uncertain
result.confidence["box_52_pension_adjustment"] # → 0.71

# Fields below the 0.85 threshold go here instead of silently through
result.needs_review    # → True
result.review_fields
# [{"field": "box_52_pension_adjustment", "confidence": 0.71, "raw": "4,200.00"}]

# Trace any extracted value back to its page and location
result.source_ref["box_14_employment_income"]
# {"page": 1, "bbox": [120, 340, 280, 360], "doc": "john_doe_t4_2024.pdf"}

# Immutable audit log — every event from load to completion
result.audit_log
# [
#   {"event": "document_loaded",     "sha256": "abc...", "ts": "..."},
#   {"event": "pii_detected",        "count": 1, "entities": ["CA_SIN"], "ts": "..."},
#   {"event": "extraction_complete", "fields_returned": 13, "ts": "..."},
#   {"event": "review_flagged",      "count": 1, "ts": "..."},
#   {"event": "pipeline_complete",   "fields_extracted": 13, "ts": "..."}
# ]

# Raw PII detections on the source document (Presidio output)
result.pii_entities
# [{"entity_type": "CA_SIN", "score": 0.9, "start": 142, "end": 153}]

For batch processing — say, a payroll integrator running hundreds of T4s at year-end:

from finlit import BatchPipeline, schemas
from glob import glob

batch = BatchPipeline(schema=schemas.CRA_T4, extractor="ollama", workers=8)

for path in glob("uploads/*.pdf"):
    batch.add(path)

results = batch.run()
results.export_csv("extracted/t4s_2024.csv")

print(f"Processed:    {results.total}")
print(f"Needs review: {results.review_count}")

With extractor="ollama" and the Ollama server running on your own hardware, no document leaves your infrastructure. The pipeline runs on-premises, which takes the PIPEDA third-party disclosure question off the table.

Build vs Buy vs Open-Source

Approach	Time to first extraction	Canadian schemas	Audit trail	Data residency	Cost
Regex + pdfplumber	Hours	You write them	None	On-prem	Free
AWS Textract	Hours	None	Partial	US-operated; Canadian region available	From $1.50/1000 pages
LlamaParse	Minutes	None	None	US SaaS	$3–$10/1000 pages
Docling alone	Hours	You write them	None	On-prem	Free
FinLit	Minutes	T4, T5, T4A, NR4, bank statements	Built in	On-prem or cloud	Free + LLM costs

What the Schema YAML Looks Like

Every built-in schema in FinLit is a versioned YAML file. The T4 schema maps directly to CRA's published XML specification. Here's a simplified excerpt:

name: cra_t4
version: "2024"
document_type: "CRA T4 Statement of Remuneration Paid"

fields:
  - name: box_14_employment_income
    dtype: float
    required: true
    description: "Box 14: Total employment income before deductions"

  - name: employee_sin
    dtype: str
    required: true
    pii: true
    regex: '^\d{3}-\d{3}-\d{3}$'
    description: "Employee's Social Insurance Number"

  - name: province_of_employment
    dtype: str
    required: false
    description: "Province or territory of employment (2-letter code)"

The pii: true flag tells the pipeline this field is sensitive — it gets flagged in the audit log and can be redacted depending on your pii_redact configuration. The regex field enforces format validation after extraction, so a malformed SIN raises a validation error rather than silently passing through.
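
To illustrate the post-extraction check, here is a rough standalone sketch of regex validation driven by a schema entry. It is not FinLit's implementation: the dict mirrors the employee_sin entry from the YAML excerpt, and `validate_field` is a hypothetical helper.

```python
import re

# Mirrors the employee_sin entry from the YAML excerpt, as a plain dict.
SIN_FIELD = {
    "name": "employee_sin",
    "dtype": "str",
    "required": True,
    "pii": True,
    "regex": r"^\d{3}-\d{3}-\d{3}$",
}


def validate_field(spec: dict, value):
    """Raise ValueError if the value violates the schema entry; otherwise return it."""
    if value is None:
        if spec.get("required"):
            raise ValueError(f"{spec['name']}: required field is missing")
        return None
    pattern = spec.get("regex")
    if pattern and not re.fullmatch(pattern, str(value)):
        # The offending value is left out of the message on purpose: it may be PII.
        raise ValueError(f"{spec['name']}: value does not match the schema regex")
    return value


print(validate_field(SIN_FIELD, "123-456-789"))
try:
    validate_field(SIN_FIELD, "123456789")
except ValueError as err:
    print(err)
```

Keeping the raw value out of the error message matters here, since validation errors tend to end up in logs.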

Adding a new schema for a document type that isn't in the registry yet takes about 20 minutes if you know the document. Schema contributions are the highest-value PRs the project gets.

Practical Notes from Building This

A few things that aren't obvious until you've processed a few thousand real T4s:

Scanned T4s are common. Many smaller employers still print and scan. Docling's OCR pipeline handles these, but accuracy drops — budget for a higher review threshold (0.90 vs 0.85) on scanned documents.
Box 52 (pension adjustment) is almost always uncertain. It's blank for most employees, optionally present for others, and formatted inconsistently across payroll vendors. Flag it for review at any confidence below 0.95 if your use case relies on it.
Quebec employees come with a second slip. Alongside the T4, employers issue an RL-1 (Relevé 1) for Revenu Québec, and it carries provincial tax information that a standard T4 schema doesn't cover. If you're processing documents from Quebec employees, you'll want a separate RL-1 schema.
CRA updates its XML specification annually. Field names and codes are stable, but new boxes get added. Pin your schema version and test against new documents at the start of each tax year.
Multi-page T4s exist. Most T4s are single-page, but amended T4s can span two pages. Docling handles this correctly; regex approaches often don't.

The Short Version

Use Docling for parsing, pydantic-ai for field extraction, Presidio for PII detection, and either wire it together yourself or use FinLit to skip the plumbing. Run it locally with Ollama if you can't send documents to a cloud API. Build an audit log from the start — retrofitting one later is painful.

Building PIPEDA-Compliant AI Tools on Cloudflare Workers — A Developer's Guide

SKasagar — Tue, 07 Apr 2026 20:02:00 +0000

Canada still runs on PIPEDA, Bill C-27 died on the Order Paper, and the CLOUD Act didn't go anywhere. Here's what that actually means if you're building AI tools for the Canadian market in 2026 — and how to ship them without a compliance incident.

The Regulatory Landscape: What Actually Applies in 2026

If you've been waiting for Ottawa to sort out AI regulation, you'll be waiting a while longer. Bill C-27 — which would have introduced the Consumer Privacy Protection Act (CPPA) and the Artificial Intelligence and Data Act (AIDA) — died when Parliament was prorogued in January 2025. A snap federal election in April 2025 pushed reform further down the road. As of April 2026, Canada has no federal AI-specific legislation.

I spent 25 years in financial services before starting to build AI tools for this market. The compliance landscape isn't new to me — but the gap between what AI vendors promise and what Canadian regulations actually require was wide enough to build a company in.

The lack of federal AI legislation doesn't mean you're operating in a vacuum. Three frameworks define your compliance obligations right now:

PIPEDA (federal) — Canada's Personal Information Protection and Electronic Documents Act, written in 2000 but still the law. It requires meaningful consent, accountability for data in the hands of third parties, and "comparable protection" for cross-border transfers.
Quebec's Law 25 (provincial) — Fully enforced since September 2024 and significantly stricter than PIPEDA. Requires explicit consent for automated decision-making, mandatory Privacy Impact Assessments for high-risk AI, and penalties up to C$25M or 4% of global revenue.
OSFI B-13 (sector-specific) — If you serve federally regulated financial institutions, OSFI's Technology and Cyber Risk Management guideline sets expectations for technology and third-party risk management that extend to AI service providers.

Most builders now align with Quebec Law 25 as their baseline — it's the strictest Canadian framework, and if you comply with it, you effectively comply with PIPEDA too. If you serve financial institutions, layer OSFI B-13 on top.

The CLOUD Act Problem Nobody Wants to Talk About

Here's the uncomfortable truth about "Canadian data residency" in 2026: storing data in a Canadian data centre run by a US company does not protect it from US government access.

The US CLOUD Act (Clarifying Lawful Overseas Use of Data Act) gives American authorities the power to compel US-headquartered companies to hand over data regardless of where that data is physically stored. This means AWS Canada Central in Montreal, Azure Canada East in Quebec City, and Google Cloud's Montreal region are all subject to US legal orders — even though the bits never leave Canadian soil.

For most consumer applications, this is a theoretical risk. But for legal firms handling privileged documents, financial institutions under OSFI oversight, healthcare organizations subject to PHIPA, or government contractors — it's a real compliance problem that auditors are increasingly asking about.

What this means for your architecture

You have three tiers of Canadian data residency, and they offer very different levels of protection:

Tier	What It Means	CLOUD Act Exposure	Examples
1. Canadian-operated infrastructure	Data processed by a Canadian-incorporated company on Canadian servers	None	ThinkOn/Hypertec sovereign cloud, TELUS/OpenText sovereign cloud, Bell/SAP sovereign cloud
2. US hyperscaler, Canadian region	Data in Canada, but operator is US-incorporated	Yes — compellable by US legal order	AWS ca-central-1, Azure Canada East, GCP Montreal
3. US processing	Data leaves Canada entirely	Full exposure	ChatGPT, Copilot (most configurations), Gemini

For most regulated use cases, Tier 2 is the pragmatic minimum — it satisfies PIPEDA's "comparable protection" standard and is what most organizations document in their PIAs. Tier 1 is where you go when the threat model specifically includes foreign government access to data, which is increasingly the case in defence, government, and privileged legal work.

Five Design Principles for Compliance-First AI

After building LocalMind, a sovereign document intelligence platform for the Canadian market, I've arrived at five architectural principles that make compliance a design spec rather than an afterthought.

1. Pin computation to geography

Don't just store data in Canada — process it there too. Every API call to a US-hosted LLM is a cross-border transfer under PIPEDA. Cloudflare Workers run at the edge and can be pinned to Canadian data centres using Custom Regions (launched March 2026). Workers AI provides embedding models that execute on-region. For LLM inference, route through an AI Gateway with jurisdiction controls.

How I built LocalMind: All TLS termination, embedding generation, vector search, and document processing run on Cloudflare's Canadian edge. LLM calls route through AI Gateway with Canadian jurisdiction pinning. The result: sub-5ms cold starts and no processing on US soil. Cloudflare is still US-incorporated, so the CLOUD Act exposure described above stays in the risk register as a documented risk.

2. Detect and redact PII before it hits the model

The simplest way to reduce your compliance surface is to never send personal information to the LLM in the first place. Build a PII detection layer that runs before any AI processing:

Pattern matching for structured PII: SINs (Canadian Social Insurance Numbers), credit card numbers, health card IDs, phone numbers, email addresses
Named Entity Recognition for unstructured PII: names, addresses, dates of birth
Redaction options: replace with tokens ([PERSON_1]), mask partially (***-***-123), or strip entirely

This isn't just good compliance hygiene — it also reduces hallucination risk, because the model isn't distracted by personal details that are irrelevant to the analysis.
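
A small sketch of the three redaction options, applied to pattern-matched SINs only. The regex and the `redact_sins` helper are my own illustrative assumptions; a real layer would cover the other PII types listed above and pair this with NER for names and addresses.

```python
import re

# Nine digits in groups of three, separated by hyphens, spaces, or nothing.
SIN_PATTERN = re.compile(r"\b(\d{3})[- ]?(\d{3})[- ]?(\d{3})\b")


def redact_sins(text: str, mode: str = "token") -> str:
    """Redact SIN-shaped numbers using one of: token, mask, strip."""
    if mode not in {"token", "mask", "strip"}:
        raise ValueError(f"unknown redaction mode: {mode}")
    seen = {}  # the same SIN always maps to the same token

    def replace(match):
        if mode == "mask":
            return f"***-***-{match.group(3)}"
        if mode == "strip":
            return ""
        key = "".join(match.groups())
        if key not in seen:
            seen[key] = f"[SIN_{len(seen) + 1}]"
        return seen[key]

    return SIN_PATTERN.sub(replace, text)


sample = "SIN 123-456-789 and 123 456 789 and 987-654-321"
print(redact_sins(sample, "token"))  # → SIN [SIN_1] and [SIN_1] and [SIN_2]
print(redact_sins(sample, "mask"))
```

Stable tokens keep the text useful to the model (it can still tell two people apart) without exposing the underlying value.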

3. Log everything, explain everything

Quebec's Law 25, Section 12.1 requires you to explain automated decisions to affected individuals. PIPEDA's accountability principle (Principle 1) makes you responsible for data in the hands of third-party processors. Both of these demand audit trails.

At minimum, log:

What data was sent to which AI model, and when
What the model returned
What decision was made based on that output
What PII was detected and how it was handled
Which user or process initiated the request

Store these logs in the same jurisdiction as the data itself. If your compute is in Canada but your logs are in Datadog's US region, you've created a cross-border transfer that undermines the whole architecture.
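
One way to structure those records, sketched with the standard library only. The field names follow the list above; hashing the prompt and response instead of storing them inline, and chaining each entry to the previous one so edits are detectable, are my own design assumptions rather than anything a regulation prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def append_audit_event(log, *, actor, model, prompt, response, decision, pii_entities):
    """Append one AI interaction to the log, chained to the previous entry's hash."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                         # user or process that initiated the call
        "model": model,                         # which model received the data
        "input_sha256": sha256_hex(prompt),     # what was sent (by hash)
        "output_sha256": sha256_hex(response),  # what came back (by hash)
        "decision": decision,                   # what was decided from the output
        "pii_entities": sorted(pii_entities),   # what PII was detected
        "prev_hash": log[-1]["entry_hash"] if log else None,
    }
    entry["entry_hash"] = sha256_hex(json.dumps(entry, sort_keys=True))
    log.append(entry)
    return entry


def verify_chain(log) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = None
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev:
            return False
        if entry["entry_hash"] != sha256_hex(json.dumps(body, sort_keys=True)):
            return False
        prev = entry["entry_hash"]
    return True
```

If you need the raw payloads for explainability, store them separately under the same residency rules and keep only the hashes in the log, so the log does not become a second copy of the personal data.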

4. Build for human-in-the-loop

Law 25 requires that individuals can request human review of automated decisions. PIPEDA's accuracy principle (Principle 6) means AI-generated conclusions need to be challengeable. Build this into the product from day one:

Every AI-generated finding should cite its source document and passage
Users should be able to override, dismiss, or escalate any automated assessment
Confidence scores should be visible, not hidden behind a clean UI
Critical decisions (compliance pass/fail, risk ratings) should require explicit human confirmation
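
As a rough data-model sketch of those four points: every finding carries its citation and confidence, and critical findings cannot be accepted without a named reviewer. The `Finding` type and both helpers are hypothetical names for illustration, not LocalMind's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_ACTIONS = {"accepted", "dismissed", "escalated"}


@dataclass
class Finding:
    summary: str
    source_doc: str           # citation: which document
    passage: str              # citation: which passage
    confidence: float         # shown to the user, not hidden
    critical: bool = False    # e.g. compliance pass/fail, risk ratings
    status: str = "pending"
    reviewed_by: Optional[str] = None


def auto_accept(finding: Finding) -> bool:
    """Accept non-critical findings automatically; critical ones wait for a person."""
    if finding.critical:
        return False
    finding.status = "accepted"
    return True


def review(finding: Finding, reviewer: str, action: str) -> Finding:
    """Record a human override, dismissal, or escalation."""
    if action not in REVIEW_ACTIONS:
        raise ValueError(f"unknown review action: {action}")
    finding.status = action
    finding.reviewed_by = reviewer
    return finding
```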

5. Isolate tenants at the data layer

Multi-tenant AI systems need strict namespace isolation. When Organization A uploads a contract, Organization B's vector search must never surface it — even if the embeddings are mathematically similar. Use per-tenant namespaces in your vector database, per-tenant encryption keys if possible, and never co-mingle document chunks across organizational boundaries.
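
The isolation property is easy to state in code. Below is a toy in-memory stand-in, assuming a namespace key per organization; it is not the Vectorize API, and `NamespacedVectorStore` is a hypothetical class. The point is that the query path never scans another tenant's vectors, however similar they are.

```python
import math
from collections import defaultdict


class NamespacedVectorStore:
    """Toy store: every read and write is scoped to one tenant namespace."""

    def __init__(self):
        self._vectors = defaultdict(list)  # namespace -> [(doc_id, vector)]

    def upsert(self, namespace, doc_id, vector):
        self._vectors[namespace].append((doc_id, vector))

    def query(self, namespace, vector, top_k=3):
        # Only the caller's namespace is scanned; other tenants are never candidates.
        candidates = self._vectors.get(namespace, [])
        scored = [(self._cosine(vector, v), doc_id) for doc_id, v in candidates]
        return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


store = NamespacedVectorStore()
store.upsert("org_a", "contract", [1.0, 0.0])
store.upsert("org_b", "memo", [0.9, 0.1])
print(store.query("org_b", [1.0, 0.0]))  # → ['memo']
```

A test like the last two lines, run against the real index with near-identical embeddings in two tenants, is a cheap way to back the "tenant isolation tested" item in a compliance review.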

Canadian Infrastructure Options in 2026

The Canadian AI infrastructure landscape has expanded significantly. Here's what's actually available for builders:

Provider	Canadian AI Services	Sovereignty Level	Best For
Cloudflare	Workers AI (embeddings, inference), Vectorize, D1, R2, Custom Regions for Canada	US-incorporated, but Custom Regions pin processing to Canadian PoPs	Edge-first apps, document processing, low-latency AI
AWS Canada	Bedrock (foundation models), SageMaker, ca-central-1 and ca-west-1	Tier 2 (US-incorporated)	Enterprise workloads, teams already on AWS
Azure Canada	Azure OpenAI (Canada East), Azure ML, Copilot with in-country processing (2026)	Tier 2 (US-incorporated)	Microsoft shops, government (with caveats)
ThinkOn/Hypertec/Aptum	Sovereign government cloud (launched Oct 2025)	Tier 1 (Canadian-incorporated)	Federal/provincial government, defence
TELUS/OpenText	Sovereign cloud (launched Jul 2025)	Tier 1 (Canadian-incorporated)	Regulated industries, healthcare
Bell/SAP	Sovereign cloud (launched Feb 2026)	Tier 1 (Canadian-controlled)	Enterprise ERP with sovereign AI

A Compliance Checklist for Shipping

Before you launch an AI tool for the Canadian market, run through this:

[ ] Data residency documented: You can state exactly where data is stored and processed, and which jurisdictions apply to your providers.
[ ] PII detection in place: Personal information is identified and handled (redacted, masked, or consented) before AI processing.
[ ] Consent is meaningful: Users understand, in plain language, that AI will process their information and how.
[ ] Automated decisions are explainable: Every AI output cites its source, and users can request human review.
[ ] Audit trail exists: Every AI interaction is logged — input, output, model used, timestamp, user — and logs are stored in the same jurisdiction as the data.
[ ] Privacy Impact Assessment completed: Required by Law 25 for high-risk AI; good practice everywhere.
[ ] Cross-border transfers documented: If any data leaves Canada (including for LLM inference), you've documented the legal basis and safeguards.
[ ] Tenant isolation tested: Multi-tenant systems have been tested to confirm no cross-tenant data leakage in search, retrieval, or AI outputs.
[ ] Third-party risk assessed: You've evaluated your AI providers' CLOUD Act exposure and documented it in your risk register.
[ ] Breach response plan includes AI: Your incident response plan covers scenarios where AI-processed data is compromised.

I built LocalMind with compliance as a design constraint — the same way you'd treat latency or uptime. The regulatory landscape will catch up eventually. The question is whether your architecture is ready when it does.