Forem: Thesius Code

AI Agent Framework

Thesius Code — Mon, 23 Mar 2026 15:14:20 +0000

AI Agent Framework

Build production-grade multi-agent systems that can reason, plan, use tools, and collaborate. This framework provides battle-tested patterns for orchestrating LLM-powered agents — from simple single-agent tool callers to complex multi-agent workflows with human oversight. Stop reinventing agent infrastructure and focus on your domain logic.

Key Features

Multi-Agent Orchestration — Route tasks between specialized agents with configurable delegation strategies (round-robin, capability-based, auction)
Tool Calling Engine — Register Python functions as tools with automatic schema generation, input validation, and retry logic
Persistent Memory — Short-term conversation buffers and long-term vector-backed memory with configurable retention policies
Planning Loops — ReAct, Plan-and-Execute, and Tree-of-Thought planning patterns with step-level observability
Human-in-the-Loop Gates — Configurable approval checkpoints before high-risk actions (API calls, data mutations, external communication)
Structured Output Parsing — Enforce JSON schemas on agent responses with automatic retry on malformed output
Conversation Branching — Fork and merge conversation threads for parallel agent exploration
Execution Tracing — Full trace logs for every agent step, tool call, and decision point

Quick Start

from agent_framework import Agent, ToolRegistry, Orchestrator

# 1. Define tools
registry = ToolRegistry()

@registry.tool(description="Search the knowledge base for relevant documents")
def search_docs(query: str, top_k: int = 5) -> list[dict]:
    # Your retrieval logic here
    return [{"title": "Result", "content": "..."}]

@registry.tool(description="Send a formatted email to a recipient")
def send_email(to: str, subject: str, body: str) -> dict:
    # Requires human approval (configured below)
    return {"status": "sent", "to": to}

# 2. Create agents with different capabilities
researcher = Agent(
    name="researcher",
    system_prompt="You find and synthesize information from the knowledge base.",
    tools=[search_docs],
    model="gpt-4o",
)

writer = Agent(
    name="writer",
    system_prompt="You draft professional emails based on research findings.",
    tools=[send_email],
    model="gpt-4o",
    require_approval=["send_email"],  # Human gate on this tool
)

# 3. Orchestrate
orchestrator = Orchestrator(agents=[researcher, writer])
result = orchestrator.run("Research our Q3 metrics and email a summary to the team.")
print(result.final_output)
print(result.trace.summary())  # Shows all steps, tools called, tokens used

Architecture

┌─────────────────────────────────────────────┐
│                Orchestrator                  │
│  ┌─────────┐  ┌──────────┐  ┌───────────┐  │
│  │ Planner │  │ Router   │  │ Evaluator │  │
│  └────┬────┘  └────┬─────┘  └─────┬─────┘  │
│       │            │              │          │
│  ┌────▼────────────▼──────────────▼─────┐   │
│  │            Agent Pool                │   │
│  │  ┌─────┐  ┌─────┐  ┌─────┐         │   │
│  │  │ A_1 │  │ A_2 │  │ A_N │  ...    │   │
│  │  └──┬──┘  └──┬──┘  └──┬──┘         │   │
│  └─────┼────────┼────────┼─────────────┘   │
│        │        │        │                  │
│  ┌─────▼────────▼────────▼─────────────┐   │
│  │         Shared Services              │   │
│  │  Memory │ Tools │ Approval │ Trace   │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

The Planner decomposes tasks into subtasks. The Router assigns subtasks to the best-fit agent. The Evaluator checks if the final output satisfies the original request and triggers replanning if needed.

Usage Examples

ReAct Planning Loop

from agent_framework import Agent, PlanningStrategy

agent = Agent(
    name="analyst",
    system_prompt="You analyze datasets and produce insights.",
    planning=PlanningStrategy.REACT,  # Thought -> Action -> Observation loop
    max_iterations=10,
)

result = agent.run("What were the top 3 revenue drivers last quarter?")
for step in result.trace.steps:
    print(f"[{step.type}] {step.content[:100]}")

Multi-Agent Delegation

from agent_framework import Orchestrator, DelegationStrategy

orchestrator = Orchestrator(
    agents=[researcher, writer, reviewer],
    strategy=DelegationStrategy.CAPABILITY_BASED,
    max_delegation_depth=3,  # Prevent infinite delegation loops
)

Memory Configuration

from agent_framework.memory import ConversationMemory, VectorMemory

agent = Agent(
    name="assistant",
    memory=[
        ConversationMemory(max_turns=20),           # Recent context
        VectorMemory(collection="long_term", top_k=5),  # Semantic recall
    ],
)

Configuration

# config.yaml
orchestrator:
  max_concurrent_agents: 4
  timeout_seconds: 120
  delegation_strategy: "capability_based"

agents:
  default_model: "gpt-4o"
  temperature: 0.1
  max_tokens: 4096
  retry_on_parse_failure: true
  max_retries: 3

memory:
  backend: "sqlite"            # sqlite | redis | postgres
  conversation_buffer_size: 20
  vector_store: "chromadb"
  embedding_model: "text-embedding-3-small"

approval:
  enabled: true
  timeout_seconds: 300         # Auto-reject after 5 minutes
  notify_channel: "slack"      # slack | email | console
  high_risk_tools:
    - "send_email"
    - "execute_sql"
    - "deploy_service"

tracing:
  enabled: true
  output: "logs/traces/"
  format: "json"               # json | opentelemetry
  include_prompts: false       # Set true only in dev (contains PII)

Best Practices

Start with a single agent — Add multi-agent orchestration only when a single agent demonstrably can't handle the task breadth.
Gate destructive tools — Any tool that mutates state (DB writes, API calls, file deletions) should require human approval in production.
Set iteration limits — Always configure max_iterations on planning loops to prevent runaway token consumption.
Use structured outputs — Define Pydantic models for tool inputs/outputs to catch schema mismatches early.
Log everything in development — Enable full tracing with include_prompts: true during development, disable in production.
Test with deterministic seeds — Use temperature: 0 and fixed seeds during testing for reproducible agent behavior.
Monitor token budgets — Set per-agent and per-orchestration token limits to avoid surprise costs during complex planning loops.

Troubleshooting

Problem	Cause	Fix
Agent loops forever without producing output	Missing or too-high `max_iterations`	Set `max_iterations: 10` and add an evaluator to check completion
Tool calls fail with schema validation errors	LLM generates malformed JSON arguments	Enable `retry_on_parse_failure: true` and add examples to tool descriptions
Multi-agent tasks produce contradictory results	Agents lack shared context	Use shared `VectorMemory` or pass the orchestrator's scratchpad between agents
Human approval gate times out	`timeout_seconds` too low or notification missed	Increase timeout, add fallback notification channel, check Slack/email integration

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [AI Agent Framework] with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

AI API Gateway

Thesius Code — Mon, 23 Mar 2026 15:14:15 +0000

AI API Gateway

Stop hard-coding provider-specific API calls throughout your codebase. This gateway gives you a single unified interface to OpenAI, Anthropic, Google, Mistral, and local models — with automatic fallback routing, response caching, rate limiting, and real-time usage analytics. Switch providers, manage costs, and add resilience without changing a single line of application code.

Key Features

Unified API Interface — One consistent request/response format across OpenAI, Anthropic, Google Gemini, Mistral, and Ollama
Automatic Fallback Routing — Define provider priority chains; if the primary provider fails or hits rate limits, requests route to the next provider seamlessly
Response Caching — Cache identical prompts with configurable TTL to slash costs on repeated queries (Redis or in-memory)
Rate Limiting — Per-user, per-model, and global rate limits with token bucket algorithm
Usage Analytics Dashboard — Track tokens, latency, cost, and error rates per provider/model/user in real time
Request/Response Middleware — Plug in custom transforms (PII scrubbing, logging, prompt injection detection) as middleware
Streaming Support — Full SSE streaming passthrough with provider-agnostic event format
API Key Rotation — Rotate provider API keys without downtime via hot-reload configuration

Quick Start

from ai_gateway import Gateway, Provider

# 1. Configure providers
gateway = Gateway(
    providers=[
        Provider(
            name="openai",
            api_key="YOUR_OPENAI_KEY_HERE",
            models=["gpt-4o", "gpt-4o-mini"],
            priority=1,
        ),
        Provider(
            name="anthropic",
            api_key="YOUR_ANTHROPIC_KEY_HERE",
            models=["claude-sonnet-4-20250514"],
            priority=2,  # Fallback when OpenAI is unavailable
        ),
    ],
    cache_backend="redis",
    cache_ttl=3600,
)

# 2. Make requests — same interface regardless of provider
response = gateway.chat(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}],
    max_tokens=200,
)
print(response.content)
print(f"Provider: {response.provider}, Cost: ${response.cost:.4f}")

Architecture

Client Request
      │
      ▼
┌──────────────┐
│  Rate Limiter │──── Reject (429) if over limit
└──────┬───────┘
       ▼
┌──────────────┐
│  Cache Check  │──── Return cached response if hit
└──────┬───────┘
       ▼
┌──────────────┐
│  Middleware   │──── Pre-process (PII scrub, logging, validation)
│  Pipeline     │
└──────┬───────┘
       ▼
┌──────────────┐     ┌──────────┐
│  Router      │────▶│Provider A│──── Success ──▶ Response
│              │     └──────────┘
│              │──── Failure ────▶┌──────────┐
│              │                  │Provider B│──── Fallback
│              │                  └──────────┘
└──────┬───────┘
       ▼
┌──────────────┐
│  Analytics   │──── Log tokens, latency, cost, errors
└──────────────┘

Usage Examples

Fallback Routing with Cost Controls

from ai_gateway import Gateway, RoutingPolicy

gateway = Gateway(
    routing=RoutingPolicy(
        primary="openai/gpt-4o",
        fallbacks=["anthropic/claude-sonnet-4-20250514", "mistral/mistral-large"],
        fallback_on=["rate_limit", "timeout", "server_error"],
        max_cost_per_request=0.05,  # Skip expensive models if budget exceeded
    )
)

Streaming Responses

for chunk in gateway.chat_stream(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
):
    print(chunk.delta, end="", flush=True)

Custom Middleware

from ai_gateway.middleware import Middleware

class PIIScrubber(Middleware):
    """Remove PII from prompts before sending to providers."""

    def pre_request(self, request):
        for msg in request.messages:
            msg["content"] = self.redact_emails(msg["content"])
            msg["content"] = self.redact_phones(msg["content"])
        return request

gateway.add_middleware(PIIScrubber())

Usage Analytics

stats = gateway.analytics.summary(period="24h")
print(f"Total requests: {stats.total_requests}")
print(f"Total cost: ${stats.total_cost:.2f}")
print(f"Avg latency: {stats.avg_latency_ms:.0f}ms")
print(f"Cache hit rate: {stats.cache_hit_rate:.1%}")
for provider in stats.by_provider:
    print(f"  {provider.name}: {provider.requests} reqs, {provider.error_rate:.1%} errors")

Configuration

# gateway_config.yaml
providers:
  openai:
    api_key: "${OPENAI_API_KEY}"
    base_url: "https://api.example.com/v1/"
    models: ["gpt-4o", "gpt-4o-mini"]
    timeout_seconds: 30
    max_retries: 2
    priority: 1

  anthropic:
    api_key: "${ANTHROPIC_API_KEY}"
    models: ["claude-sonnet-4-20250514"]
    timeout_seconds: 60
    priority: 2

rate_limiting:
  global_rpm: 1000             # Requests per minute across all users
  per_user_rpm: 60
  per_model_rpm: 500
  algorithm: "token_bucket"

cache:
  backend: "redis"             # redis | memory | disabled
  redis_url: "redis://localhost:6379/0"
  ttl_seconds: 3600
  max_cache_size_mb: 512
  hash_strategy: "content"     # content | content+model | full_request

analytics:
  enabled: true
  storage: "sqlite"            # sqlite | postgres
  retention_days: 90
  dashboard_port: 8080

middleware:
  - pii_scrubber
  - request_logger
  - prompt_injection_detector

Best Practices

Set per-user rate limits — Prevent a single user from exhausting your entire API quota.
Cache aggressively for deterministic queries — If temperature: 0, the same prompt always yields the same result. Cache it.
Use the cheapest model that works — Route simple tasks to gpt-4o-mini and reserve gpt-4o for complex reasoning.
Monitor error rates per provider — A sudden spike in 500s from one provider means your fallback chain is earning its keep.
Rotate API keys on a schedule — Use hot-reload config to rotate keys monthly without gateway restarts.
Test fallback paths — Intentionally disable your primary provider in staging to verify fallback routing works end-to-end.

Troubleshooting

Problem	Cause	Fix
All requests return 429 Too Many Requests	Global rate limit too low for traffic volume	Increase `global_rpm` in config or add more provider API keys
Cache never hits despite repeated prompts	Message metadata (timestamps, request IDs) differ between calls	Set `hash_strategy: "content"` to hash only message content
Fallback provider returns format errors	Response schemas differ between providers	Ensure `response_format` normalization is enabled in middleware
Analytics dashboard shows $0 cost	Cost calculation requires model pricing table	Update `pricing.yaml` with current per-token rates for each model

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [AI API Gateway] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

AI Safety & Guardrails Kit

Thesius Code — Mon, 23 Mar 2026 15:14:12 +0000

AI Safety & Guardrails Kit

Deploy LLM-powered features with confidence. This toolkit provides production-ready input/output filtering that catches toxic content, removes PII before it reaches your model, detects hallucinated facts, and enforces your content policies programmatically. Every filter is configurable, auditable, and designed to run with minimal latency in your request pipeline.

Key Features

Input Sanitization — Detect and block prompt injection attacks, jailbreak attempts, and malicious payloads before they reach your LLM
PII Redaction — Automatically detect and mask emails, phone numbers, SSNs, credit cards, and custom patterns in both inputs and outputs
Toxicity Detection — Score content across categories (hate speech, harassment, self-harm, sexual content) with configurable thresholds
Hallucination Detection — Cross-reference LLM outputs against source documents to flag unsupported claims
Content Policy Enforcement — Define custom rules (blocked topics, required disclaimers, output format constraints) as composable policy objects
Audit Logging — Every filter decision is logged with timestamps, scores, and the rule that triggered — essential for compliance
Streaming-Compatible — Filters work on both complete responses and streaming chunks with minimal buffering

Quick Start

from safety_guardrails import GuardrailPipeline, filters

# Build a pipeline of filters
pipeline = GuardrailPipeline([
    filters.PromptInjectionDetector(threshold=0.85),
    filters.PIIRedactor(
        entities=["email", "phone", "ssn", "credit_card"],
        action="mask",  # mask | remove | hash
    ),
    filters.ToxicityFilter(
        max_score=0.7,
        categories=["hate", "harassment", "self_harm"],
    ),
    filters.ContentPolicy.from_yaml("policies/company_policy.yaml"),
])

# Filter input before sending to LLM
user_input = "My email is user@example.com and I need help with..."
safe_input = pipeline.filter_input(user_input)
print(safe_input.text)
# "My email is [EMAIL_REDACTED] and I need help with..."
print(safe_input.redacted_entities)
# [{"type": "email", "original": "user@example.com", "position": [12, 28]}]

# Filter output before returning to user
llm_output = "According to our records, your SSN is 123-45-6789..."
safe_output = pipeline.filter_output(llm_output)
print(safe_output.blocked)  # True if any filter triggered a block
print(safe_output.text)     # Sanitized text
print(safe_output.audit_log)  # Full decision trace

Architecture

User Input                              LLM Output
    │                                       │
    ▼                                       ▼
┌──────────────┐                    ┌──────────────┐
│ Input Filter │                    │Output Filter │
│   Pipeline   │                    │  Pipeline    │
│              │                    │              │
│ 1. Injection │                    │ 1. PII       │
│ 2. PII       │                    │ 2. Toxicity  │
│ 3. Toxicity  │                    │ 3. Hallucin. │
│ 4. Policy    │                    │ 4. Policy    │
└──────┬───────┘                    └──────┬───────┘
       │                                   │
       ▼                                   ▼
   Safe Input ──────▶ LLM ──────▶ Raw Output
       │                                   │
       └───────── Audit Log ◀──────────────┘

Each filter returns a FilterResult with: the (possibly modified) text, a boolean blocked flag, a numeric score, and metadata for the audit trail.

Usage Examples

Custom Content Policies

from safety_guardrails import ContentPolicy, PolicyRule

policy = ContentPolicy(rules=[
    PolicyRule(
        name="no_medical_advice",
        pattern=r"(you should take|recommended dosage|diagnos)",
        action="block",
        message="Medical advice is outside our scope. Please consult a doctor.",
    ),
    PolicyRule(
        name="require_disclaimer",
        condition="topic:financial",
        action="append",
        message="\n\n*This is not financial advice. Consult a licensed advisor.*",
    ),
    PolicyRule(
        name="block_competitor_mentions",
        keywords=["CompetitorA", "CompetitorB"],
        action="redact",
    ),
])

Hallucination Detection Against Source Documents

from safety_guardrails.hallucination import HallucinationDetector

detector = HallucinationDetector(method="nli", threshold=0.6)

sources = ["Acme Corp reported $2.1M revenue in Q3 2025."]
output = "Acme Corp reported $5.3M revenue in Q3 2025."

result = detector.check(output=output, sources=sources)
print(result.score)          # 0.92 (high hallucination probability)
print(result.flagged_claims) # ["$5.3M revenue" — contradicts source "$2.1M"]

Streaming Output Filtering

from safety_guardrails import StreamFilter

stream_filter = StreamFilter(pipeline=pipeline, buffer_size=50)

for chunk in llm_stream:
    safe_chunk = stream_filter.process_chunk(chunk)
    if safe_chunk.text:
        yield safe_chunk.text  # Only emits after safety check

Configuration

# guardrails_config.yaml
pii_redaction:
  enabled: true
  entities:
    - email
    - phone
    - ssn
    - credit_card
    - ip_address
  action: "mask"                 # mask | remove | hash
  mask_char: "*"
  custom_patterns:
    employee_id: '\bEMP-\d{6}\b'
    internal_code: '\b[A-Z]{3}-\d{4}\b'

toxicity:
  enabled: true
  model: "local"                 # local | api
  threshold: 0.7
  categories:
    hate: 0.6
    harassment: 0.7
    self_harm: 0.5               # Very strict on self-harm content
    sexual: 0.8

prompt_injection:
  enabled: true
  methods:
    - "heuristic"                # Fast regex-based patterns
    - "classifier"               # ML-based detection
  threshold: 0.85
  block_action: "reject"         # reject | sanitize | flag

hallucination:
  enabled: true
  method: "nli"                  # nli | embedding_similarity
  threshold: 0.6
  require_sources: true

audit:
  enabled: true
  storage: "sqlite"              # sqlite | postgres | file
  retention_days: 365
  log_blocked_content: true
  pii_in_logs: false             # Never log actual PII values

Best Practices

Layer your defenses — Use heuristic injection detection (fast, cheap) AND classifier-based detection (accurate) together.
Set strict thresholds in production, relaxed in dev — Use environment-specific config files.
Audit everything — Regulators and security teams will ask "what did the AI say on date X?" Have the answer ready.
Redact PII before it hits the LLM — Once PII is in the prompt, it may appear in provider logs. Redact on input, not just output.
Test with adversarial inputs — Maintain a red-team prompt set and run it against your guardrails in CI/CD.
Don't block silently — When content is blocked, return a helpful message explaining why and what the user can do instead.

Troubleshooting

Problem	Cause	Fix
PII redactor misses custom ID formats	Pattern not in default entity list	Add custom regex under `custom_patterns` in config
Toxicity filter blocks legitimate medical content	Threshold too aggressive for health domain	Raise category thresholds or add domain-specific allowlists
Prompt injection detector has high false positives	Heuristic rules too broad	Switch to `classifier` method or raise `threshold` to 0.9+
Streaming filter adds noticeable latency	Buffer size too large	Reduce `buffer_size` to 20-30 tokens; accept slightly lower accuracy

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [AI Safety & Guardrails Kit] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

Conversational AI Templates

Thesius Code — Mon, 23 Mar 2026 15:14:08 +0000

Conversational AI Templates

Production-ready chatbot frameworks that handle the hard parts of conversation design — multi-turn context management, intent classification, graceful topic switching, and seamless human escalation. These templates give you a working conversational AI system in minutes, not months. Built for customer support, internal tools, and domain-specific assistants.

Key Features

Multi-Turn Context Management — Sliding window and summary-based context strategies that keep conversations coherent without exploding token costs
Intent Classification — Rule-based and LLM-powered intent detection with confidence scoring and ambiguity resolution
Conversation State Machine — Define conversation flows as state machines with transitions, guards, and side effects
Escalation Workflows — Automatic handoff to human agents when confidence drops, sentiment turns negative, or the user explicitly requests it
Slot Filling — Extract structured data from natural language with validation, confirmation prompts, and partial-fill handling
Persona System — Configure tone, vocabulary, and behavioral constraints to match your brand voice
Conversation Analytics — Track resolution rates, escalation frequency, average turns to resolution, and user satisfaction

Quick Start

from conversational_ai import Chatbot, ConversationConfig, IntentRouter

# 1. Define intents and their handlers
router = IntentRouter()

@router.intent("order_status", examples=[
    "Where is my order?",
    "Track my package",
    "When will my order arrive?",
])
def handle_order_status(context):
    order_id = context.slots.get("order_id")
    if not order_id:
        return context.ask_slot("order_id", "What's your order number?")
    return context.respond(f"Let me look up order {order_id} for you...")

@router.intent("refund_request", examples=[
    "I want a refund",
    "Can I return this item?",
])
def handle_refund(context):
    return context.respond(
        "I can help with that. Let me connect you with our returns team.",
        escalate=True,
        reason="refund_request",
    )

# 2. Create chatbot
bot = Chatbot(
    router=router,
    config=ConversationConfig(
        persona="You are a friendly customer support agent for Acme Corp.",
        context_strategy="sliding_window",
        context_window=10,
        escalation_threshold=0.4,
    ),
)

# 3. Run conversation
session = bot.new_session(user_id="user_123")
response = session.message("Hi, where's my order #ACM-7842?")
print(response.text)        # "Let me look up order ACM-7842 for you..."
print(response.intent)      # "order_status"
print(response.confidence)  # 0.94

Architecture

User Message
      │
      ▼
┌──────────────┐
│  Preprocessor │──── Normalize, spell-check, language detect
└──────┬───────┘
       ▼
┌──────────────┐
│Intent Classif.│──── Match intent + extract entities
└──────┬───────┘
       │
       ├── High confidence ──▶ Route to intent handler
       ├── Low confidence ───▶ Clarification prompt
       └── Very low ─────────▶ Escalation to human
                                      │
Intent Handler                        ▼
       │                      Human Agent Queue
       ▼
┌──────────────┐
│ Slot Filler  │──── Extract & validate required data
└──────┬───────┘
       ▼
┌──────────────┐
│  Response    │──── Apply persona, format, disclaimers
│  Generator   │
└──────────────┘

Usage Examples

Conversation State Machine

from conversational_ai import StateMachine, State, Transition

booking_flow = StateMachine(
    initial="greeting",
    states=[
        State("greeting", prompt="Welcome! Would you like to book an appointment?"),
        State("collect_date", prompt="What date works best for you?"),
        State("collect_time", prompt="And what time?"),
        State("confirm", prompt="I have {date} at {time}. Shall I confirm?"),
        State("done", prompt="All set! You'll receive a confirmation email."),
    ],
    transitions=[
        Transition("greeting", "collect_date", on_intent="affirm"),
        Transition("collect_date", "collect_time", on_slot="date"),
        Transition("collect_time", "confirm", on_slot="time"),
        Transition("confirm", "done", on_intent="affirm"),
        Transition("confirm", "collect_date", on_intent="deny"),
    ],
)

Context Management Strategies

from conversational_ai.context import SlidingWindow, SummarizingContext

# Simple: Keep last N turns
config = ConversationConfig(
    context_strategy=SlidingWindow(max_turns=15),
)

# Advanced: Summarize older turns, keep recent ones verbatim
config = ConversationConfig(
    context_strategy=SummarizingContext(
        recent_turns=5,
        summary_model="gpt-4o-mini",
        summary_interval=10,
    ),
)

Human Escalation with Context Transfer

from conversational_ai.escalation import EscalationManager

escalation = EscalationManager(
    triggers=[
        {"type": "low_confidence", "threshold": 0.4},
        {"type": "negative_sentiment", "threshold": -0.6},
        {"type": "keyword", "patterns": ["speak to a human", "real person"]},
        {"type": "max_turns_without_resolution", "turns": 8},
    ],
    handoff_format="markdown",
    queue_backend="redis",
)

Configuration

# chatbot_config.yaml
persona:
  name: "Acme Assistant"
  system_prompt: |
    You are a helpful customer support agent for Acme Corp.
    Be friendly but professional. Never make promises about
    refund timelines. Always verify order numbers before lookup.
  tone: "professional_friendly"
  max_response_length: 300

context:
  strategy: "summarizing"
  recent_turns: 5
  summary_model: "gpt-4o-mini"
  max_context_tokens: 4000

intent_classification:
  method: "hybrid"               # rule | llm | hybrid
  model: "gpt-4o-mini"
  confidence_threshold: 0.6
  clarification_prompt: "I'm not sure I understand. Could you rephrase that?"

slots:
  order_id:
    type: "string"
    pattern: '[A-Z]{3}-\d{4}'
    prompt: "Could you share your order number? It looks like ABC-1234."
    required: true
  email:
    type: "email"
    prompt: "What email address is on the account?"

escalation:
  enabled: true
  queue: "redis://localhost:6379/1"
  include_summary: true
  include_sentiment_score: true
  notify_channel: "slack"

Best Practices

Define fallback intents — Always have a graceful "I don't understand" path rather than forcing a bad classification.
Confirm before acting — For any destructive action (cancellation, deletion), always confirm with the user first.
Track conversation quality — Monitor average turns-to-resolution. If it's climbing, your intents or handlers need work.
Use cheap models for classification — Intent classification doesn't need GPT-4. Use gpt-4o-mini or a fine-tuned small model.
Persist sessions — Store conversation state in Redis so users can resume across page reloads or devices.
A/B test personas — Small changes in system prompt wording can significantly impact resolution rates.

Troubleshooting

Problem	Cause	Fix
Bot gives long, rambling responses	No `max_response_length` set or persona too vague	Add explicit length constraints in persona config and system prompt
Intent classification is inaccurate	Too few examples or overlapping intent definitions	Add 10+ diverse examples per intent; merge intents that overlap
Context window overflows on long conversations	Using `sliding_window` with high turn count	Switch to `summarizing` context strategy to compress older turns
Escalation floods human agents	Confidence threshold too high	Lower `confidence_threshold` to 0.5 and add more training examples

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Conversational AI Templates] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

Document AI Toolkit

Thesius Code — Mon, 23 Mar 2026 15:14:04 +0000

Document AI Toolkit

Turn unstructured documents into structured, queryable data. This toolkit provides complete pipelines for parsing PDFs, extracting tables, running OCR on scanned documents, summarizing long-form content, and pulling structured fields from invoices, contracts, and reports. Built as composable pipeline stages so you can mix, match, and extend for your specific document types.

Key Features

PDF Parsing Engine — Extract text, metadata, and layout information from PDFs with support for multi-column layouts and embedded images
OCR Integration — Process scanned documents and images with configurable OCR backends (Tesseract, cloud APIs) and pre-processing for skew correction
Table Extraction — Detect and extract tables from PDFs and images into pandas DataFrames or CSV, handling merged cells and spanning headers
Summarization Chains — Multi-stage summarization for long documents: chunk → summarize → merge, with configurable compression ratios
Structured Data Extraction — Define extraction schemas and pull typed fields (dates, amounts, names, addresses) from any document
Document Classification — Automatically categorize incoming documents by type (invoice, contract, report, letter) before routing to specialized extractors
Batch Processing — Process thousands of documents in parallel with progress tracking, retry logic, and partial-failure handling

Quick Start

from document_ai import DocumentPipeline, stages

# 1. Build a pipeline
pipeline = DocumentPipeline([
    stages.PDFParser(extract_images=True),
    stages.OCR(engine="tesseract", language="eng"),
    stages.TableExtractor(output_format="dataframe"),
    stages.Summarizer(model="gpt-4o-mini", max_summary_length=200),
    stages.StructuredExtractor(schema="schemas/invoice.yaml"),
])

# 2. Process a document
result = pipeline.process("documents/invoice_2025_Q3.pdf")

print(result.text[:500])           # Full extracted text
print(result.tables[0].to_csv())   # First table as CSV
print(result.summary)              # LLM-generated summary
print(result.structured_data)      # {"vendor": "Acme Corp", "total": 4250.00, ...}

Architecture

Input Document (PDF / Image / Scan)
         │
         ▼
┌─────────────────┐
│   PDF Parser    │──── Extract text + layout + embedded images
└────────┬────────┘
         │
         ├── Has text ──────────────▶ Text output
         │
         └── Image/Scan ──▶ ┌───────────┐
                            │    OCR     │──── Text from images
                            └─────┬─────┘
                                  │
         ┌────────────────────────┘
         ▼
┌─────────────────┐
│ Table Extractor │──── Detect tables → DataFrames
└────────┬────────┘
         ▼
┌─────────────────┐
│  Summarizer     │──── Chunk → Summarize → Merge
└────────┬────────┘
         ▼
┌─────────────────┐
│ Schema Extractor│──── Extract typed fields per schema
└────────┬────────┘
         ▼
    DocumentResult (text, tables, summary, structured_data, metadata)

Usage Examples

Define Custom Extraction Schemas

from document_ai.extraction import ExtractionSchema, Field

invoice_schema = ExtractionSchema(
    name="invoice",
    fields=[
        Field("vendor_name", type="string", description="Company that issued the invoice"),
        Field("invoice_number", type="string", pattern=r"INV-\d+"),
        Field("date", type="date", formats=["%Y-%m-%d", "%m/%d/%Y"]),
        Field("line_items", type="list", item_schema={
            "description": "string",
            "quantity": "integer",
            "unit_price": "float",
        }),
        Field("total_amount", type="float", description="Total amount due"),
    ],
)

result = pipeline.process("invoice.pdf", schema=invoice_schema)
print(result.structured_data)
# {"vendor_name": "Acme Corp", "invoice_number": "INV-2025-0042",
#  "date": "2025-03-15", "total_amount": 12750.00, "line_items": [...]}

Batch Processing with Progress Tracking

from document_ai import BatchProcessor
from pathlib import Path

processor = BatchProcessor(
    pipeline=pipeline,
    max_workers=4,
    retry_on_failure=True,
    max_retries=2,
)

results = processor.process_directory(
    Path("documents/inbox/"),
    glob_pattern="*.pdf",
    output_dir=Path("documents/processed/"),
)

print(f"Processed: {results.success_count}/{results.total_count}")
print(f"Failed: {[f.filename for f in results.failures]}")

Multi-Stage Summarization for Long Documents

from document_ai.summarization import MapReduceSummarizer

summarizer = MapReduceSummarizer(
    chunk_size=2000,         # Tokens per chunk
    chunk_overlap=200,       # Overlap between chunks
    map_model="gpt-4o-mini", # Cheap model for individual chunks
    reduce_model="gpt-4o",   # Better model for final merge
    final_length=500,        # Target summary length in tokens
)

summary = summarizer.summarize(long_document_text)

Configuration

# document_ai_config.yaml
pdf_parser:
  extract_images: true
  image_dpi: 300               # DPI for image extraction
  layout_analysis: true        # Detect columns, headers, footers
  password: null               # For encrypted PDFs

ocr:
  engine: "tesseract"          # tesseract | google_vision | aws_textract
  language: "eng"
  preprocessing:
    deskew: true               # Correct page rotation
    denoise: true              # Remove noise from scanned docs
    binarize: true             # Convert to black/white
  confidence_threshold: 0.6    # Below this, flag for manual review

table_extraction:
  detection_method: "lattice"  # lattice | stream | hybrid
  output_format: "dataframe"   # dataframe | csv | json
  merge_adjacent: true         # Merge tables split across pages

summarization:
  model: "gpt-4o-mini"
  strategy: "map_reduce"       # map_reduce | refine | stuff
  chunk_size: 2000
  chunk_overlap: 200
  max_summary_length: 300

extraction:
  model: "gpt-4o"
  confidence_threshold: 0.8   # Flag low-confidence extractions
  validate_types: true         # Enforce field type constraints
  schema_dir: "schemas/"

batch:
  max_workers: 4
  max_retries: 2
  output_format: "json"        # json | csv | parquet
  save_intermediate: false     # Save per-stage outputs for debugging

Best Practices

Pre-process scanned documents — Deskew, denoise, and binarize before OCR to dramatically improve text quality.
Use the cheapest model for summarization chunks — Only use GPT-4 for the final merge step; gpt-4o-mini handles individual chunks well.
Define schemas per document type — Generic extraction is weak. Dedicated schemas for invoices, contracts, and reports yield much higher accuracy.
Validate extraction results — Always check confidence_threshold on extracted fields and route low-confidence items for human review.
Process in batches, not one-by-one — The BatchProcessor handles parallelism, retries, and partial failures automatically.
Keep OCR language packs minimal — Only install language packs you actually need to keep the deployment lightweight.

Troubleshooting

Problem	Cause	Fix
OCR produces garbled text	Poor scan quality or wrong language	Enable preprocessing (deskew, denoise) and verify `language` setting
Table extraction misses tables	Tables use borderless/minimal styling	Switch `detection_method` to `stream` or `hybrid`
Summarization loses critical details	Chunk size too small, important info split	Increase `chunk_overlap` to 300+ and use `refine` strategy
Structured extraction returns null fields	Schema field descriptions too vague	Add specific descriptions and example values to each `Field`

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Document AI Toolkit] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

Fine-Tuning Pipeline

Thesius Code — Mon, 23 Mar 2026 15:14:00 +0000

Fine-Tuning Pipeline

Everything you need to fine-tune open-source LLMs on your own data — from dataset preparation through training to deployment. This pipeline handles LoRA/QLoRA configuration, training data formatting, hyperparameter management, experiment tracking, model merging, and quantized deployment. Designed for teams running fine-tuning on single GPUs or small clusters without deep ML infrastructure expertise.

Key Features

LoRA & QLoRA Training — Parameter-efficient fine-tuning scripts with automatic rank selection, target module detection, and 4-bit quantization support
Dataset Preparation — Convert raw data (CSV, JSON, conversations) into training-ready formats with deduplication, filtering, and train/val/test splits
Hyperparameter Management — Predefined configs for common base models (Llama, Mistral, Phi) with recommended learning rates, batch sizes, and schedules
Training Monitoring — Real-time loss curves, gradient norms, learning rate schedules, and GPU utilization tracking with automatic early stopping
Model Merging — Merge LoRA adapters back into base models with TIES, DARE, and linear merge strategies
Evaluation Suite — Run benchmarks against your test set automatically after training to validate improvement
Deployment Export — Export merged models in GGUF, ONNX, or SafeTensors format for inference serving

Quick Start

from fine_tuning import Pipeline, DatasetConfig, TrainingConfig

# 1. Prepare dataset
dataset = DatasetConfig(
    source="data/customer_support_conversations.jsonl",
    format="sharegpt",           # sharegpt | alpaca | completion
    train_split=0.9,
    val_split=0.1,
    max_length=2048,
    filter_empty=True,
    dedup_threshold=0.95,        # Remove near-duplicate examples
)

# 2. Configure training
training = TrainingConfig(
    base_model="meta-llama/Llama-3.1-8B",
    method="qlora",              # lora | qlora | full
    lora_rank=64,
    lora_alpha=128,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    learning_rate=2e-4,
    batch_size=4,
    gradient_accumulation=4,     # Effective batch size = 16
    epochs=3,
    warmup_ratio=0.05,
    output_dir="outputs/support-bot-v1",
)

# 3. Run pipeline
pipeline = Pipeline(dataset=dataset, training=training)
result = pipeline.run()
print(f"Final val loss: {result.best_val_loss:.4f}")
print(f"Checkpoint: {result.best_checkpoint}")

Architecture

Raw Data (CSV/JSON/JSONL)
         │
         ▼
┌─────────────────┐
│ Dataset Prep    │──── Format, clean, deduplicate, split
└────────┬────────┘
         ▼
┌─────────────────┐
│  Tokenization   │──── Tokenize + pad/truncate to max_length
└────────┬────────┘
         ▼
┌─────────────────┐
│ Training Loop   │──── LoRA/QLoRA with monitoring + checkpoints
│                 │     ↳ Early stopping on val loss plateau
└────────┬────────┘
         ▼
┌─────────────────┐
│  Evaluation     │──── Run test set benchmarks
└────────┬────────┘
         ▼
┌─────────────────┐
│  Model Merge    │──── Merge adapter into base model
└────────┬────────┘
         ▼
┌─────────────────┐
│  Export         │──── GGUF / ONNX / SafeTensors
└─────────────────┘

Usage Examples

Dataset Format Conversion

from fine_tuning.data import DatasetConverter

# Convert from CSV with question/answer columns to ShareGPT format
converter = DatasetConverter()
converter.from_csv(
    path="data/faq.csv",
    user_column="question",
    assistant_column="answer",
    system_prompt="You are a helpful customer support agent.",
    output_path="data/faq_sharegpt.jsonl",
)

# Convert from raw text completions to instruction format
converter.from_completions(
    path="data/code_samples.txt",
    instruction_template="Complete the following code:",
    output_path="data/code_alpaca.jsonl",
)

Hyperparameter Presets for Common Models

from fine_tuning.presets import get_preset

# Load optimized defaults for Llama 3.1 8B on a single A100
config = get_preset(
    model="llama-3.1-8b",
    gpu="a100-40gb",
    task="chat",           # chat | instruction | completion
)
print(config.learning_rate)   # 2e-4
print(config.batch_size)      # 4
print(config.lora_rank)       # 64

Model Merging Strategies

from fine_tuning.merge import ModelMerger

merger = ModelMerger(
    base_model="meta-llama/Llama-3.1-8B",
    adapter_path="outputs/support-bot-v1/best-checkpoint",
    merge_strategy="ties",     # linear | ties | dare
    output_path="models/support-bot-v1-merged",
    output_format="safetensors",
)
merger.merge()

Configuration

# fine_tuning_config.yaml
dataset:
  source: "data/training_data.jsonl"
  format: "sharegpt"
  max_length: 2048
  train_split: 0.9
  val_split: 0.08
  test_split: 0.02
  preprocessing:
    remove_duplicates: true
    min_length: 10              # Skip very short examples
    max_length: 4096            # Skip very long examples
    filter_language: "en"

training:
  base_model: "meta-llama/Llama-3.1-8B"
  method: "qlora"
  quantization_bits: 4
  lora:
    rank: 64
    alpha: 128
    dropout: 0.05
    target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
  optimizer: "adamw_8bit"
  learning_rate: 2e-4
  lr_scheduler: "cosine"
  batch_size: 4
  gradient_accumulation_steps: 4
  epochs: 3
  warmup_ratio: 0.05
  max_grad_norm: 1.0
  save_steps: 100
  eval_steps: 50
  early_stopping_patience: 5

merge:
  strategy: "linear"            # linear | ties | dare
  output_format: "safetensors"  # safetensors | gguf | onnx

monitoring:
  log_to: "tensorboard"         # tensorboard | wandb | csv
  log_dir: "logs/"
  track_gpu_utilization: true

Best Practices

Start with QLoRA on small rank — Begin with rank=16, evaluate, then increase to 32/64 if quality is insufficient. Higher rank = more parameters = slower training.
Clean your data ruthlessly — Fine-tuning amplifies data quality issues. Remove duplicates, fix formatting, and validate every example.
Use a validation set — Always hold out 10% for validation. Watch val loss — if it diverges from train loss, you're overfitting.
Match the chat template — Use the exact chat template (system/user/assistant tags) that the base model was trained with.
Don't over-train — 1-3 epochs is usually sufficient for LoRA. More epochs often leads to catastrophic forgetting of base model capabilities.
Evaluate on real tasks — Loss numbers alone don't tell the full story. Test the fine-tuned model on actual use cases before deploying.
Version your datasets — Hash your training data and log it with each experiment. Reproducibility requires knowing exactly what data was used.

Troubleshooting

Problem	Cause	Fix
Training loss doesn't decrease	Learning rate too low or data format mismatch	Try `5e-4` learning rate; verify chat template matches base model
CUDA out of memory	Batch size too large for GPU VRAM	Reduce `batch_size` to 1 and increase `gradient_accumulation_steps`
Fine-tuned model gives worse results than base	Overfitting or bad training data	Reduce epochs to 1, increase dataset size, check data quality
Merged model produces garbage output	Wrong merge strategy or base model version mismatch	Verify exact base model version matches training; try `linear` merge

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Fine-Tuning Pipeline] with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

LLM Cost Optimizer

Thesius Code — Mon, 23 Mar 2026 15:13:56 +0000

LLM Cost Optimizer

LLM API costs compound fast — a prototype that costs $5/day can become $500/day in production. This toolkit gives you the instrumentation and strategies to cut LLM spending by 40-70% without sacrificing output quality. Token usage tracking, intelligent model routing, semantic caching, batch processing, and budget alerts — all in one package.

Key Features

Token Usage Tracking — Instrument every LLM call with precise input/output token counts, costs, and latency per model, user, and feature
Smart Model Routing — Automatically route simple queries to cheap models (GPT-4o-mini) and complex queries to powerful models (GPT-4o) based on task complexity scoring
Semantic Caching — Cache responses by semantic similarity, not just exact match. "What's the weather in NYC?" and "NYC weather today?" hit the same cache entry
Batch Processing — Queue non-urgent requests and process them in bulk at 50% lower cost using batch APIs
Budget Alerting — Set daily/weekly/monthly spend limits with Slack/email notifications and automatic circuit breakers
Prompt Compression — Automatically shorten prompts by removing redundant context while preserving meaning
Cost Forecasting — Project future costs based on usage trends and planned feature launches

Quick Start

from cost_optimizer import CostTracker, ModelRouter, SemanticCache

# 1. Wrap your LLM client with cost tracking
tracker = CostTracker(
    storage="sqlite:///costs.db",
    alert_threshold_daily=50.00,  # Alert at $50/day
    alert_channel="slack",
)

# 2. Set up model routing
router = ModelRouter(
    rules=[
        {"complexity": "low", "model": "gpt-4o-mini", "max_tokens": 500},
        {"complexity": "medium", "model": "gpt-4o-mini", "max_tokens": 2000},
        {"complexity": "high", "model": "gpt-4o", "max_tokens": 4000},
    ],
    complexity_classifier="keyword",  # keyword | llm | embedding
)

# 3. Add semantic caching
cache = SemanticCache(
    backend="redis",
    embedding_model="text-embedding-3-small",
    similarity_threshold=0.92,
    ttl_seconds=86400,
)

# 4. Use in your application
@tracker.track(feature="customer_support")
@cache.cached()
def answer_question(question: str) -> str:
    model = router.select_model(question)
    response = llm_client.chat(model=model, messages=[{"role": "user", "content": question}])
    return response.content

answer = answer_question("What's your return policy?")

Architecture

Application Request
        │
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log request metadata (pre-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Semantic Cache│──── Hit? Return cached response
└───────┬───────┘
        │ Miss
        ▼
┌───────────────┐
│ Model Router  │──── Score complexity → select model
└───────┬───────┘
        ▼
┌───────────────┐
│ Prompt Compress│──── Shorten prompt if over budget
└───────┬───────┘
        │
        ├── Urgent ──▶ Direct API call
        │
        └── Non-urgent ──▶ Batch queue (50% cheaper)
                              │
        ┌─────────────────────┘
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log tokens, cost, latency (post-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Budget Check  │──── Over limit? Alert + circuit break
└───────────────┘

Usage Examples

Cost Dashboard Queries

from cost_optimizer import CostTracker

tracker = CostTracker(storage="sqlite:///costs.db")

# Daily cost breakdown by model
daily = tracker.report(period="today", group_by="model")
for row in daily:
    print(f"{row.model}: {row.requests} reqs, {row.tokens:,} tokens, ${row.cost:.2f}")

# Weekly trend by feature
weekly = tracker.report(period="7d", group_by="feature")
for row in weekly:
    print(f"{row.feature}: ${row.cost:.2f} ({row.cost_change:+.1%} vs last week)")

# Identify most expensive queries
expensive = tracker.top_queries(period="24h", limit=10, sort_by="cost")
for q in expensive:
    print(f"${q.cost:.3f} | {q.model} | {q.prompt_preview[:80]}...")

Batch Processing for Non-Urgent Requests

from cost_optimizer.batch import BatchQueue

queue = BatchQueue(
    provider="openai",
    check_interval=300,    # Check for results every 5 minutes
    max_batch_size=1000,
)

# Queue requests (returns immediately)
job_ids = []
for question in faq_questions:
    job_id = queue.enqueue(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    job_ids.append(job_id)

# Process batch (50% cheaper than individual calls)
results = queue.process_batch()
for job_id, response in zip(job_ids, results):
    print(f"{job_id}: {response.content[:100]}")

Prompt Compression

from cost_optimizer.compression import PromptCompressor

compressor = PromptCompressor(
    method="extractive",       # extractive | abstractive
    target_ratio=0.6,          # Reduce prompt to 60% of original length
    preserve_instructions=True, # Never compress system instructions
)

original_prompt = "Here is the full context of the document..." # 2000 tokens
compressed = compressor.compress(original_prompt)
print(f"Reduced from {original_prompt.token_count} to {compressed.token_count} tokens")
print(f"Estimated savings: ${compressed.cost_savings:.3f}")

Configuration

# cost_optimizer_config.yaml
tracking:
  storage: "sqlite:///costs.db"   # sqlite | postgres
  retention_days: 365
  log_prompts: false              # Don't store prompt text (privacy)

routing:
  classifier: "keyword"           # keyword | llm | embedding
  rules:
    - complexity: "low"
      keywords: ["simple", "yes/no", "list"]
      model: "gpt-4o-mini"
      max_tokens: 500
    - complexity: "high"
      keywords: ["analyze", "compare", "explain in detail"]
      model: "gpt-4o"
      max_tokens: 4000
  default_model: "gpt-4o-mini"

caching:
  backend: "redis"
  redis_url: "redis://localhost:6379/0"
  embedding_model: "text-embedding-3-small"
  similarity_threshold: 0.92
  ttl_seconds: 86400
  max_cache_entries: 100000

alerts:
  daily_limit: 100.00
  weekly_limit: 500.00
  monthly_limit: 1500.00
  channels:
    - type: "slack"
      webhook: "${SLACK_WEBHOOK_URL}"
    - type: "email"
      to: "user@example.com"
  circuit_breaker:
    enabled: true
    threshold: 200.00            # Hard stop at $200/day
    fallback: "return_error"     # return_error | use_cache_only

batch:
  enabled: true
  provider: "openai"
  max_batch_size: 1000
  check_interval_seconds: 300

Best Practices

Track before optimizing — Instrument all calls for 1-2 weeks to identify where costs actually come from before making changes.
Cache deterministic queries — Customer support FAQs, documentation lookups, and classification tasks have high cache hit potential.
Route by task, not by user — A simple question from a premium user still only needs gpt-4o-mini.
Set circuit breakers — A bug in a loop can burn through your monthly budget in minutes. Hard limits prevent this.
Batch everything that can wait — Nightly report generation, content indexing, and analytics don't need real-time responses.
Compress long contexts — RAG contexts often contain 80% irrelevant text. Compress before sending to the LLM.

Troubleshooting

Problem	Cause	Fix
Semantic cache hit rate is very low	Similarity threshold too high (0.98+)	Lower `similarity_threshold` to 0.90-0.92 and monitor response quality
Model router sends everything to cheap model	Complexity classifier too conservative	Add more keywords for "high" complexity or switch to `embedding` classifier
Budget alerts fire but no one notices	Slack webhook expired or email filtered	Test alert channels weekly; add a secondary channel as backup
Batch processing results arrive too late	`check_interval` too high or batch API backlogged	Reduce interval to 60 seconds; set priority on time-sensitive batches

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Cost Optimizer] with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

LLM Evaluation Framework

Thesius Code — Mon, 23 Mar 2026 15:13:51 +0000

LLM Evaluation Framework

You can't improve what you can't measure. This framework gives you automated, repeatable evaluation harnesses for LLM outputs — with built-in metrics for accuracy, relevance, coherence, and safety, plus custom metric support. Run evaluations in CI/CD, track quality over time, compare models head-to-head, and catch regressions before they reach production.

Key Features

Automated Eval Harnesses — Define test suites as YAML, run them against any model, and get structured scores with statistical significance testing
Built-In Metrics — Accuracy, relevance, coherence, faithfulness, toxicity, and latency measured out of the box
Custom Metrics — Define your own scoring functions (Python callables) and plug them into the evaluation pipeline
Human Feedback Collection — Web-based annotation interface for side-by-side comparisons, Likert scales, and free-text feedback
Regression Testing — Compare current model outputs against a golden baseline and flag any score drops exceeding your threshold
Model Comparison — Run the same eval suite across multiple models/prompts and generate comparison reports with confidence intervals
Quality Monitoring — Continuous evaluation on production traffic with dashboards and alerting on quality degradation
Reproducible Runs — Every evaluation run is versioned with the exact prompt, model, parameters, and dataset hash

Quick Start

from llm_eval import EvalSuite, metrics, Runner

# 1. Define evaluation suite
suite = EvalSuite(
    name="customer_support_v2",
    dataset="eval_data/support_questions.jsonl",
    metrics=[
        metrics.Relevance(model="gpt-4o-mini"),     # LLM-as-judge
        metrics.Faithfulness(sources_key="context"),  # Grounded in context?
        metrics.Coherence(),                          # Well-structured output?
        metrics.Toxicity(threshold=0.1),              # Safe output?
        metrics.Latency(max_p95_ms=2000),             # Performance SLA
    ],
)

# 2. Run evaluation
runner = Runner(model="gpt-4o", temperature=0)
results = runner.evaluate(suite)

# 3. View results
print(results.summary())
# ┌─────────────┬────────┬────────┬────────┐
# │ Metric      │ Mean   │ P5     │ P95    │
# ├─────────────┼────────┼────────┼────────┤
# │ Relevance   │ 0.87   │ 0.72   │ 0.96   │
# │ Faithfulness│ 0.91   │ 0.80   │ 0.98   │
# │ Coherence   │ 0.85   │ 0.68   │ 0.95   │
# │ Toxicity    │ 0.02   │ 0.00   │ 0.08   │
# │ Latency (ms)│ 1240   │ 890    │ 1850   │
# └─────────────┴────────┴────────┴────────┘

Architecture

┌──────────────────────────────────────────────┐
│              Eval Suite Definition            │
│  Dataset + Metrics + Model Config + Baseline  │
└───────────────────┬──────────────────────────┘
                    ▼
┌──────────────────────────────────────────────┐
│                 Runner                        │
│  For each (input, expected) in dataset:       │
│    1. Generate output from model              │
│    2. Score with each metric                  │
│    3. Compare against baseline (if set)       │
└───────────────────┬──────────────────────────┘
                    ▼
┌──────────────────────────────────────────────┐
│              Results Store                    │
│  Scores + Metadata + Diffs + Run ID          │
│                                              │
│  ┌────────────┐  ┌───────────┐  ┌─────────┐ │
│  │ Dashboard  │  │ CI Report │  │ Alerts  │ │
│  └────────────┘  └───────────┘  └─────────┘ │
└──────────────────────────────────────────────┘

Usage Examples

Custom Metrics

from llm_eval.metrics import Metric, MetricResult

class BrandVoiceScore(Metric):
    """Check if output matches brand tone guidelines."""

    name = "brand_voice"

    def __init__(self, guidelines: str):
        self.guidelines = guidelines

    def score(self, input_text: str, output_text: str, **kwargs) -> MetricResult:
        # Use LLM-as-judge to score brand voice adherence
        prompt = f"""Rate how well this response matches our brand voice guidelines.
        Guidelines: {self.guidelines}
        Response: {output_text}
        Score from 0.0 to 1.0:"""

        score = self._llm_judge(prompt)
        return MetricResult(score=score, explanation=f"Brand voice: {score:.2f}")

suite = EvalSuite(
    name="brand_check",
    metrics=[BrandVoiceScore(guidelines="Friendly, concise, no jargon.")],
)

Regression Testing in CI/CD

from llm_eval import RegressionTest

test = RegressionTest(
    suite=suite,
    baseline_run="runs/baseline_2025_03_15",
    max_regression={
        "relevance": 0.05,     # Allow max 5% drop
        "faithfulness": 0.03,  # Allow max 3% drop
        "toxicity": 0.01,      # Almost zero tolerance
    },
)

result = test.run(model="gpt-4o")
if result.has_regressions:
    print("REGRESSIONS DETECTED:")
    for reg in result.regressions:
        print(f"  {reg.metric}: {reg.baseline:.3f} → {reg.current:.3f} ({reg.delta:+.3f})")
    exit(1)  # Fail CI pipeline

Model Comparison

from llm_eval import ModelComparison

comparison = ModelComparison(
    suite=suite,
    models=[
        {"name": "gpt-4o", "temperature": 0},
        {"name": "gpt-4o-mini", "temperature": 0},
        {"name": "claude-sonnet-4-20250514", "temperature": 0},
    ],
)

report = comparison.run()
print(report.ranking())        # Models ranked by aggregate score
print(report.cost_efficiency()) # Score per dollar
report.export_html("reports/model_comparison.html")

Configuration

# eval_config.yaml
suites:
  customer_support:
    dataset: "eval_data/support_questions.jsonl"
    sample_size: 200             # Evaluate on random subset (null = all)
    metrics:
      - name: "relevance"
        judge_model: "gpt-4o-mini"
      - name: "faithfulness"
        sources_key: "context"
      - name: "coherence"
      - name: "toxicity"
        threshold: 0.1
      - name: "latency"
        max_p95_ms: 2000

runner:
  model: "gpt-4o"
  temperature: 0                 # Deterministic for reproducibility
  max_tokens: 1000
  concurrent_requests: 10
  retry_on_failure: true

regression:
  baseline_dir: "baselines/"
  max_regression:
    relevance: 0.05
    faithfulness: 0.03
    coherence: 0.05
    toxicity: 0.01

monitoring:
  enabled: true
  sample_rate: 0.05              # Evaluate 5% of production traffic
  alert_on_degradation: true
  alert_threshold: 0.1           # Alert if metric drops 10% from baseline
  dashboard_port: 8081

storage:
  backend: "sqlite"              # sqlite | postgres
  results_dir: "eval_results/"
  retention_days: 180

Best Practices

Use LLM-as-judge for subjective metrics — Relevance, coherence, and tone are hard to measure with rules. Use a capable model as the judge.
Set baselines early — Run your first eval suite before making changes. You can't detect regression without a baseline.
Evaluate on diverse inputs — Ensure your dataset covers edge cases, long inputs, multi-language queries, and adversarial prompts.
Separate metric concerns — A high relevance score with low faithfulness means the model is making up plausible-sounding answers.
Run evals in CI — Every prompt change, model swap, or system prompt edit should trigger the regression suite.
Monitor production quality — Eval datasets get stale. Sample real production traffic for continuous evaluation.

Troubleshooting

Problem	Cause	Fix
LLM-as-judge scores are inconsistent	Judge model temperature > 0	Set `temperature: 0` for the judge model; run each judgment 3x and average
Eval suite takes too long	Dataset too large or concurrent requests too low	Use `sample_size` to subset and increase `concurrent_requests`
Regression test fails on every run	Baseline is stale or threshold too tight	Update baseline with `test.update_baseline()` and relax thresholds
Toxicity scores are always 0	Test data doesn't include adversarial inputs	Add red-team prompts to your eval dataset to stress-test safety

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Evaluation Framework] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

LLM Prompt Engineering Kit

Thesius Code — Mon, 23 Mar 2026 15:13:47 +0000

LLM Prompt Engineering Kit

Prompts are the new code — and they deserve the same rigor. This kit provides a structured prompt template library, proven chain-of-thought patterns, a version control system for prompts, and an A/B testing framework to measure which prompts actually perform better. Stop guessing and start engineering your prompts systematically.

Key Features

Prompt Template Library — 50+ battle-tested templates for classification, extraction, summarization, code generation, and creative writing with variable interpolation
Chain-of-Thought Patterns — Ready-to-use CoT, Tree-of-Thought, and self-consistency templates that improve reasoning accuracy by 15-40%
Few-Shot Example Management — Dynamic example selection based on semantic similarity to the input query
Prompt Versioning — Git-like version control for prompts with diff, rollback, and deployment tagging
A/B Testing Framework — Split traffic between prompt variants, measure performance metrics, and declare winners with statistical significance
Prompt Optimization — Automated prompt refinement using DSPy-style optimization loops
Output Format Enforcement — JSON schema constraints, regex validation, and structured output parsing built into templates

Quick Start

from prompt_kit import PromptTemplate, PromptRegistry, FewShotSelector

# 1. Define a prompt template with variables
template = PromptTemplate(
    name="customer_classifier",
    version="1.2",
    system="""You are a customer intent classifier. Classify the customer message
into exactly one of these categories: {categories}.
Respond with JSON: {{"intent": "<category>", "confidence": <0.0-1.0>}}""",
    user="{customer_message}",
    variables={
        "categories": ["billing", "technical", "sales", "cancellation", "general"],
    },
    output_schema={
        "type": "object",
        "properties": {
            "intent": {"type": "string"},
            "confidence": {"type": "number"},
        },
        "required": ["intent", "confidence"],
    },
)

# 2. Render and use
prompt = template.render(customer_message="I can't log into my account")
print(prompt.system)  # Fully interpolated system prompt
print(prompt.user)    # "I can't log into my account"

# 3. Parse and validate output
result = template.parse_output('{"intent": "technical", "confidence": 0.92}')
print(result.intent)      # "technical"
print(result.confidence)  # 0.92

Architecture

┌────────────────────────────────────────────┐
│            Prompt Registry                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐   │
│  │Template A│ │Template B│ │Template C│   │
│  │ v1.0     │ │ v2.1     │ │ v1.3     │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘   │
│       │            │            │          │
│  ┌────▼────────────▼────────────▼──────┐   │
│  │         Version Control             │   │
│  │  Diff │ Rollback │ Tags │ History   │   │
│  └────────────────┬────────────────────┘   │
│                   │                        │
│  ┌────────────────▼────────────────────┐   │
│  │         A/B Test Engine             │   │
│  │  Split │ Measure │ Significance     │   │
│  └─────────────────────────────────────┘   │
└────────────────────────────────────────────┘
        │
        ▼
┌────────────────┐    ┌────────────────┐
│ Few-Shot       │    │ Output         │
│ Selector       │    │ Parser         │
│ (semantic sim) │    │ (JSON schema)  │
└────────────────┘    └────────────────┘

Usage Examples

Chain-of-Thought Prompting

from prompt_kit.patterns import ChainOfThought, SelfConsistency

# Standard CoT
cot = ChainOfThought(
    task="Solve this math problem step by step.",
    examples=[
        {"input": "What is 15% of 240?", "reasoning": "15% = 0.15. 0.15 × 240 = 36.", "answer": "36"},
    ],
)
prompt = cot.render(input="A store has a 25% off sale. The original price is $80. What is the sale price?")

# Self-consistency: Run CoT N times, take majority vote
sc = SelfConsistency(
    chain=cot,
    num_samples=5,
    aggregation="majority_vote",  # majority_vote | weighted
)
result = sc.run(input="If a train travels 60mph for 2.5 hours, how far does it go?")
print(result.answer)      # "150 miles"
print(result.consistency)  # 1.0 (all 5 samples agreed)

Dynamic Few-Shot Selection

from prompt_kit import FewShotSelector

selector = FewShotSelector(
    examples_path="examples/classification_examples.jsonl",
    embedding_model="text-embedding-3-small",
    num_examples=3,   # Select 3 most similar examples
)

# Automatically picks the most relevant examples for this input
examples = selector.select("My payment was declined but the charge still shows up")
# Returns 3 examples semantically closest to billing/payment issues

Prompt Versioning

from prompt_kit import PromptRegistry

registry = PromptRegistry(storage="prompts/")

# Save a version
registry.save(template, tag="production")

# List versions
versions = registry.history("customer_classifier")
for v in versions:
    print(f"v{v.version} | {v.timestamp} | {v.tag or 'untagged'}")

# Roll back
registry.rollback("customer_classifier", to_version="1.1")

# Diff two versions
diff = registry.diff("customer_classifier", "1.1", "1.2")
print(diff)  # Shows exact prompt text changes

A/B Testing Prompts

from prompt_kit.ab_test import ABTest

test = ABTest(
    name="classifier_v1_vs_v2",
    variants={
        "control": registry.get("customer_classifier", version="1.1"),
        "treatment": registry.get("customer_classifier", version="1.2"),
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    metric="accuracy",
    min_samples=500,
    significance_level=0.05,
)

# Route each request through the test
variant, prompt = test.assign(user_id="user_456")
# ... use prompt, collect result ...
test.record(user_id="user_456", metric_value=1.0)  # Correct classification

# Check results
report = test.report()
print(f"Control: {report.control.mean:.3f} ± {report.control.ci:.3f}")
print(f"Treatment: {report.treatment.mean:.3f} ± {report.treatment.ci:.3f}")
print(f"Winner: {report.winner} (p={report.p_value:.4f})")

Configuration

# prompt_kit_config.yaml
registry:
  storage: "prompts/"            # Local directory for prompt files
  format: "yaml"                 # yaml | json
  auto_version: true             # Auto-increment version on save

few_shot:
  examples_dir: "examples/"
  embedding_model: "text-embedding-3-small"
  num_examples: 3
  similarity_metric: "cosine"    # cosine | euclidean

ab_testing:
  storage: "sqlite:///ab_tests.db"
  default_traffic_split: 0.5
  min_samples_per_variant: 500
  significance_level: 0.05
  auto_promote_winner: false     # Require manual promotion

output_parsing:
  strict_mode: true              # Fail on schema violation
  max_retries: 3                 # Retry with corrective prompt
  retry_model: "gpt-4o-mini"    # Cheap model for format fixing

templates_dir: "templates/"      # Pre-built template library

Best Practices

Version every prompt change — A single word change in a system prompt can shift behavior dramatically. Track every edit.
Use few-shot examples over long instructions — Models learn patterns from examples better than from verbose rules.
Test with edge cases — Your prompt works on happy paths. Test it with empty inputs, adversarial inputs, and multi-language content.
Measure before declaring winners — Run A/B tests to statistical significance (p < 0.05). Gut feeling is not a metric.
Separate concerns in prompts — Keep persona, task instructions, format requirements, and examples in distinct sections.
Pin model versions — A prompt optimized for gpt-4o-2024-08-06 may perform differently on the next model release.

Troubleshooting

Problem	Cause	Fix
Output doesn't match JSON schema	Model ignores schema constraints	Add "You MUST respond with valid JSON" to system prompt and enable `max_retries`
Few-shot selector returns irrelevant examples	Embedding model mismatch or too few examples	Use `text-embedding-3-small` and add 50+ diverse examples to the pool
A/B test never reaches significance	Effect size too small or too few samples	Increase `min_samples` or accept that the variants are equivalent
Prompt version rollback doesn't change behavior	Application caching old prompt	Clear application-level prompt cache after rollback

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Prompt Engineering Kit] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

RAG Pipeline Framework

Thesius Code — Mon, 23 Mar 2026 15:13:43 +0000

RAG Pipeline Framework

Retrieval-Augmented Generation is the most practical way to give LLMs access to your private data — but getting it right is harder than the tutorials suggest. This framework provides the complete pipeline: document ingestion, intelligent chunking, embedding generation, vector store integration, retrieval with reranking, and answer generation with source citations. Plus evaluation tools to measure whether your RAG system actually returns correct answers.

Key Features

Document Ingestion — Load PDFs, Markdown, HTML, Word docs, and plain text with automatic format detection and metadata extraction
Chunking Strategies — Fixed-size, semantic, recursive, and document-structure-aware chunking with configurable overlap
Embedding Generation — Support for OpenAI, Cohere, and local embedding models with automatic batching and rate limiting
Vector Store Integration — Pluggable backends for ChromaDB, Pinecone, Weaviate, pgvector, and FAISS with unified query interface
Hybrid Search — Combine dense vector search with sparse keyword search (BM25) for higher recall
Reranking — Cross-encoder reranking to boost precision after initial retrieval
Citation Generation — Automatically include source references in generated answers with page numbers and document titles
RAG Evaluation — Measure retrieval precision, answer faithfulness, and end-to-end accuracy

Quick Start

from rag_pipeline import RAGPipeline, ChunkingStrategy, VectorStore

# 1. Configure the pipeline
pipeline = RAGPipeline(
    chunking=ChunkingStrategy(
        method="recursive",
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "],
    ),
    embedding_model="text-embedding-3-small",
    vector_store=VectorStore(
        backend="chromadb",
        collection="company_docs",
        persist_dir="./vector_db",
    ),
    llm_model="gpt-4o",
)

# 2. Ingest documents
pipeline.ingest("documents/", glob="**/*.{pdf,md,txt}")
print(f"Ingested {pipeline.document_count} documents, {pipeline.chunk_count} chunks")

# 3. Query with RAG
result = pipeline.query("What is our refund policy for enterprise customers?")
print(result.answer)
print(f"\nSources ({len(result.sources)}):")
for src in result.sources:
    print(f"  - {src.document}: p.{src.page} (relevance: {src.score:.2f})")

Architecture

Documents (PDF, MD, HTML, TXT)
         │
         ▼
┌─────────────────┐
│  Document Loader│──── Parse + extract text + metadata
└────────┬────────┘
         ▼
┌─────────────────┐
│   Chunker       │──── Split into retrieval-sized chunks
└────────┬────────┘
         ▼
┌─────────────────┐
│  Embedder       │──── Generate vector embeddings
└────────┬────────┘
         ▼
┌─────────────────┐
│  Vector Store   │──── Store embeddings + metadata
└────────┬────────┘
         │
    Query Time:
         │
User Query ──▶ Embed ──▶ Vector Search ──▶ Rerank ──▶ LLM + Context ──▶ Answer
                              │                              │
                         Top K chunks                  Source citations

Usage Examples

Chunking Strategy Comparison

from rag_pipeline.chunking import FixedSizeChunker, SemanticChunker, RecursiveChunker

# Fixed-size: Predictable chunk sizes, may break mid-sentence
fixed = FixedSizeChunker(chunk_size=500, overlap=50)

# Semantic: Chunks at natural topic boundaries using embeddings
semantic = SemanticChunker(
    embedding_model="text-embedding-3-small",
    breakpoint_threshold=0.3,   # Lower = more chunks
    min_chunk_size=100,
    max_chunk_size=1000,
)

# Recursive: Tries multiple separators in order of preference
recursive = RecursiveChunker(
    chunk_size=500,
    overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # Paragraphs → lines → sentences → words
)

Hybrid Search with Reranking

from rag_pipeline.retrieval import HybridRetriever, CrossEncoderReranker

retriever = HybridRetriever(
    dense_weight=0.7,           # 70% vector similarity
    sparse_weight=0.3,          # 30% BM25 keyword match
    top_k=20,                   # Retrieve 20 candidates
    reranker=CrossEncoderReranker(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n=5,                # Return top 5 after reranking
    ),
)

results = retriever.search("enterprise refund policy", vector_store=store)
for r in results:
    print(f"[{r.score:.3f}] {r.document} — {r.text[:100]}...")

RAG Evaluation

from rag_pipeline.evaluation import RAGEvaluator

evaluator = RAGEvaluator(
    test_dataset="eval_data/qa_pairs.jsonl",  # Questions with ground truth answers
    metrics=["retrieval_precision", "answer_faithfulness", "answer_correctness"],
    judge_model="gpt-4o-mini",
)

scores = evaluator.evaluate(pipeline)
print(f"Retrieval Precision@5: {scores.retrieval_precision:.2%}")
print(f"Answer Faithfulness:   {scores.answer_faithfulness:.2%}")
print(f"Answer Correctness:    {scores.answer_correctness:.2%}")

Metadata Filtering

# Only search within specific document types or date ranges
result = pipeline.query(
    "What changed in the Q3 update?",
    filters={
        "document_type": "changelog",
        "date": {"$gte": "2025-07-01"},
        "department": "engineering",
    },
    top_k=5,
)

Configuration

# rag_config.yaml
ingestion:
  supported_formats: ["pdf", "md", "txt", "html", "docx"]
  metadata_extraction: true
  skip_existing: true            # Don't re-ingest unchanged docs
  watch_directory: false         # Auto-ingest new files

chunking:
  method: "recursive"            # fixed | semantic | recursive
  chunk_size: 500                # Tokens
  chunk_overlap: 50
  separators: ["\n\n", "\n", ". ", " "]
  include_metadata: true         # Attach source doc metadata to chunks

embedding:
  model: "text-embedding-3-small"
  batch_size: 100
  rate_limit_rpm: 3000
  dimensions: 1536

vector_store:
  backend: "chromadb"            # chromadb | pinecone | weaviate | pgvector | faiss
  collection: "company_docs"
  persist_dir: "./vector_db"

retrieval:
  method: "hybrid"               # dense | sparse | hybrid
  dense_weight: 0.7
  sparse_weight: 0.3
  top_k: 20
  reranker:
    enabled: true
    model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    top_n: 5

generation:
  model: "gpt-4o"
  temperature: 0.1
  max_tokens: 1000
  system_prompt: |
    Answer the question based ONLY on the provided context.
    If the context doesn't contain the answer, say "I don't have
    enough information to answer that." Always cite your sources.
  include_citations: true

evaluation:
  test_dataset: "eval_data/qa_pairs.jsonl"
  metrics: ["retrieval_precision", "answer_faithfulness", "answer_correctness"]
  judge_model: "gpt-4o-mini"
  sample_size: 100

Best Practices

Chunk size matters more than you think — Too small (< 200 tokens) loses context. Too large (> 1000 tokens) dilutes relevance. Start at 500 and tune with eval.
Always use overlap — 10-20% overlap prevents information loss at chunk boundaries.
Hybrid search beats dense-only — Adding BM25 catches keyword-exact matches that embedding models miss (proper nouns, error codes, IDs).
Reranking is worth the latency — A cross-encoder reranker on your top-20 results dramatically improves the top-5 quality.
Include metadata in chunks — Prepend document title, section headers, and dates to each chunk. The LLM needs this context.
Evaluate on real questions — Build a test set of 100+ real user questions with ground truth answers. Run it after every pipeline change.

Troubleshooting

Problem	Cause	Fix
Retrieved chunks are irrelevant to the query	Chunk size too large or embedding model weak	Reduce chunk size to 300-500; try `text-embedding-3-large` for better retrieval
Answers hallucinate despite relevant context	System prompt doesn't enforce grounding	Add "ONLY use the provided context" constraint and enable faithfulness evaluation
Ingestion is slow on large document sets	Embedding API rate limits	Increase `batch_size`, use local embedding model, or set `skip_existing: true`
Duplicate chunks from overlapping documents	Same content in multiple source files	Enable deduplication with `dedup_threshold: 0.95` in chunking config

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [RAG Pipeline Framework] with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

Vector Database Toolkit

Thesius Code — Mon, 23 Mar 2026 15:13:38 +0000

Vector Database Toolkit

Vector databases are the backbone of every RAG pipeline, semantic search engine, and recommendation system — but each one has different APIs, indexing strategies, and operational quirks. This toolkit gives you unified setup guides, working code examples, and benchmarking scripts for ChromaDB, Pinecone, Weaviate, and pgvector. Plus hybrid search patterns, indexing strategies, and production operational guides.

Key Features

Multi-Database Support — Unified Python client abstraction for ChromaDB, Pinecone, Weaviate, and pgvector with consistent CRUD operations
Setup & Migration Guides — Step-by-step setup for each database, including Docker configs, cloud provisioning, and schema migration scripts
Indexing Strategies — HNSW, IVF, and PQ index configuration with tuning guides for recall vs. speed tradeoffs
Hybrid Search — Combine dense vector search with sparse keyword search across all supported backends
Benchmarking Scripts — Measure query latency, throughput, recall@K, and memory usage across databases with your own data
Production Operations — Backup/restore procedures, monitoring queries, scaling guides, and cost estimation per database
Embedding Pipeline — Batch embedding generation with rate limiting, retry logic, and incremental upsert support

Quick Start

from vector_toolkit import VectorClient, EmbeddingPipeline

# 1. Initialize with any backend (same API for all)
client = VectorClient(
    backend="chromadb",
    connection={
        "persist_directory": "./chroma_db",
    },
    collection="product_catalog",
    embedding_model="text-embedding-3-small",
    dimensions=1536,
)

# 2. Index documents
documents = [
    {"id": "doc_1", "text": "Premium leather wallet with RFID blocking", "category": "accessories"},
    {"id": "doc_2", "text": "Wireless noise-canceling headphones", "category": "electronics"},
    {"id": "doc_3", "text": "Organic cotton crew neck t-shirt", "category": "apparel"},
]
client.upsert(documents, text_key="text", metadata_keys=["category"])

# 3. Search
results = client.search("high-quality audio equipment", top_k=5)
for r in results:
    print(f"[{r.score:.3f}] {r.id}: {r.text}")

Architecture

┌─────────────────────────────────────────────┐
│           VectorClient (Unified API)         │
│                                              │
│  upsert() │ search() │ delete() │ count()   │
└──────────────────┬───────────────────────────┘
                   │
    ┌──────────────┼──────────────┐
    │              │              │
    ▼              ▼              ▼
┌────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ChromaDB│  │ Pinecone │  │ Weaviate │  │ pgvector │
│(local) │  │ (cloud)  │  │(hybrid)  │  │  (SQL)   │
└────────┘  └──────────┘  └──────────┘  └──────────┘
                   │
                   ▼
         ┌─────────────────┐
         │EmbeddingPipeline│
         │ Batch + Rate    │
         │ Limit + Retry   │
         └─────────────────┘

Usage Examples

Switch Backends Without Code Changes

# Development: Use ChromaDB (local, no setup)
dev_client = VectorClient(backend="chromadb", connection={"persist_directory": "./db"})

# Staging: Use pgvector (existing Postgres)
staging_client = VectorClient(
    backend="pgvector",
    connection={
        "host": "localhost",
        "port": 5432,
        "database": "vectors",
        "user": "app_user",
        "password": "${PGVECTOR_PASSWORD}",
    },
)

# Production: Use Pinecone (managed, scalable)
prod_client = VectorClient(
    backend="pinecone",
    connection={
        "api_key": "${PINECONE_API_KEY}",
        "environment": "us-east-1",
        "index_name": "product-catalog",
    },
)

# Same code works with all three:
results = client.search("wireless headphones", top_k=5)

Hybrid Search (Dense + Sparse)

from vector_toolkit.search import HybridSearch

hybrid = HybridSearch(
    client=client,
    dense_weight=0.7,
    sparse_weight=0.3,     # BM25 keyword matching
    fusion="reciprocal_rank",  # reciprocal_rank | weighted_sum
)

results = hybrid.search(
    query="error code ERR-4012 connection timeout",
    top_k=10,
    filters={"category": "troubleshooting"},
)
# Dense search finds semantically similar docs about connection issues
# Sparse search catches the exact error code "ERR-4012"

Benchmarking

from vector_toolkit.benchmark import Benchmark

bench = Benchmark(
    backends=["chromadb", "pgvector", "pinecone"],
    dataset="benchmark_data/1M_embeddings.npy",
    queries="benchmark_data/1000_queries.npy",
    ground_truth="benchmark_data/ground_truth.json",
    metrics=["latency_p50", "latency_p99", "recall_at_10", "throughput_qps"],
)

report = bench.run()
print(report.table())
# ┌───────────┬────────────┬────────────┬───────────┬──────────┐
# │ Backend   │ Latency P50│ Latency P99│ Recall@10 │ QPS      │
# ├───────────┼────────────┼────────────┼───────────┼──────────┤
# │ ChromaDB  │ 12ms       │ 45ms       │ 0.94      │ 180      │
# │ pgvector  │ 18ms       │ 62ms       │ 0.92      │ 250      │
# │ Pinecone  │ 22ms       │ 58ms       │ 0.96      │ 1200     │
# └───────────┴────────────┴────────────┴───────────┴──────────┘

Index Tuning

from vector_toolkit.indexing import IndexConfig

# HNSW (best for most use cases)
hnsw_config = IndexConfig(
    type="hnsw",
    m=16,                    # Connections per node (higher = better recall, more memory)
    ef_construction=200,     # Build-time accuracy (higher = better index, slower build)
    ef_search=100,           # Query-time accuracy (higher = better recall, slower query)
)

# IVF (better for very large datasets > 10M vectors)
ivf_config = IndexConfig(
    type="ivf",
    nlist=1024,              # Number of clusters
    nprobe=32,               # Clusters to search (higher = better recall, slower)
)

client.create_index(config=hnsw_config)

Configuration

# vector_toolkit_config.yaml
default_backend: "chromadb"

backends:
  chromadb:
    persist_directory: "./chroma_db"
    anonymized_telemetry: false

  pinecone:
    api_key: "${PINECONE_API_KEY}"
    environment: "us-east-1"
    index_name: "product-catalog"
    metric: "cosine"             # cosine | dotproduct | euclidean
    pod_type: "s1.x1"

  weaviate:
    url: "https://api.example.com"
    api_key: "${WEAVIATE_API_KEY}"
    schema_auto_create: true

  pgvector:
    host: "localhost"
    port: 5432
    database: "vectors"
    user: "${PG_USER}"
    password: "${PG_PASSWORD}"
    pool_size: 10

embedding:
  model: "text-embedding-3-small"
  dimensions: 1536
  batch_size: 100
  rate_limit_rpm: 3000
  retry_max: 3
  retry_delay_seconds: 1

indexing:
  type: "hnsw"
  m: 16
  ef_construction: 200
  ef_search: 100

search:
  default_top_k: 10
  hybrid_enabled: true
  dense_weight: 0.7
  sparse_weight: 0.3
  fusion_method: "reciprocal_rank"

benchmark:
  dataset_sizes: [10000, 100000, 1000000]
  query_count: 1000
  output_dir: "benchmark_results/"

Best Practices

Start with ChromaDB locally, migrate to managed in production — ChromaDB requires zero setup for prototyping; switch to Pinecone/Weaviate when you need scale.
Choose the right distance metric — Use cosine for normalized embeddings (most common), dotproduct for unnormalized, euclidean for absolute distances.
Tune HNSW parameters for your recall target — Default m=16, ef=100 gives ~95% recall. For 99%+ recall, increase ef_search to 200+.
Use metadata filters before vector search — Filtering first, then searching the filtered subset is much faster than searching everything and post-filtering.
Batch your upserts — Insert documents in batches of 100-500. Single-document inserts are 10-50x slower.
Benchmark with YOUR data — Published benchmarks use synthetic data. Run the benchmarking scripts with your actual embeddings and query patterns.

Troubleshooting

Problem	Cause	Fix
Search returns irrelevant results	Wrong distance metric or poor embedding model	Switch to `cosine` metric; try `text-embedding-3-large` for better quality
Upsert is extremely slow	Single-document inserts or no batching	Use `client.upsert_batch()` with batch_size=500
pgvector queries slow on large tables	Missing HNSW or IVF index	Run `client.create_index()` — without an index, pgvector does brute-force scan
Pinecone returns timeout errors	Index not fully initialized or quota exceeded	Wait 2-3 minutes after index creation; check plan limits in Pinecone console

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Vector Database Toolkit] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

Experiment Tracking Pack

Thesius Code — Mon, 23 Mar 2026 15:13:34 +0000

Experiment Tracking Pack

Production-ready experiment tracking with Weights & Biases and MLflow. Stop losing track of what you tried — log every hyperparameter, metric, and artifact automatically. Compare runs side-by-side, reproduce any experiment, and share results with your team.

Key Features

Dual-backend tracking — log to W&B and MLflow simultaneously with a unified API
Custom comparison dashboards — pre-built templates for metric visualization across runs
Hyperparameter sweep tracking — structured logging for grid, random, and Bayesian searches
Artifact versioning — automatically version model checkpoints, datasets, and configs
Reproducibility configs — capture environment, git hash, and random seeds per experiment
Team collaboration — shared project dashboards with role-based access patterns
Alerting on metric regression — configurable thresholds that flag degraded runs early
Export and reporting — generate PDF/HTML reports from tracked experiments

Quick Start

# 1. Copy and edit the config
cp config.example.yaml config.yaml

# 2. Set credentials
export WANDB_API_KEY=YOUR_API_KEY_HERE
export MLFLOW_TRACKING_URI=http://localhost:5000

# 3. Run your first tracked experiment
python examples/tracked_experiment.py

"""Minimal tracked training loop."""
import wandb
import mlflow
from tracker import ExperimentTracker

config = {
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32,
    "model": "resnet18",
}

tracker = ExperimentTracker(
    project="image-classification",
    backends=["wandb", "mlflow"],
    config=config,
)

with tracker.start_run(run_name="baseline-v1"):
    for epoch in range(config["epochs"]):
        train_loss = train_one_epoch(model, dataloader)
        val_acc = evaluate(model, val_loader)

        tracker.log({
            "train/loss": train_loss,
            "val/accuracy": val_acc,
            "epoch": epoch,
        })

    tracker.log_artifact("model.pt", artifact_type="model")
    tracker.log_artifact("config.yaml", artifact_type="config")

Architecture

experiment-tracking-pack/
├── config.example.yaml          # Tracking backend configuration
├── templates/
│   ├── tracker.py               # Unified ExperimentTracker class
│   ├── callbacks.py             # Training framework callbacks (PyTorch, sklearn)
│   ├── dashboards/              # Pre-built W&B dashboard JSON exports
│   │   ├── training_overview.json
│   │   └── hyperparam_comparison.json
│   └── reports/                 # Report generation templates
├── docs/
│   ├── overview.md              # Full architecture walkthrough
│   ├── patterns/                # Tracking patterns for common scenarios
│   └── checklists/
│       └── pre-deployment.md    # Go-live checklist
└── examples/
    ├── tracked_experiment.py    # Basic usage
    └── sweep_tracking.py        # Hyperparameter sweep logging

The ExperimentTracker wraps both W&B and MLflow behind a single interface. You call tracker.log() once and metrics flow to both backends. Switch backends by editing config.yaml — zero code changes.

Usage Examples

PyTorch Lightning Callback

from tracker import ExperimentTracker
import pytorch_lightning as pl

class TrackingCallback(pl.Callback):
    def __init__(self, tracker: ExperimentTracker):
        self.tracker = tracker

    def on_train_epoch_end(self, trainer, pl_module):
        metrics = trainer.callback_metrics
        self.tracker.log({
            "train/loss": metrics["train_loss"].item(),
            "val/accuracy": metrics.get("val_acc", 0.0),
            "epoch": trainer.current_epoch,
        })

Comparing Runs Programmatically

from tracker import ExperimentTracker

tracker = ExperimentTracker(project="image-classification")
runs = tracker.get_runs(filters={"tag": "baseline"}, order_by="val/accuracy")

for run in runs[:5]:
    print(f"{run.name}: acc={run.metrics['val/accuracy']:.4f}, "
          f"lr={run.config['learning_rate']}")

Configuration

# config.example.yaml
project_name: "my-ml-project"

backends:
  wandb:
    enabled: true
    entity: "your-team"         # W&B team or username
    log_model: true             # Upload model artifacts
    log_code: true              # Snapshot source code

  mlflow:
    enabled: true
    tracking_uri: "http://localhost:5000"
    registry_uri: "sqlite:///mlflow.db"
    auto_log: true              # Enable MLflow autologging

logging:
  log_frequency: 10             # Log every N steps
  log_system_metrics: true      # GPU utilization, memory
  capture_git_hash: true        # Record git commit
  capture_env: true             # Record pip freeze

Best Practices

Log config at run start — always pass your full hyperparameter dict to the tracker before training begins
Use tags, not names, for filtering — run names should be human-readable; use tags like ["baseline", "v2", "augmented"] for programmatic queries
Set metric summary modes — configure W&B to track min(val/loss) and max(val/accuracy) for leaderboard views
Version your tracking config — commit config.yaml to git so experiment setup is reproducible
Use run groups for sweeps — group related hyperparameter search runs for cleaner dashboards

Troubleshooting

Problem	Cause	Fix
`wandb: ERROR Run initialization failed`	Invalid API key or network issue	Verify `WANDB_API_KEY` with `wandb login --verify`
Metrics not appearing in MLflow UI	Wrong `tracking_uri` or MLflow server down	Check `mlflow server` is running; test with `curl $MLFLOW_TRACKING_URI/api/2.0/mlflow/experiments/list`
Duplicate runs on resume	Missing `resume` flag	Set `tracker.start_run(resume="must")` for resumed training
Slow logging with large artifacts	Synchronous upload blocking training	Enable `async_upload: true` in config or log artifacts only at end of run

This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete [Experiment Tracking Pack] with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.

Get the Complete Bundle →

Forem: Thesius Code

AI Agent Framework

AI Agent Framework

Key Features

Quick Start

Architecture

Usage Examples

ReAct Planning Loop

Multi-Agent Delegation

Memory Configuration

Configuration

Best Practices

Troubleshooting

Related Articles

AI API Gateway

AI API Gateway

Key Features

Quick Start

Architecture

Usage Examples

Fallback Routing with Cost Controls

Streaming Responses

Custom Middleware

Usage Analytics

Configuration

Best Practices

Troubleshooting

Related Articles

AI Safety & Guardrails Kit

AI Safety & Guardrails Kit

Key Features

Quick Start

Architecture

Usage Examples

Custom Content Policies

Hallucination Detection Against Source Documents

Streaming Output Filtering

Configuration

Best Practices

Troubleshooting

Related Articles

Conversational AI Templates

Conversational AI Templates

Key Features

Quick Start

Architecture

Usage Examples

Conversation State Machine

Context Management Strategies

Human Escalation with Context Transfer

Configuration

Best Practices

Troubleshooting

Related Articles

Document AI Toolkit

Document AI Toolkit

Key Features

Quick Start

Architecture

Usage Examples

Define Custom Extraction Schemas

Batch Processing with Progress Tracking

Multi-Stage Summarization for Long Documents

Configuration

Best Practices

Troubleshooting

Related Articles

Fine-Tuning Pipeline

Fine-Tuning Pipeline

Key Features

Quick Start

Architecture

Usage Examples

Dataset Format Conversion

Hyperparameter Presets for Common Models

Model Merging Strategies

Configuration

Best Practices

Troubleshooting

Related Articles