Forem: Johann Hagerer

Presentation Slides for #StopTheSlop

Johann Hagerer — Wed, 15 Apr 2026 09:19:41 +0000

I presented the following slides as a lightning talk at PyCon DE 2026.

Vimeo Link - jump to 05:45:00 to watch the presentation.

Stop the Slop!

A 5-Minute Call to Action for the AI-Assisted Workplace

PyCon DE 2026 — Lightning Talk

It's Official.

"AI slop": Merriam-Webster 2025 Word of the Year.

Low-quality digital content produced, usually in quantity, by means of AI.

Like "spam", but now it's coming from your team.
Your colleagues, PRs, tickets, chat.
No spam filter, no mute button, no protection.

Nobody is safe.

The Numbers

40% of workers received "AI slop" in the past month
15% of all content received at work is AI slop
$9M / year in lost productivity for a 10k-person org
61% hide their AI use from colleagues
55% pass off AI output as their own work
2/3 never check AI output before sending it

-> AI should save time. Now, we're drowning.

curl Rage-Quit

Daniel Stenberg (curl maintainer) shut down a bug bounty after too much AI slop:

"We now ban every reporter INSTANTLY who submits AI slop. We are effectively being DDoSed."

curl's security.txt now reads:

"We will ban you and ridicule you in public if you waste our time on crap reports."

📊 20 submissions in 16 hours. None were real vulnerabilities.

Code Reviews Gone Wrong

A study of 1.1k developer posts:

Individual devs ship faster. Reviewers, maintainers & the community pay the price.

Example: An AI agent hallucinated external services and then mocked them out, producing an internally consistent but verbose hallucination.

"They're literally just using you to do their job — critically evaluate their AI slop and give it the next prompt."

What Can You Actually Do?

As a colleague:

Be transparent — "I used AI for this" lets collaborators calibrate effort.
- BTW: I actually used AI for this presentation
Keep generations small — Every slide in this presentation is short and thus easily quality assured.
Manually refine — Read all AI output. Make sure you understand it and remove unnecessary and wrong information.

What Can You Actually Do?

As a developer:

Two-prompt rule — If the 2nd attempt doesn't improve, stop prompting. Write it yourself.
Keep generations small — If you can't explain the output, reject it.
Be transparent — "I used AI for this" lets reviewers calibrate effort.
Don't dump AI plans on teammates — Pasting a raw Claude plan into a ticket is asking others to do your work.

What Can You Actually Do?

As a team:

Review = understanding — If no one understands the code, it doesn't ship.
Name it — If you see slop, say "this looks like slop." Normalize the feedback.
Rule of thumb: If you can't defend it in a review, don't submit it.
Establish best practices: DSPI (look it up!)

Remember:

AI slop is not a technology problem.

It's a laziness problem.

The tool is fine. The copy-paste is the crime.

Resources:

🌐 stoptheslop.dev — Internal dev team policy template
📄 arxiv.org/html/2603.27249 — "An Endless Stream of AI Slop" (academic study)
📝 daniel.haxx.se/blog — curl maintainer's AI slop chronicles
🔧 github.com/hardikpandya/stop-slop — Claude skill to de-slop your prose

Sources:

BetterUp Labs / Stanford Social Media Lab (2025) — hbr.org/2026/01/why-people-create-ai-workslop-and-how-to-stop-it
Global survey of 32,352 workers across 47 countries — theconversation.com/ai-workslop-is-creating-unnecessary-extra-work-heres-how-we-can-stop-it-267110
Heidelberg / Melbourne / Singapore, 2026

Spec-Driven Development Based on DSPI: Design-Specify-Plan-Implement

Johann Hagerer — Wed, 15 Apr 2026 08:37:58 +0000

Abstract

DSPI (Design → Specify → Plan → Implement) is a phased workflow for software development with AI coding agents like Claude Code and Snowflake Cortex Code. Instead of throwing a ticket description at an agent and hoping for the best, you break the work into four gated phases --- each producing a markdown document that becomes the contract for the next phase. The result is a structured paper trail of architecture decisions, behavioral specs, and implementation tasks that the agent follows step by step. This guide walks you through the folder structure, the document templates for each phase, and the custom slash commands that tie it all together.

Project Setup
1. Folder Structure
2. doc/architecture/architecture.md
3. doc/architecture/coding_style.md
4. doc/architecture/documentation_style.md
5. doc/architecture/tech_stack.md
6. doc/architecture/project_structure.md
Per-Ticket Workflow
1. Create the ticket folder
2. doc/issues/<ticket-id>/requirements.md
3. doc/issues/<ticket-id>/arch.md
4. doc/issues/<ticket-id>/specs.md
5. doc/issues/<ticket-id>/*task.md
Cortex Code Skills
1. doc/.cortex/skills/dspi-design/SKILL.md

1. Project Setup

Before starting any tickets, set up the project-level documentation. These files are written once and updated as the project evolves. They provide context that Cortex Code reads when working on any ticket.

Folder Structure

doc/                                # Everything lives here
  DSPI_GUIDE.md                     # This guide
  architecture/                     # Project-level documentation
    architecture.md
    coding_style.md
    documentation_style.md
    tech_stack.md
    project_structure.md
  issues/                           # Per-ticket folders
    <ticket-id>/
      requirements.md
      specs.md
      arch.md
      01_task.md
      02_task.md
      ...
  .cortex/                          # Cortex Code skills
    skills/
      dspi-design/SKILL.md
      dspi-spec/SKILL.md
      dspi-plan/SKILL.md
      dspi-implement/SKILL.md
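
The layout above can also be created by a small script. The following is a minimal sketch, assuming the file and folder names from the tree; the function name `scaffold_dspi` is made up for this example and is not part of DSPI or Cortex Code. It creates empty placeholder files and never overwrites existing ones.

```python
import tempfile
from pathlib import Path

# Project-level documents and skill names, copied from the tree above.
ARCHITECTURE_DOCS = [
    "architecture.md",
    "coding_style.md",
    "documentation_style.md",
    "tech_stack.md",
    "project_structure.md",
]
SKILLS = ["dspi-design", "dspi-spec", "dspi-plan", "dspi-implement"]


def scaffold_dspi(root: Path) -> list[Path]:
    """Create the DSPI layout under root and return the newly created files."""
    doc = root / "doc"
    targets = [doc / "DSPI_GUIDE.md"]
    targets += [doc / "architecture" / name for name in ARCHITECTURE_DOCS]
    targets += [doc / ".cortex" / "skills" / skill / "SKILL.md" for skill in SKILLS]

    # Per-ticket folders are added later, so only the parent folder is created here.
    (doc / "issues").mkdir(parents=True, exist_ok=True)

    created = []
    for path in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite existing documentation
            path.touch()
            created.append(path)
    return created


# Demo in a temporary directory so that nothing is written to the current repo.
with tempfile.TemporaryDirectory() as tmp:
    print(len(scaffold_dspi(Path(tmp))))  # 10
```

To scaffold a real repository, call `scaffold_dspi(Path("."))` from the repository root and then fill in the templates described below.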

doc/architecture/architecture.md

High-level system architecture. Cortex Code reads this to understand how components fit together.

# Architecture

## Overview
<!-- One paragraph describing the system at a high level. -->

## Components
<!-- List each major component/service with a short description. -->

## Data Flow
<!-- Describe how data moves through the system. ASCII diagrams are fine. -->

## Key Decisions
<!-- Document architectural decisions and why they were made. -->

doc/architecture/coding_style.md

Coding conventions the team follows. Cortex Code uses this to generate code that matches your style.

# Coding Style

## Language Conventions
<!-- Language-specific rules: naming, formatting, imports. -->

## Patterns
<!-- Preferred patterns: error handling, logging, dependency injection, etc. -->

## Anti-Patterns
<!-- Things to avoid. Be specific. -->

## Examples
<!-- Short before/after examples showing preferred style. -->

doc/architecture/documentation_style.md

How to write documentation, comments, and commit messages.

# Documentation Style

## Code Comments
<!-- When to comment, when not to. Inline vs block. -->

## Commit Messages
<!-- Format, conventions, examples. -->

## File/Module Documentation
<!-- What goes at the top of each file or module. -->

doc/architecture/tech_stack.md

Technologies used and why.

# Tech Stack

## Languages
<!-- Languages and versions. -->

## Frameworks & Libraries
<!-- Key dependencies and their purpose. -->

## Infrastructure
<!-- Databases, message queues, cloud services, CI/CD. -->

## Development Tools
<!-- Linters, formatters, test frameworks. -->

doc/architecture/project_structure.md

Map of the repository so Cortex Code knows where things live.

# Project Structure

## Directory Layout
<!-- Tree view of the repo with descriptions for each directory. -->

## Entry Points
<!-- Main entry points for the application. -->

## Configuration
<!-- Where config files live, how environments are managed. -->

## Tests
<!-- Test directory structure, how to run tests. -->

2. Per-Ticket Workflow

Every ticket gets its own folder under doc/issues/. The folder is named after the ticket ID (e.g., doc/issues/42/, doc/issues/AUTH-123/). Each file in the folder corresponds to a DSPI phase.

Create the ticket folder

mkdir -p doc/issues/<ticket-id>

doc/issues/<ticket-id>/requirements.md

Phase: before Design. Captures what needs to be built and why. Written by the developer or product owner. This is the input to the Design phase.

# Requirements: <ticket-id>

## Problem
<!-- What problem does this solve? Why does it matter? -->

## Desired Outcome
<!-- What should the system do when this is complete? -->

## Constraints
<!-- Performance, security, compatibility, scope limits. -->

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
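
Because the acceptance criteria are plain markdown checkboxes, they are easy to read programmatically, for example to list what is still open at the end of the Implement phase. This is a minimal sketch under the assumption that the section heading is exactly `## Acceptance Criteria`; the helper name `acceptance_criteria` and the sample ticket text are hypothetical.

```python
import re

# Matches markdown task-list items such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*\S)\s*$")


def acceptance_criteria(requirements_md: str) -> list[tuple[bool, str]]:
    """Return (done, text) pairs from the '## Acceptance Criteria' section."""
    criteria = []
    in_section = False
    for line in requirements_md.splitlines():
        if line.startswith("## "):
            # Any second-level heading starts a new section.
            in_section = line.strip().lower() == "## acceptance criteria"
            continue
        match = CHECKBOX.match(line) if in_section else None
        if match:
            criteria.append((match.group(1).lower() == "x", match.group(2)))
    return criteria


# Hypothetical requirements file, used only for illustration.
example = """# Requirements: 42

## Problem
Users cannot reset their password.

## Acceptance Criteria
- [x] Reset link is sent by email
- [ ] Link expires after 30 minutes
"""

for done, text in acceptance_criteria(example):
    print("done" if done else "open", "-", text)
# done - Reset link is sent by email
# open - Link expires after 30 minutes
```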

doc/issues/<ticket-id>/arch.md

Phase: Design. Ticket-level architecture. How this specific feature fits into the system. References doc/architecture/architecture.md for the big picture.

# Architecture: <ticket-id>

## Approach
<!-- How will this be built? Which components are involved? -->

## Changes to Existing Components
<!-- What existing code/services need to change? -->

## New Components
<!-- What new code/services are introduced? -->

## Data Model Changes
<!-- Schema changes, new tables, migrations. -->

## Tradeoffs
<!-- What alternatives were considered? Why was this approach chosen? -->

doc/issues/<ticket-id>/specs.md

Phase: Specify. Detailed, unambiguous specification of behavior. This is the contract -- implementation must match the spec.

# Specification: <ticket-id>

## API Contract
<!-- Endpoints, request/response schemas, status codes. -->
<!-- Or: function signatures, input/output types. -->

## Business Rules
<!-- Validation logic, conditional behavior, calculations. -->

## Edge Cases
<!-- Boundary conditions, empty states, concurrent access. -->

## Error Handling
<!-- What errors can occur? How is each handled? -->

## Security
<!-- Authentication, authorization, input sanitization. -->

doc/issues/<ticket-id>/*task.md

Phase: Plan. Individual implementation tasks, numbered for ordering. Each task is small enough to complete in one focused session. These are the steps Cortex Code executes.

Name them 01_task.md, 02_task.md, etc.

# Task 01: <short description>

## Objective
<!-- What does this task accomplish? -->

## Files to Change
<!-- List specific files that need modification or creation. -->

## Steps
1. Step one
2. Step two
3. Step three

## Verification
<!-- How to confirm this task is done correctly. -->
<!-- Commands to run, tests to pass, behavior to check. -->

## Dependencies
<!-- Which other tasks must be complete before this one? -->
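
Since the numeric prefix defines the execution order, it can help to sort task files by number rather than by name, so that `10_task.md` is still ordered correctly if someone forgets the zero padding. The sketch below assumes the `NN_task.md` naming shown above; `ordered_tasks` and `next_task_name` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumed naming scheme: a numeric prefix followed by "_task.md".
TASK_NAME = re.compile(r"^(\d+)_task\.md$")


def ordered_tasks(ticket_dir: Path) -> list[Path]:
    """Return the task files of a ticket folder, sorted by numeric prefix."""
    numbered = []
    for path in ticket_dir.iterdir():
        match = TASK_NAME.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered, key=lambda item: item[0])]


def next_task_name(ticket_dir: Path) -> str:
    """Return the file name for the next task, e.g. '03_task.md' after two tasks."""
    tasks = ordered_tasks(ticket_dir)
    last = int(TASK_NAME.match(tasks[-1].name).group(1)) if tasks else 0
    return f"{last + 1:02d}_task.md"
```

For a folder that contains `01_task.md` and `02_task.md`, `next_task_name` returns `03_task.md`; other files such as `requirements.md` are ignored.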

3. Cortex Code Skills

These skills automate each DSPI phase. Install them by creating the files under doc/.cortex/skills/ in your repo.

mkdir -p doc/.cortex/skills/dspi-design
mkdir -p doc/.cortex/skills/dspi-spec
mkdir -p doc/.cortex/skills/dspi-plan
mkdir -p doc/.cortex/skills/dspi-implement

doc/.cortex/skills/dspi-design/SKILL.md

---
name: dspi-design
description: "Design phase: read requirements, produce architecture. Triggers: design, architecture, design phase."
tools: ["bash", "read", "write", "edit", "glob", "grep"]
---

# DSPI Design Phase

## Input
- `doc/issues/<ticket-id>/requirements.md`
- `doc/architecture/architecture.md` (system context)
- `doc/architecture/tech_stack.md` (technology constraints)

## Workflow
1. Read the requirements file for the given ticket
2. Read `doc/architecture/architecture.md` and `doc/architecture/tech_stack.md` for system context
3. Explore the codebase to understand existing patterns and components involved
4. Write `doc/issues/<ticket-id>/arch.md` covering:
   - Approach and rationale
   - Changes to existing components
   - New components introduced
   - Data model changes
   - Tradeoffs considered

## Stopping Points
- After reading requirements: confirm understanding with user
- After drafting arch.md: get approval before finalizing

## Rules
- Do not write any application code in this phase
- Reference specific files and line numbers in the codebase when describing changes
- Keep the architecture document concise -- one page maximum

### doc/.cortex/skills/dspi-spec/SKILL.md

```markdown
---
name: dspi-spec
description: "Spec phase: read architecture, produce detailed specification. Triggers: spec, specify, specification."
tools: ["bash", "read", "write", "edit", "glob", "grep"]
---

# DSPI Spec Phase

## Input
- `doc/issues/<ticket-id>/requirements.md`
- `doc/issues/<ticket-id>/arch.md`
- `doc/architecture/coding_style.md` (style constraints)

## Workflow
1. Read the requirements and architecture for the given ticket
2. Read `doc/architecture/coding_style.md` for conventions
3. Explore the codebase for existing interfaces, types, and patterns to follow
4. Write `doc/issues/<ticket-id>/specs.md` covering:
   - API contracts or function signatures with exact types
   - Business rules and validation logic
   - Edge cases and boundary conditions
   - Error handling behavior
   - Security considerations
5. Map each acceptance criterion from requirements.md to a specific spec section

## Stopping Points
- After reading arch.md: confirm scope with user
- After drafting specs.md: review for completeness before finalizing

## Rules
- Do not write any application code in this phase
- Every acceptance criterion must be traceable to a spec section
- Be precise -- specs are the contract that implementation is verified against
- Include concrete examples for complex business rules
```

### doc/.cortex/skills/dspi-plan/SKILL.md

```markdown
---
name: dspi-plan
description: "Plan phase: read spec, produce ordered task files. Triggers: plan, tasks, break down, task breakdown."
tools: ["bash", "read", "write", "edit", "glob", "grep"]
---

# DSPI Plan Phase

## Input
- `doc/issues/<ticket-id>/specs.md`
- `doc/issues/<ticket-id>/arch.md`
- `doc/architecture/project_structure.md` (where files live)

## Workflow
1. Read the spec and architecture for the given ticket
2. Read `doc/architecture/project_structure.md` to understand repo layout
3. Explore the codebase to identify exact files and functions to change
4. Break the spec into ordered implementation tasks
5. Write task files as `doc/issues/<ticket-id>/01_task.md`, `02_task.md`, etc.
   Each task file contains:
   - Clear objective
   - Specific files to change
   - Step-by-step instructions
   - Verification criteria
   - Dependencies on other tasks

## Stopping Points
- After reading spec: confirm task granularity with user (fewer large tasks vs many small ones)
- After writing all task files: review the full plan before implementation

## Rules
- Do not write any application code in this phase
- Each task should be completable in one focused session
- Tasks must be ordered so dependencies are respected
- Every spec requirement must be covered by at least one task
- Include a verification step in every task
```

### doc/.cortex/skills/dspi-implement/SKILL.md

```markdown
---
name: dspi-implement
description: "Implement phase: execute tasks in order, verify against spec. Triggers: implement, build, code, execute tasks."
tools: ["bash", "read", "write", "edit", "glob", "grep"]
---

# DSPI Implement Phase

## Input
- `doc/issues/<ticket-id>/specs.md` (the contract)
- `doc/issues/<ticket-id>/arch.md` (the design)
- `doc/issues/<ticket-id>/*task.md` (ordered tasks)
- `doc/architecture/coding_style.md` (style rules)

## Workflow
1. Read the spec, architecture, and all task files for the given ticket
2. Read `doc/architecture/coding_style.md` for conventions
3. Create a todo list from the task files
4. For each task, in order:
   a. Read the task file
   b. Implement the changes described
   c. Run the verification steps from the task file
   d. Mark the task as complete
5. After all tasks are done:
   a. Run the full test suite
   b. Verify every acceptance criterion from requirements.md
   c. Verify implementation matches specs.md

## Stopping Points
- Before starting: confirm the task list with user
- After each task: report what was done and verification results
- After all tasks: final review against spec

## Rules
- Follow `doc/architecture/coding_style.md` strictly
- Do not deviate from the spec without explicit approval
- If a task reveals a gap in the spec, stop and flag it rather than guessing
- Run verification after every task, not just at the end
```


---

## 4. Workflow Walkthrough

Here is the complete flow for a single ticket.

### Step 1: Write Requirements

Create the ticket folder and write what needs to be built.

```bash
mkdir -p doc/issues/42
```

Write `doc/issues/42/requirements.md` with the problem, desired outcome, constraints, and acceptance criteria.

### Step 2: Design

Use the design skill to generate the architecture.

```shell
$dspi-design Ticket 42 -- @doc/issues/42/requirements.md
```

Cortex Code will:
- Read your requirements
- Explore the codebase
- Write `doc/issues/42/arch.md`
- Ask for your approval

Review the architecture. Edit if needed. Move on when satisfied.

### Step 3: Specify

Use the spec skill to generate the detailed specification.

```shell
$dspi-spec Ticket 42 -- @doc/issues/42/requirements.md @doc/issues/42/arch.md
```

Cortex Code will:
- Read the requirements and architecture
- Explore existing code for patterns
- Write `doc/issues/42/specs.md`
- Ask for your review

Review the spec carefully. This is the contract -- implementation will be verified against it.

### Step 4: Plan

Use the plan skill to break the spec into tasks.

```shell
$dspi-plan Ticket 42 -- @doc/issues/42/specs.md @doc/issues/42/arch.md
```

Cortex Code will:
- Read the spec and architecture
- Identify files to change
- Write `doc/issues/42/01_task.md`, `02_task.md`, etc.
- Present the full task list for approval

Review the tasks. Adjust granularity if needed.

### Step 5: Implement

Use the implement skill to execute the plan.

```shell
$dspi-implement Ticket 42 -- @doc/issues/42/specs.md @doc/issues/42/arch.md
```

Cortex Code will:
- Read all specs and task files
- Create a tracked todo list
- Implement each task in order
- Run verification after each task
- Do a final check against the spec

### Step 6: Review

After implementation, review the changes against the spec.

```plaintext
Review the changes in this session against @doc/issues/42/specs.md -- flag any deviations.
```

---

## 5. Quick Reference

### Commands

| Phase | Command |
|---|---|
| Setup ticket | `mkdir -p doc/issues/<id>` then write `requirements.md` |
| Design | `$dspi-design Ticket <id> -- @doc/issues/<id>/requirements.md` |
| Specify | `$dspi-spec Ticket <id> -- @doc/issues/<id>/requirements.md @doc/issues/<id>/arch.md` |
| Plan | `$dspi-plan Ticket <id> -- @doc/issues/<id>/specs.md @doc/issues/<id>/arch.md` |
| Implement | `$dspi-implement Ticket <id> -- @doc/issues/<id>/specs.md @doc/issues/<id>/arch.md` |

### Phase Gates

| Phase | Produces | Gate |
|---|---|---|
| Requirements | `requirements.md` | Developer/PO writes and reviews |
| Design | `arch.md` | Review architecture before proceeding |
| Specify | `specs.md` | Review spec -- this is the contract |
| Plan | `*task.md` files | Review task breakdown and ordering |
| Implement | Code changes | Verify against spec and acceptance criteria |
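
The "Produces" column can be turned into a simple check that tells you which phase a ticket is waiting on. This is only a sketch: it looks at which files exist, which says nothing about whether a human has reviewed and approved them, so the gate itself stays a manual step. The function name `next_phase` is an assumption made for this example.

```python
from pathlib import Path

# Phase name and the artifact it produces, taken from the table above.
PHASE_ARTIFACTS = [
    ("Requirements", "requirements.md"),
    ("Design", "arch.md"),
    ("Specify", "specs.md"),
    ("Plan", "*task.md"),
]


def next_phase(ticket_dir: Path) -> str:
    """Return the first phase whose artifact is missing, or 'Implement' if all exist."""
    for phase, pattern in PHASE_ARTIFACTS:
        if not any(ticket_dir.glob(pattern)):
            return phase
    return "Implement"
```

For example, `next_phase(Path("doc/issues/42"))` returns `"Design"` as long as the folder contains `requirements.md` but no `arch.md`.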

### File Reference for @ Context

When prompting Cortex Code, use `@` to inject relevant files:

```plaintext
@doc/architecture/architecture.md      # System-level context
@doc/architecture/coding_style.md      # Style rules
@doc/architecture/tech_stack.md        # Technology constraints
@doc/architecture/project_structure.md # Repo layout
@doc/issues/42/requirements.md         # What to build
@doc/issues/42/arch.md                 # How to build it
@doc/issues/42/specs.md                # Detailed behavior contract
@doc/issues/42/01_task.md              # Current task
```

### Key Principle

**Each phase is a gate.** Do not skip phases. Do not move forward until the current phase is reviewed and approved. The documents are the contract -- Cortex Code implements what the documents say, not what it assumes.

A Brief Terminology of Insurance Claims Adjustment

Johann Hagerer — Fri, 03 Apr 2026 11:58:45 +0000

| Term | German | Definition |
|---|---|---|
| Claims adjustment | Schadenregulierung | The end-to-end process of investigating, evaluating, and settling an insurance claim, from first notice of loss through final payment or denial. |
| Coverage review | Deckungsübersicht / Deckungsprüfung (allgemein) | A general review of what a policy covers, often performed proactively or as an initial screening step before a detailed determination. |
| Clause verification | Klauselprüfung | Checking whether a specific policy clause (e.g., an exclusion, sublimit, or condition precedent) applies to a given set of facts. |
| Policy interpretation | Vertragsauslegung | Resolving the meaning of ambiguous or disputed policy language, often guided by legal principles such as the Unklarheitenregel (contra proferentem). Also used as a named NLP task in the LegalBench benchmark. |
| Coverage determination | Materielle Deckungsprüfung / Deckungsentscheidung | The substantive decision of whether a policy covers a specific claim, integrating clause verification, policy interpretation, and the facts of the loss into a definitive outcome (covered / not covered / partially covered). |

How Should Students Document AI Usage in Academic Work?

Johann Hagerer — Thu, 26 Mar 2026 10:19:29 +0000

As AI tools become deeply embedded in how we write, code, and think, universities are grappling with a deceptively hard question: how do you regulate something that's invisible by design?

The Institute of Statistics at LMU Munich recently published guidelines for AI tool usage in academic work that I think strike a remarkably pragmatic balance. Rather than banning AI or pretending it doesn't exist, they treat it as what it is — a tool that needs the same transparency we already expect for other tools and sources. I want to walk through the key ideas here, because I believe they're relevant well beyond academia.

The Core Philosophy: Responsibility and Transparency

The guidelines rest on two pillars that are hard to argue with.

Responsibility. Students bear full responsibility for every word they submit, regardless of which tools helped produce it. If you can't explain it, you shouldn't submit it. This applies to prose and program code equally.

Transparency. AI tool usage must be documented. This isn't some new bureaucratic burden — it's the natural extension of existing academic practice. We already cite sources, disclose co-author contributions, and list our tools. AI assistance is just the next entry in that list.

What Documentation Actually Looks Like

Every academic work must include a dedicated "AI Tools" section of roughly 0.25–1 page. Think of it as the "author contributions" statement you'd find in a multi-author dissertation, but for your AI interactions. The section should cover three things:

Which AI tools were used and how they were generally applied
Which sections of the work involved AI assistance
How extensively AI-generated content was revised

Additionally, when an AI tool contributed an essential line of thought — not just phrasing, but actual reasoning — that should be flagged via footnotes, similar to how you'd cite a human source.

The Appropriate vs. Inappropriate Divide

This is where it gets interesting. The guidelines don't draw a binary "AI allowed / AI forbidden" line. Instead, they define a spectrum.

Generally appropriate:

Linguistic and stylistic corrections
Translation assistance
Topic exploration and literature discovery (AI as a tutor)
Structuring support
Verifying algebraic transformations or integral solutions
Programming support
Creating graphics and diagrams

Generally inappropriate:

Direct adoption of AI-generated text or code without genuine understanding — especially for content-central parts like core theoretical sections, literature reviews, or key algorithm implementations
Using AI for the main data analysis or interpretation
Any undocumented AI usage

The nuance matters here. Using an LLM to scaffold your code and then refactoring it with understanding? Fine. Pasting in a prompt, copying out the result, and submitting it without being able to explain what it does? That's the line.

And there's a deliberate escape valve: students are encouraged to discuss the boundaries with their supervisors, because what counts as "appropriate" depends heavily on context.

Assessment: Quality Over Origin

One of the more refreshing aspects of these guidelines is the explicit statement that documented AI usage, in accordance with the rules, does not lead to a grade penalty. The quality of the work remains the primary criterion.

That said, there's a balancing mechanism: the oral defense or examination carries increased weight. You need to be able to explain and defend every aspect of your work in detail. This elegantly solves the verification problem — if you can't walk through your own code or reasoning in person, the documentation won't save you.

If AI usage goes undocumented, consequences range from grade reductions to formal deception charges.

Practical Documentation Examples

The guidelines include several example statements that are worth reading for the tone they set. They all follow a consistent pattern: "I wrote all parts of this work myself. Additionally, I used [tool] for [purpose]. The output was [reviewed/adapted/not adopted verbatim]."

A few paraphrased examples of what good documentation looks like:

"I used ChatGPT to refine individual sentences in the introduction. Suggestions were manually reviewed, adapted, and reformulated — not adopted verbatim."
"I used GitHub Copilot to check and optimize my programming scripts for XY. Suggestions for improving runtime and memory usage were comprehended, tested, and adopted where appropriate. All adoptions are documented in code comments."
"I used Claude Sonnet 4 with web search to discover relevant sources and have theoretical concepts explained. Based on this, I consulted the primary literature directly and created my own summaries."

Notice the pattern: specific tool, specific use case, specific description of how the output was handled. No vague hand-waving.

A Taxonomy Worth Stealing

The guidelines also include detailed taxonomies for different categories of AI usage. I think these are useful reference material for anyone documenting AI-assisted work, not just students.

In writing: grammar checking, citation management, plagiarism detection, formatting, style improvement, paraphrasing, translation, literature review drafting, source summarization, content expansion, section composition, and simulated peer review.

In programming: inline completion, prompt-to-code generation, project scaffolding, code explanation, documentation generation, debugging, test generation, refactoring, performance optimization, API usage examples, dependency management, AI pair programming, and simulated code review.

In mathematics: symbolic manipulation, step-by-step tutorials, visualization, formula translation, conjecture generation, proof sketch drafting, formal proof synthesis, counterexample search, proof verification, and auto-formalization of natural language into formal logic.

These categories aren't exhaustive, and they're explicitly not all automatically "appropriate" — they're a vocabulary for describing what you did.

Why This Matters Beyond Academia

I spend a lot of my time building evaluation and observability infrastructure for LLM workflows. One thing that's become clear is that the documentation problem isn't unique to universities. Any team using LLMs in production faces the same question: how do we track what the model contributed vs. what a human decided?

The LMU approach — structured documentation, clear responsibility, quality-first assessment, and verification through explanation — maps surprisingly well onto engineering practices. Swap "oral defense" for "code review where you explain your PR," and "AI Tools section" for "commit messages and PR descriptions that disclose AI assistance," and you're most of the way there.

The key insight is that transparency isn't about restricting AI usage. It's about maintaining accountability. And that's a principle that scales from a bachelor's thesis to a production ML pipeline.

The full original guidelines (in English) are available from the Institute of Statistics at LMU Munich. If your institution is working on similar policies, they're worth reading in full — they're one of the more thoughtful takes I've seen on this topic.

LLM Non-Determinism: What Providers Guarantee, and How to Build Around It

Johann Hagerer — Wed, 25 Mar 2026 17:15:35 +0000

This post is based on sections 3--5 of "Understanding why deterministic output from LLMs is nearly impossible" by Shuveb Hussain at Unstract (Oct. 2025). The material has been rewritten and extended with practical code examples using Pydantic and Snowflake Cortex.

Motivation

When you build a pipeline that sends the same document through an LLM twice and expects the same structured output both times, you will eventually be surprised. Not because you've made a mistake, but because LLMs are fundamentally non-deterministic: the same prompt can produce different tokens across runs, even when you set temperature=0.

The root cause is that modern LLMs run on massively parallel GPU hardware, where floating-point arithmetic is not associative. The order in which thousands of parallel threads accumulate intermediate values is not guaranteed to be identical run-to-run, so the final token probability scores shift by tiny amounts. When two candidate tokens are close in probability, that tiny shift can flip which one gets selected. Because LLMs generate text auto-regressively --- each token conditions all subsequent tokens --- a single early flip can cascade into a structurally different response by the tenth token.
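
The non-associativity is easy to reproduce on a CPU in plain Python. The sketch below adds the same three numbers in two different orders and then feeds both results into a toy greedy token picker. The scores and candidate names are made up for illustration; a real model compares thousands of logits, but the mechanism is the same.

```python
# The same three contributions, accumulated in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6


def pick_token(scores: dict[str, float]) -> str:
    """Greedy decoding: return the highest-scoring token (first one wins a tie)."""
    return max(scores, key=scores.get)


# Hypothetical scores for two candidate tokens that are nearly tied.
print(pick_token({"amount": 0.6, "total_amount": left_to_right}))  # total_amount
print(pick_token({"amount": 0.6, "total_amount": right_to_left}))  # amount
```

The two sums differ only in the last bit, yet that is enough to change which token wins.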

You cannot eliminate this. What you can do is design your system so that it is robust to it. The rest of this article shows you how, concretely, using Pydantic and Snowflake Cortex.

What Providers Actually Promise

No major LLM provider guarantees deterministic output. Snowflake Cortex, which hosts models like mistral-large2 and claude-3-5-sonnet via the complete function in snowflake-ml-python, is no exception. You can observe the problem directly:

from snowflake.cortex import complete
from snowflake.snowpark.context import get_active_session

session = get_active_session()

prompt = (
    "Extract the vendor name and total amount from this invoice: "
    "Acme Corp, Invoice #1042, Total: EUR 3,200.00. "
    "Respond in JSON."
)

run_1 = complete("mistral-large2", prompt, session=session)
run_2 = complete("mistral-large2", prompt, session=session)

print(run_1)  # {"vendor": "Acme Corp", "total_amount": 3200.00}
print(run_2)  # {"vendor": "Acme Corp", "amount": 3200.0}  ← different field name

The values agree, but the field names differ. A downstream system parsing total_amount will fail silently on the second response.
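
To make the silent failure concrete, here is a small sketch that uses only the standard library and the two responses from above as literal strings. The helper names `read_total` and `schema_drift` are hypothetical; the point is that a lenient lookup returns `None` without any error, while a comparison of the key sets exposes the drift.

```python
import json

# The two responses from the example above, as raw JSON strings.
run_1 = '{"vendor": "Acme Corp", "total_amount": 3200.00}'
run_2 = '{"vendor": "Acme Corp", "amount": 3200.0}'


def read_total(raw: str):
    """Lenient lookup: returns None instead of raising when the key is absent."""
    return json.loads(raw).get("total_amount")


print(read_total(run_1))  # 3200.0
print(read_total(run_2))  # None


def schema_drift(raw_a: str, raw_b: str) -> set[str]:
    """Return the field names that appear in only one of the two responses."""
    return set(json.loads(raw_a)) ^ set(json.loads(raw_b))


print(sorted(schema_drift(run_1, run_2)))  # ['amount', 'total_amount']
```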

The reasons providers haven't solved this are pragmatic. Deterministic GPU operations are substantially slower, routing across a distributed fleet makes identical hardware execution nearly impossible, and most applications tolerate small variations gracefully. The seed parameter, where available, only controls sampling randomness --- it does nothing about floating-point drift from parallel reductions, which is the dominant source of variation at temperature=0.

The practical takeaway: treat non-determinism as a fixed environmental property, like network latency. You don't eliminate it; you engineer around it.

Best Practices for Structured Extraction

Enforce Structure with `response_format` and Pydantic

The most effective tool available is Snowflake Cortex's structured output feature. You define your expected output as a Pydantic BaseModel, convert it to a JSON schema with .model_json_schema(), and pass it to complete via CompleteOptions. The model is then constrained to emit output conforming to that schema, and you validate the result back into your Pydantic object.

import json
from pydantic import BaseModel, Field
from snowflake.cortex import complete, CompleteOptions
from snowflake.snowpark.context import get_active_session

class InvoiceExtraction(BaseModel):
    vendor_name: str = Field(description="Full legal name of the vendor")
    invoice_number: str = Field(description="Invoice identifier, e.g. INV-1042")
    total_amount: float = Field(description="Total amount due, numeric only")
    currency: str = Field(description="ISO 4217 currency code, e.g. EUR")

session = get_active_session()

options = CompleteOptions(
    temperature=0,
    response_format={
        "type": "json",
        "schema": InvoiceExtraction.model_json_schema()
    }
)

prompt = "Extract structured data from: Acme Corp, Invoice #1042, Total: EUR 3,200.00"

raw = complete(
    model="mistral-large2",
    prompt=prompt,
    session=session,
    options=options,
)

result = InvoiceExtraction.model_validate_json(raw)

print(result.vendor_name)    # "Acme Corp"
print(result.total_amount)   # 3200.0
print(result.currency)       # "EUR"

The response_format eliminates the most common failure mode: field name variation. The model no longer chooses between total_amount, amount, and sum --- the schema decides. The subsequent model_validate_json call gives you a fully typed Python object and raises a ValidationError if anything is malformed.

Write Unambiguous Schemas

The schema is a contract. Vague field descriptions produce vague outputs. Be explicit about types, formats, and what to do when data is missing. Note that Snowflake Cortex has some schema constraints --- numeric range keywords like minimum/maximum are not supported, and property names may only contain letters, digits, hyphens, and underscores.
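Constraints like these are easier to catch before the request is sent. The following is a minimal sketch of a pre-flight check, assuming only the two rules named above (no minimum/maximum, restricted property names); the function name find_schema_issues is hypothetical and the keyword list is not an exhaustive account of what Cortex rejects.

```python
import re

# Rules as stated in the text above; not an exhaustive list.
UNSUPPORTED_KEYWORDS = {"minimum", "maximum"}
VALID_PROPERTY_NAME = re.compile(r"[A-Za-z0-9_-]+")


def find_schema_issues(schema: dict, path: str = "$") -> list[str]:
    """Recursively collect constraint violations in a JSON schema dict."""
    issues: list[str] = []
    for key, value in schema.items():
        if key in UNSUPPORTED_KEYWORDS:
            issues.append(f"{path}: unsupported keyword '{key}'")
        if key in ("properties", "$defs") and isinstance(value, dict):
            # Keys here are property or definition names, not schema keywords.
            for name, sub_schema in value.items():
                if key == "properties" and not VALID_PROPERTY_NAME.fullmatch(name):
                    issues.append(f"{path}: invalid property name '{name}'")
                if isinstance(sub_schema, dict):
                    issues.extend(find_schema_issues(sub_schema, f"{path}.{name}"))
        elif isinstance(value, dict):
            # Nested schema such as "items"
            issues.extend(find_schema_issues(value, f"{path}.{key}"))
        elif isinstance(value, list):
            # Schema lists such as "anyOf"
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    issues.extend(find_schema_issues(item, f"{path}.{key}[{index}]"))
    return issues


schema = {
    "type": "object",
    "properties": {
        "total amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string"},
    },
}

issues = find_schema_issues(schema)
for issue in issues:
    print(issue)
```

In practice the input would be the dict returned by model_json_schema(), checked once at import time or in a unit test.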

from pydantic import BaseModel, Field
from typing import Optional

class LineItem(BaseModel):
    description: str = Field(description="Product or service name, as written on the invoice")
    quantity: int = Field(description="Number of units. Convert 'dozen' to 12, 'pair' to 2.")
    unit_price: float = Field(description="Price per unit in the invoice currency, numeric only")

class InvoiceExtraction(BaseModel):
    vendor_name: str = Field(description="Full legal name of the vendor")
    invoice_date: str = Field(description="Invoice date in YYYY-MM-DD format. Convert all date formats.")
    total_amount: float = Field(description="Total amount due, numeric only, no currency symbol")
    currency: str = Field(description="ISO 4217 currency code, e.g. EUR, USD, GBP")
    line_items: list[LineItem] = Field(description="All line items listed on the invoice")
    purchase_order_number: Optional[str] = Field(
        default=None,
        description="PO reference number if present on the invoice, otherwise null"
    )

The Optional[str] with default=None on purchase_order_number is important. Without it, the model might hallucinate a PO number when none is present rather than omit the field --- a subtle but production-breaking behavior.

Anchor Behavior with Few-Shot Examples

Schema constraints define the structure; few-shot examples define the behavior within that structure. They are especially valuable for edge cases: relative dates, quantity words like "dozen", and optional fields that should be null. With a simple string prompt, the examples go directly into the prompt text.

from snowflake.cortex import complete, CompleteOptions
from snowflake.snowpark.context import get_active_session

session = get_active_session()

options = CompleteOptions(
    temperature=0,
    response_format={
        "type": "json",
        "schema": InvoiceExtraction.model_json_schema()
    }
)

def build_prompt(invoice_text: str) -> str:
    return f"""You extract structured invoice data. Respond in JSON.

Example 1:
Input: "From: Bolt GmbH | Ref: RE-2024-991 | Date: yesterday | 5x Widget A @ EUR 12.50 | Total: EUR 62.50"
Output: vendor_name="Bolt GmbH", invoice_date="2025-03-24", total_amount=62.50,
        currency="EUR", line_items=[{{description="Widget A", quantity=5, unit_price=12.50}}],
        purchase_order_number=null

Example 2:
Input: "DataPipe Inc | Invoice 5531 | PO: PO-8821 | 1 doz. API calls @ USD 0.01 | USD 0.12 total"
Output: vendor_name="DataPipe Inc", invoice_date=null, total_amount=0.12,
        currency="USD", line_items=[{{description="API calls", quantity=12, unit_price=0.01}}],
        purchase_order_number="PO-8821"

Now extract from:
{invoice_text}"""

invoice_text = "Acme Corp, Invoice #1042, Total: EUR 3,200.00"  # sample input

raw = complete(
    model="mistral-large2",
    prompt=build_prompt(invoice_text),
    session=session,
    options=options,
)

result = InvoiceExtraction.model_validate_json(raw)

Without the "dozen → 12" example, the model might return quantity=1 with description="dozen API calls". With it, the conversion is consistent.

Measure Variance to Drive Improvement

Variance you don't measure is variance you can't improve. Run the same documents through your pipeline multiple times and compare the results. High variance on specific fields points directly at under-specified Field(description=...) strings or missing few-shot examples.

from collections import Counter

def measure_extraction_variance(
    invoice_text: str,
    n_runs: int = 5,
    session=None,
) -> dict:

    results = []
    for _ in range(n_runs):
        result = extract_with_retry(invoice_text, session=session)
        if result:
            results.append(result.model_dump())

    if not results:
        return {"error": "all runs failed"}

    variance_report = {}
    for field in results[0]:
        values = [str(r[field]) for r in results]
        top_value, top_count = Counter(values).most_common(1)[0]
        variance_report[field] = {
            "stable": len(set(values)) == 1,
            "unique_values": list(set(values)),
            "agreement_rate": top_count / n_runs,
        }

    return variance_report

# Example output:
# {
#   "vendor_name":  {"stable": True,  "agreement_rate": 1.0,  ...},
#   "invoice_date": {"stable": False, "agreement_rate": 0.6,  ...},  # ← needs attention
# }

A field with agreement_rate < 0.8 across five runs is a direct signal to improve its description or add a targeted few-shot example. This feedback loop is how schemas and prompts mature over time.
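To turn that rule of thumb into something a test suite can act on, a small helper can list the fields that fall below the threshold. This is a sketch only: the helper name fields_needing_attention is hypothetical, and the report is hand-written in the shape that measure_extraction_variance returns on success (the error case is not handled here).

```python
def fields_needing_attention(report: dict, threshold: float = 0.8) -> list[str]:
    """Return field names whose agreement rate is below the threshold, worst first."""
    flagged = [
        (stats["agreement_rate"], field)
        for field, stats in report.items()
        if stats["agreement_rate"] < threshold
    ]
    return [field for _, field in sorted(flagged)]


# Hand-written report in the shape produced by measure_extraction_variance.
report = {
    "vendor_name": {"stable": True, "agreement_rate": 1.0},
    "invoice_date": {"stable": False, "agreement_rate": 0.6},
    "total_amount": {"stable": False, "agreement_rate": 0.8},
    "currency": {"stable": False, "agreement_rate": 0.4},
}

print(fields_needing_attention(report))  # ['currency', 'invoice_date']
```

Sorting worst-first keeps the attention on the field whose description or few-shot coverage most needs work.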

The Right Mental Model

The most useful shift you can make is to stop treating non-determinism as a bug waiting to be fixed and start treating it as a property of the environment --- one you design around, not against.

The analogy from distributed systems is apt. TCP/IP is built on top of unreliable packet delivery; the reliability lives in the protocol layer, not in the assumption of a perfect physical network. A reliable LLM pipeline puts its correctness guarantees in the surrounding system --- Pydantic validation, retry logic, normalization, business rules --- not in the assumption that the model will always produce identical output.

With Snowflake Cortex and snowflake-ml-python, this maps to a clear, implementable layering:

Raw document
      ↓
complete(prompt=..., options=CompleteOptions(response_format=...))   ← flexibility lives here
      ↓
model_validate_json() + retry on ValidationError                     ← structure enforced here
      ↓
NormalizedInvoice model_validator                                    ← canonical field names here
      ↓
Business logic / Snowflake table                                     ← determinism lives here

The complete call is allowed to be flexible --- that flexibility is precisely what lets it handle invoice formats you have never seen before. Every layer below it progressively tightens the guarantees, so that by the time data reaches a Snowflake table, it is in a predictable, validated shape regardless of what the model happened to call any given field on that particular run.
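The "validate, then retry" layer can be sketched without any Snowflake or Pydantic dependency. In the example below, a hand-written check stands in for model_validate_json and a scripted callable stands in for complete; the names validate and extract_with_validation, the required fields, and the canned responses are all assumptions for illustration.

```python
import json
from typing import Callable

REQUIRED_FIELDS = {"vendor_name", "total_amount"}


def validate(raw: str) -> dict:
    """Stand-in for model_validate_json: parse JSON and check required fields."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


def extract_with_validation(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the model until its output validates or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate(call_model())
        except ValueError as error:
            last_error = error
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Hypothetical model that drifts on the first call and conforms on the second.
responses = iter([
    '{"vendor": "Acme Corp", "amount": 3200.0}',
    '{"vendor_name": "Acme Corp", "total_amount": 3200.0}',
])

result = extract_with_validation(lambda: next(responses))
print(result)  # {'vendor_name': 'Acme Corp', 'total_amount': 3200.0}
```

The caller never sees the drifting first response: it either gets a validated dict or an explicit error, which is the guarantee the lower layers build on.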

Non-determinism is not a problem to eliminate. It is the price of flexibility, and a reasonable one --- as long as you don't ask the model to be your schema enforcer, your validator, and your business-logic layer all at once. Those jobs belong to Pydantic.

Original article: Shuveb Hussain, "Understanding why deterministic output from LLMs is nearly impossible," Unstract Blog, October 8, 2025. https://unstract.com/blog/understanding-why-deterministic-output-from-llms-is-nearly-impossible/

Knowledge Graph Extraction in Pydantic

Johann Hagerer — Fri, 14 Nov 2025 16:26:15 +0000

In this article, we explore how Pydantic's type system bridges LLM outputs, structured data, and knowledge graph concepts. If you have ever wanted to extract a full knowledge graph and know roughly what you want to extract but not how, this article is for you. Whether you're building document processing pipelines, chatbots, or data extraction workflows, understanding these patterns will help you build more robust LLM applications.

Table of contents:

Recap: Structured Output Extraction Using LLMs
Knowledge Graph Concepts
Mapping Knowledge Graphs to Pydantic
How to Extract Relationships
Performing the Actual Prompting
Conclusion

Recap: Structured Output Extraction Using LLMs

The foundation of reliable LLM data extraction is defining clear schemas. Pydantic provides an elegant way to create these schemas using Python's type hints, which can then be converted to JSON Schema for LLM consumption.

Let's start with a simple code example taken from my other article:

from pydantic import BaseModel, Field
import json
from typing import Any, Literal

class Person(BaseModel):
    """A person is a human being with the denoted attributes."""

    name: str = Field(..., 
        description="Which is the name of the person?"
    )
    age: int = Field(..., 
        description="Which is the age of the person?"
    )
    email: str = Field(..., 
        description="Which is the email of the person?"
    )
    country: Literal["Germany", "Switzerland", "Austria"] = Field(..., 
        description="In which country does the person reside?"
    )

json_schema: dict[str, Any] = Person.model_json_schema()
print(json.dumps(json_schema, indent=2))

This code defines a Person entity with four attributes. The Field descriptions tell the LLM what information to extract; they are passed to the LLM as part of the JSON Schema, together with the prompt. When converted to JSON Schema, you get:

{
  "description": "A person is a human being with the denoted attributes.",
  "properties": {
    "name": {
      "description": "Which is the name of the person?",
      "title": "Name",
      "type": "string"
    },
    "age": {
      "description": "Which is the age of the person?",
      "title": "Age",
      "type": "integer"
    },
    "email": {
      "description": "Which is the email of the person?",
      "title": "Email",
      "type": "string"
    },
    "country": {
      "description": "In which country does the person reside?",
      "title": "Country",
      "type": "string",
      "enum": ["Germany", "Switzerland", "Austria"]
    }
  },
  "required": [
    "name",
    "age",
    "email",
    "country"
  ],
  "title": "Person",
  "type": "object"
}

This JSON Schema is exactly what modern LLM APIs need to constrain their outputs. Now we can use it with an LLM to extract structured data from unstructured text:

from mistralai import Mistral

# Initialize client
client = Mistral(api_key="your-api-key")

# Make structured output request
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{
        "role": "user",
        "content": "Extract person info: John Doe is 30 years old, email: john@example.com, resides in Austria."
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Person",
            "schema": Person.model_json_schema(),
            "strict": True,
        },
    }
)

# Parse response into Pydantic model
answer: str = response.choices[0].message.content
person: Person = Person.model_validate_json(answer)
assert json.loads(answer) == person.model_dump()
print(json.dumps(person.model_dump(), indent=2))

Resulting JSON string:

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "country": "Austria"
}

Knowledge Graph Concepts

Knowledge graphs offer a powerful mental model for structuring LLM extraction tasks. At their core, knowledge graphs represent information as entities (nodes) connected by relationships (edges). This maps remarkably well to how we structure Pydantic models for LLM outputs.

A knowledge graph is a structured representation of knowledge where:

Entities are the "things" in your domain (people, organizations, products)
Attributes describe properties of entities (name, age, color)
Relationships connect entities together (Person works_at Organization)

For example, consider this simple knowledge graph:

Person: "John Doe"
  - age: 30
  - email: john@example.com
  - works_at -> Organization: "Acme Corp"

Organization: "Acme Corp"
  - founded: 2010
  - industry: "Technology"

This graph contains two entities (John Doe and Acme Corp), several attributes (age, email, founded, industry), and one relationship (works_at).
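The same graph can be written down as plain Python data, which makes the split into nodes, attributes, and edges explicit. This is only an illustrative sketch: the Entity and Relationship classes are hypothetical and are not the Pydantic models built later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    category: str                                    # entity class, e.g. "Person"
    name: str                                        # identifies this instance
    attributes: dict = field(default_factory=dict)   # attribute name -> value


@dataclass(frozen=True)
class Relationship:
    subject_name: str   # name of the source entity
    predicate: str      # relationship type, e.g. "works_at"
    object_name: str    # name of the target entity


entities = [
    Entity("Person", "John Doe", {"age": 30, "email": "john@example.com"}),
    Entity("Organization", "Acme Corp", {"founded": 2010, "industry": "Technology"}),
]
relationships = [Relationship("John Doe", "works_at", "Acme Corp")]

# Every edge must point at nodes that exist in the graph.
known_names = {entity.name for entity in entities}
for rel in relationships:
    assert rel.subject_name in known_names and rel.object_name in known_names

print(len(entities), "entities,", len(relationships), "relationship")
```

The consistency check at the end is the part worth keeping in a real pipeline: an extracted relationship that refers to an entity nobody extracted is usually a hallucination or a naming mismatch.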

Mapping Knowledge Graphs to Pydantic

When building LLM extraction pipelines, thinking in knowledge graph terms helps structure your approach:

Competency question: The question posed to the LLM in the prompt to extract an entity, an attribute, or a relationship. Example: "What organizations does this person work for?" This guides what your Pydantic model should capture. In Pydantic, it corresponds to the description parameter of a Field object.
Ontology: All definitions of entity categories, attributes, and relationships, taken together as the data model of your domain. This is your collection of Pydantic schemas.
Knowledge graph: The instance of your entities and relationships on a concrete set of data. This is the actual extracted and validated data from your documents.

By mapping these concepts, you create a clear separation between schema (ontology) and data (knowledge graph), making your extraction pipeline more maintainable and scalable.

Ontology Definition

Relationships follow the classic triplet pattern: (subject, predicate, object) - for example, (John Doe, works_at, Acme Corp). Before you start extracting, however, you need to know which types of entities, attributes and relationships you are looking for. These inform the ontology. First, we show how to persist the ontology definition as tables. Second, we show how to derive Pydantic BaseModels from it.

Persisting the Ontology

For this tutorial, we define them using three tables with the following schemas.

Entity Classes

Field Name	Data Type	Description
ENTITY_NAME	str	Name of the entity class
DESCRIPTION	str	Docstring for the entity class
FIELD_DESCRIPTION	str	Field description for the combined basemodel where we want list[EntityCategory]
ID_ATTRIBUTE	str	What is the id attribute for entity (probably going to remove this from the table)
ID_FIELD_DESCRIPTION	str	What is the field description for the id attribute (probably removing this from the table)
ID	str	Hash of the entity_name

Example:

{
    "entity_name": "DamagedObject",
    "description": "",  # No class-level docstring found; can be added if needed
    "field_description": "List of DamagedObject which have been damaged",
    "id_attribute": "id",
    "id_field_description": "Unique ID for the entity, e.g. damage_01",
}

Attribute Classes

Field Name	Data Type	Description
ATTRIBUTE_NAME	str	Name of the attribute
DTYPE	str	Data type of the attribute
FIELD_DESCRIPTION	str	The competency question for the attribute
ENTITY_NAME	str	Which entity it belongs to
ID	str	Hash of the attribute_name (note: should be based on attribute_name + entity_id)
ENTITY_ID	str	ID of the entity

Example:

{
    "attribute_name": "DamageSeverity",
    "dtype": "str",
    "field_description": "List of DamagedSeverity attributes telling how severe each damage is.",
    "entity_name": "DamagedObject",
    "id_attribute": "id",
    "id_field_description": "Unique ID for the entity, e.g. damage_01",
    "entity_id": ""
}
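The ID columns are described as hashes of names. The tables do not say which hash function is used, so the following is a sketch under that assumption: the helper stable_id is hypothetical, and it uses hashlib because Python's built-in hash() is salted per process for strings and therefore not stable across runs. It also follows the note above that an attribute ID should depend on both the attribute name and the entity ID.

```python
import hashlib


def stable_id(*parts: str, length: int = 12) -> str:
    """Derive a short, deterministic ID from one or more name parts."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]


entity_id = stable_id("DamagedObject")
attribute_id = stable_id(entity_id, "DamageSeverity")

print(entity_id)
print(attribute_id)
```

Including the entity ID in the attribute hash means two entity classes can both have an attribute called, say, "name" without their IDs colliding.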

Relationship Classes

Field Name	Data Type	Description
SUBJEKT_ENTITAET	str	Subject entity category
BEZIEHUNG_NAME	str	Relationship name
OBJEKT_ENTITAET	str	Object entity category
FIELD_DESCRIPTION	str	Description of the relationship
ID	str	Identifier for the relationship
SUBJEKT_ENTITY_ID	str	ID of the subject entity
OBJEKT_ENTITY_ID	str	ID of the object entity

Deriving Pydantic BaseModels From the Ontology

Example code for creating Pydantic BaseModel classes dynamically for each entity category:

from typing import Any
import polars as pl
from pydantic import BaseModel, Field, create_model

def build_entity_models(
    entities_df: pl.DataFrame,
    *,
    id_field_strategy: str = "entitaet_id",  # change to 'as_is' or 'prefixed' if you prefer
    id_type: type = str,  # change if your IDs are not strings
) -> dict[str, type[BaseModel]]:
    """Build Pydantic models from an entities dataframe.

    The dataframe should include the columns:
    - ENTITY_NAME
    - DESCRIPTION
    - ID_ATTRIBUTE
    - ID_FIELD_DESCRIPTION
    """
    specs: list[dict[str, Any]] = entities_df.to_dicts()
    models: dict[str, type[BaseModel]] = {}
    for spec in specs:
        doc_string: str = spec.get("DESCRIPTION", "")
        class_name: str = spec["ENTITY_NAME"]
        id_field_name: str = f"{class_name}_id"

        # Define fields: name -> (type, default or FieldInfo)
        fields: dict[str, tuple[Any, Any]] = {
            id_field_name: (id_type, Field(..., description=f"Unique ID for the entity, e.g. {class_name.lower()}_01")),
            # "kanonische_bezeichnung" = canonical label
            "kanonische_bezeichnung": (
                str,
                Field(
                    ...,
                    description="""Short, human-readable canonical label of the entity for display. Derived from the most informative mentions (e.g. name + additional info) and stable across multiple documents. Do not use as a unique key.

                    Examples: "John Smith (born 1980-04-12)", "Policy #DE-12345-2024", "Vehicle [B-AB 1234]"
                    """,
                ),
            ),
            # "aliase" = aliases
            "aliase": (
                list[str],
                Field(
                    ...,
                    description="""Set of all observed surface forms (mentions) of this entity in the documents, including name variants (spellings, abbreviations, titles) and nominal references or role descriptions, provided they unambiguously refer to this entity in context (e.g. "the supervisor", "the assessor"). Used for search and traceability; keep the original spelling and remove duplicates.

                    Examples: ["Dr. John", "Mr. John Smith", "J. Smith", "the supervisor", "the policyholder"].
                    """,
                ),
            ),
        }

        Model = create_model(
            class_name,
            __base__=BaseModel,
            __module__="dynamic_models",
            **fields,  # type: ignore[call-overload]
        )
        Model.__doc__ = doc_string
        models[class_name] = Model

    return models

Entity Extraction

Once you have a BaseModel for each type of entity, you need to be able to extract a list of each. In order to do so, you need another BaseModel as entry point, which can be defined as follows:

def build_entity_extraction_model(
    entity_models: dict[str, type[BaseModel]],
    entities_df: pl.DataFrame,
    *,
    extraction_class_name: str = "EntitaetenExtraktion",
) -> type[BaseModel]:
    """Create a Pydantic model (e.g. 'EntitiesExtraction') whose fields are lists of the individual entity models."""

    # Look up the list-level field description of each entity class
    descriptions: dict[str, str] = {
        row["ENTITY_NAME"]: row["FIELD_DESCRIPTION"] for row in entities_df.to_dicts()
    }

    # Define fields: name -> (type, FieldInfo)
    fields: dict[str, tuple[Any, Any]] = {}

    for entity_name, model in entity_models.items():
        desc = descriptions.get(entity_name) or f"List of {entity_name} entities"
        fields[entity_name] = (list[model], Field(..., description=desc))

    ExtractionModel = create_model(
        extraction_class_name,
        __base__=BaseModel,
        __module__="dynamic_models",
        **fields, # type: ignore[call-overload]
    )
    ExtractionModel.__doc__ = "Container model for extracted entities."
    return ExtractionModel

entity_models = build_entity_models(entity_types_df)
EntityExtraction: type[BaseModel] = build_entity_extraction_model(entity_models, entity_types_df)
entity_extraction_json_schema = EntityExtraction.model_json_schema()

Finally, the JSON schema of the EntityExtraction model can be passed to the structured output API of most LLM providers.

Moving Forward

Extract the entities on a piece of text as described.
Based on the extracted entities and the piece of text, you can extract attributes and relationships in a separate consecutive step.
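As a rough sketch of the second step, the prompt can carry the step-one entities along with the original text so that the model refers to existing IDs instead of inventing new ones. The function name build_relationship_prompt, the instruction wording, and the sample entities are assumptions; only the field names follow the dynamically built entity models above.

```python
import json


def build_relationship_prompt(text: str, entities: dict[str, list[dict]]) -> str:
    """Compose the second-step prompt from the source text and the step-one entities."""
    return (
        "Extract attributes and relationships for the entities listed below. "
        "Refer to entities only by the IDs given; do not invent new ones.\n\n"
        f"Entities:\n{json.dumps(entities, indent=2, ensure_ascii=False)}\n\n"
        f"Text:\n{text}"
    )


# Hypothetical step-one result, keyed by entity class name.
extracted = {
    "Person": [
        {"Person_id": "person_01", "kanonische_bezeichnung": "John Doe", "aliase": ["John"]}
    ],
    "Organization": [
        {"Organization_id": "organization_01", "kanonische_bezeichnung": "Acme Corp", "aliase": []}
    ],
}

prompt = build_relationship_prompt("John Doe works at Acme Corp.", extracted)
print(prompt)
```

Splitting the work this way keeps each schema small, and the relationship step can be validated against the set of IDs produced by the entity step.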

Conclusion

Combining Pydantic's static typing with knowledge graph thinking provides a robust framework for LLM data extraction. The structured output approach ensures type safety and validation, while knowledge graph concepts help you design comprehensive data models that capture not just entities, but the relationships between them.

As you build more complex LLM applications, this foundation becomes essential for maintaining data quality, enabling downstream analytics, and scaling your extraction pipelines across diverse document types and domains.

A Production LLMOps Architecture for Snowflake

Johann Hagerer — Tue, 11 Nov 2025 23:44:35 +0000

If you've ever hardcoded a prompt, deployed it to production, and then needed to tweak it three weeks later, you know the pain: full code deployments, service restarts, zero rollback capability, and no visibility into which version is actually running. After building LLM-powered insurance claim processing pipelines on Snowflake, I've learned that treating prompts like code is fundamentally wrong—they're artifacts that need independent versioning, deployment, and evaluation strategies. This article shares the complete architecture that solved this: using Snowflake's Model Registry as a prompt registry, deploying via Snowpark Container Services for streaming and Stored Procedures for workflows, and implementing dual evaluation with TruLens and Experiment Tracking. The result? Change prompts without touching application code, A/B test in production with confidence, and maintain full observability across your entire LLM stack—all native to Snowflake.

Architecture Breakdown

┌─────────────────────────────────────────────────────┐
│          PROMPT TEMPLATES in Model Registry         │
│  - Version control for prompts as artifacts         │
│  - Evaluated with: TruLens + ExperimentTracking     │
└────────────────────────────┬────────────────────────┘
                             │
                    ┌────────┴────────┐
                    │     Serving     │
    ┌───────────────▼───────┐  ┌──────▼─────────────┐
    │  Container Services   │  │  Stored Procedures │
    │  - Online inference   │  │  - LLM Workflows   │
    │  - Token streaming    │<─│  - Business logic  │
    │  - FastAPI endpoints  │  │                    │
    │  - Real-time apps     │  │                    │
    └───────────────┬───────┘  └──────┬─────────────┘
                    │    Analyzing    │
                    └────────┬────────┘
                             │
           ┌─────────────────▼────────────────────┐
           │   EVALUATION & OBSERVABILITY         │
           │  - Event tables and OTel for tracing │
           │  - TruLens for trace evaluation      │
           └──────────────────────────────────────┘

Why This Architecture Works

1. Prompt Templates in Model Registry

Treat prompts as versioned artifacts
Reference by semantic version: v1.2.0 instead of hardcoded strings
A/B testing different prompt versions becomes trivial
Rollback capability when a prompt version underperforms

2. Deployment Strategy Based on Use Case

SPCS for Streaming:

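# FastAPI in SPCS
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/process-claim/stream")
async def stream_claim_analysis(claim_text: str):
    # Load prompt from registry (registry and llm_client are created elsewhere)
    prompt_template = registry.get_model("claim_analyzer").version("v2.1")

    async def token_stream():
        # Stream tokens back to user
        async for token in llm_client.stream(
            prompt=prompt_template.render(claim=claim_text)
        ):
            yield token

    # A path operation cannot yield directly; wrap the generator in a streaming response
    return StreamingResponse(token_stream(), media_type="text/plain")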

SPCS for Calling Stored Procedures:

# FastAPI in SPCS
@app.post("/process-claim/llm-workflow")
async def workflow_claim_analysis(claim_text: str):
    # Call stored procedure containing an LLM workflow
    return session.call("claim_analysis_sproc", claim_text)

Stored Procedures for Workflows:

def process_claims_workflow(session: Session, claim_ids: list):
    # Load multiple prompt versions
    extractor = registry.get_model("extractor_prompt").version("v1.2")
    classifier = registry.get_model("classifier_prompt").version("v2.0")

    # Sequential processing with business logic
    for claim_id in claim_ids:
        extracted = extract_with_prompt(extractor, claim_id)

        # Business logic between LLM calls
        if extracted['amount'] > 100000:
            classification = classify_high_value(classifier, extracted)
        else:
            classification = classify_standard(classifier, extracted)

        # Structured output to table
        save_to_snowflake(claim_id, classification)

3. Dual Evaluation Strategy

Prompt Template Level:

# Experiment Tracking for prompt engineering
with ExperimentTracking(name="claim_extractor_prompts") as exp:
    # Test different prompt versions
    for version in ["v1.0", "v1.1", "v2.0"]:
        prompt = registry.get_model("extractor").version(version)
        results = evaluate_on_test_set(prompt)
        exp.log_metrics({
            f"{version}_accuracy": results['accuracy'],
            f"{version}_latency": results['latency']
        })

# TruLens for prompt quality
tru_prompt = TruCustomApp(prompt_evaluator, app_id="prompt_v2.0")
feedback = tru_prompt.run_feedback_functions(test_cases)

Workflow Level:

from snowflake import telemetry
def claim_processing_workflow(claim_data):
    telemetry.set_span_attribute("example.proc.do_tracing", "begin")

    extracted = extract_claims(claim_data)
    classified = classify_claims(extracted)

    telemetry.add_event(
        "event_with_attributes", 
        {
            "example.extracted": extracted,
            "example.classified": classified 
        }
    )

    return summarize_claims(classified)

# Traces go to Snowflake AI Observability

Additional Considerations

Version Pinning Strategy:

# In stored procedure, pin versions explicitly
PROMPT_VERSIONS = {
    'extractor': 'v1.2.0',
    'classifier': 'v2.1.0',
    'summarizer': 'v1.0.1'
}

# This makes your workflow reproducible and testable

Migration Path: You could even start with stored procedures and promote successful workflows to SPCS when you need real-time:

Develop workflow in stored procedure (easier to iterate)
Test with TruLens
Once stable, wrap in FastAPI container for SPCS
Same Model Registry artifacts, different deployment

Cost Optimization:

SPCS: Pay for uptime (good for high-frequency, low-latency)
Stored Procedures: Pay per execution (good for batch, scheduled jobs)

This architecture gives you the flexibility to choose the right deployment model without rewriting your prompts or evaluation logic.


Managing LLM Prompts With Snowflake Model Registry

Johann Hagerer — Sun, 09 Nov 2025 17:11:05 +0000

Large language models are transforming how we build applications, but managing prompts across different environments and versions can quickly become chaotic. What if you could version your prompts the same way you version your ML models? In this article, I'll show you how to leverage Snowflake's Model Registry to create a robust, versioned prompt management system that treats your prompts as first-class artifacts.

Setting Up Your Prompt Configuration

Before we can register anything, we need to define our prompt and its parameters. Here, we focus on structured output prompts which return a predefined JSON schema given by a Pydantic BaseModel.

import pydantic
from typing import Literal

# Define the prompt template with placeholder for dynamic content
prompt_template = """Classify the type of text given as follows: {given_text}"""

# List of parameters that will be injected into the prompt
prompt_parameters = ["given_text"]

# Version control for your prompt
prompt_name = "text_classification"
prompt_version = "v1"

# LLM configuration parameters
temperature = 0.0
max_tokens = 15_000
model = "claude-4-0-sonnet"

# Define the expected output structure using Pydantic
class StructuredOutput(pydantic.BaseModel):
    text_category: Literal["scientific_paper", "newspaper_article"] = pydantic.Field(
        description="Is the given text a newspaper article ('newspaper_article') or a scientific paper ('scientific_paper')?",
    )

Serialize the Prompt Configuration

The model context is where we serialize our prompt configuration into files that Snowflake can track. This step is crucial because it transforms your prompt from ephemeral code into persistent artifacts.

import json
from snowflake.ml.model import custom_model

config_file: str = "config.json"
prompt_template_file: str = "prompt_template.txt"
json_schema_file: str = "json_schema.json"

LlmConfig = dict[str, str | float | int | bool]  # stupid pandas limitation

# Bundle all LLM parameters into a single configuration dictionary
llm_params: LlmConfig = dict(
    prompt_name=prompt_name,
    prompt_version=prompt_version,
    temperature=temperature,
    max_tokens=max_tokens,
    model=model,
    prompt_parameters=json.dumps(prompt_parameters),
    json_schema_file=json_schema_file,
    prompt_template_file=prompt_template_file,
)

# Write configuration to disk for model context
with open(config_file, "w") as f:
    json.dump(llm_params, f)

# Save the prompt template as a separate file for easy editing
with open(prompt_template_file, "w") as f:
    f.write(prompt_template)

# Export the Pydantic schema for structured output validation
with open(json_schema_file, "w") as f:
    json.dump(StructuredOutput.model_json_schema(), f, indent=2)

# Create model context that bundles all files together
model_context = custom_model.ModelContext(
    config_file=config_file,
    prompt_template_file=prompt_template_file,
    json_schema_file=json_schema_file,
)

Sources:

PyCaret Example on GitHub

Building the Custom Prompt Model

Now we implement the actual model class that will handle prompt generation. This class reads your configuration and creates properly formatted prompts with all the necessary metadata attached.

import pandas as pd
import polars as pl
import json
from snowflake.ml.model import custom_model

class PromptModel(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)

        # Load all configuration files from the model context
        config_file = self.context.path("config_file")
        json_schema_file = self.context.path("json_schema_file")
        prompt_template_file = self.context.path("prompt_template_file")

        with open(config_file) as f:
            self.config: LlmConfig = json.load(f)

        with open(json_schema_file) as f:
            self.json_schema = f.read()

        with open(prompt_template_file) as f:
            self.prompt_template = f.read()

# Instantiate the model with our context
prompt_model = PromptModel(model_context)

Sources:

Bring your own model types via serialized files | Snowflake Documentation
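The class above only loads the configuration; the create_prompts method that the next section calls is not shown. As a rough, dependency-free sketch of what such a method might do, the function below fills the template once per input row and attaches the LLM settings to each prompt. The standalone function form, the list-of-dicts row format, and the output keys are assumptions; the real method would operate on a pandas DataFrame inside PromptModel.

```python
import json


def create_prompts(
    rows: list[dict],
    prompt_template: str,
    config: dict,
    json_schema: str,
) -> list[dict]:
    """Fill the template for each input row and attach the LLM settings."""
    parameters: list[str] = json.loads(config["prompt_parameters"])
    prompts = []
    for row in rows:
        # Raises KeyError if an expected parameter column is missing
        values = {name: row[name] for name in parameters}
        prompts.append({
            "prompt": prompt_template.format(**values),
            "model": config["model"],
            "temperature": config["temperature"],
            "max_tokens": config["max_tokens"],
            "json_schema": json_schema,
        })
    return prompts


config = {
    "model": "claude-4-0-sonnet",
    "temperature": 0.0,
    "max_tokens": 15_000,
    "prompt_parameters": json.dumps(["given_text"]),
}
template = "Classify the type of text given as follows: {given_text}"

result = create_prompts([{"given_text": "Lorem ipsum ..."}], template, config, "{}")
print(result[0]["prompt"])
# Classify the type of text given as follows: Lorem ipsum ...
```

Because prompt_parameters is stored as a JSON string in the config, it has to be decoded before use, which mirrors how the configuration was serialized earlier.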

Testing Your Prompt Model

Before registering anything, let's verify that our prompt model works correctly. This step ensures you're not committing broken code to your registry.

import polars as pl

# Create sample input data
df = pl.DataFrame({
    "given_text": [
        "Lorem ipsum ..."
    ]
})

# Generate prompts from the input data
prompts_df = prompt_model.create_prompts(df.to_pandas())
prompts_df

Defining the Model Signature

The model signature tells Snowflake what inputs your model expects and what outputs it produces. This is essential for type safety and model validation.

from snowflake.ml.model.model_signature import infer_signature

# Automatically infer the signature from sample input/output
sig = infer_signature(
    input_data=df.to_pandas(),
    output_data=prompts_df
)
sig

Sources:

Specifying model signatures | Snowflake Documentation

Registering to Snowflake Model Registry

This step logs the prompt model to the registry, making it discoverable and versionable alongside your other ML artifacts.

import snowflake.snowpark as sp
from snowflake.ml.registry import Registry

# Get the active Snowflake session
session = sp.context.get_active_session()
snowml_registry = Registry(session)

# Log the model to the registry with all metadata
custom_mv = snowml_registry.log_model(
    prompt_model,
    model_name=prompt_name,
    version_name=prompt_version,
    conda_dependencies=["polars"],  # Specify runtime dependencies
    options={"relax_version": False},  # Enforce exact version matching
    signatures={"predict": sig},
    comment="A prompt for a motor vehicle (KFZ) liability completeness check",
)

Adding Custom Metadata

Finally, we attach our full configuration as metadata to the model version. This creates a complete audit trail of all parameters used for this specific prompt version.

import json

# Serialize configuration as JSON string
metadata_dict = json.dumps(llm_params)

# Attach metadata to the model version using SQL
session.sql(f"""
    ALTER MODEL {prompt_name} MODIFY VERSION {prompt_version}
    SET METADATA = $${metadata_dict}$$
""").collect()

With this approach, you now have a fully versioned, auditable prompt management system built directly into your Snowflake infrastructure. No more hunting through Git commits to find which prompt was used in production: it's all tracked in the same place as your models.
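
One caveat with the dollar-quoted metadata statement above: a literal $$ inside the JSON would end the quoted string early. The following sketch builds the same statement text with a simple guard. The helper name build_metadata_sql and the example values are assumptions for illustration; the statement itself is copied from the snippet above.

```python
import json


def build_metadata_sql(model_name: str, version_name: str, params: dict) -> str:
    """Build the ALTER MODEL statement text (hypothetical helper)."""
    metadata_json = json.dumps(params)
    # A literal "$$" in the payload would terminate the dollar-quoted string
    if "$$" in metadata_json:
        raise ValueError("metadata must not contain '$$'")
    return (
        f"ALTER MODEL {model_name} MODIFY VERSION {version_name} "
        f"SET METADATA = $${metadata_json}$$"
    )


sql = build_metadata_sql("my_prompt", "v1", {"temperature": 0.0})
print(sql)
```

The resulting string can then be passed to session.sql(...) as before.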

In future articles, I'll explore how to integrate this prompt management approach with Snowflake's Cortex LLM service for seamless inference, and how to leverage feature views to establish proper data lineage between your source data, prompt versions, and model predictions. This will complete the picture of end-to-end MLOps for LLM applications within the Snowflake ecosystem.

Collaborative GenAI Projects - Simple Best Practices

Johann Hagerer — Sat, 25 Oct 2025 06:45:11 +0000

This article provides a simple template in case you want to experiment with LLM workflows. As an example, we load unstructured data, such as PDF files, into tables as structured data using Python. As preparation, you might want to read this article of mine before you go ahead: LLM Coding Concepts: Static Typing, Structured Output, and Async.

Project Directory Structure

project-name/
├── data/
│   ├── logs/
│   │   └── {username}.jsonl    # logs from your LLM calls, append only
│   ├── experiments/
│   │   └── {username}.jsonl 
│   ├── traces/
│   │   └── {username}.jsonl    # traces from your function calls, append only
│   ├── raw/
│   │   ├── document01.pdf
│   │   ├── document02.pdf
│   │   └── ...
│   └── transformed/
│      ├── 01_extracted_texts.parquet
│      └── 02_texts_with_markup.parquet
├── data_ops/
│   ├── helpers/
│   └── transformations/        # scripts to convert one dataset to another one
│      ├── 01_extract_texts.py
│      └── 02_add_markup.py
├── llm_ops/
│   ├── helpers/
│   │   └── tool_calls.py       # contains database calls on the parquet files e.g.
│   ├── steps/
│   │   ├── processing_step_1   # rename accordingly
│   │   │   ├── config.py       # LLM parameters
│   │   │   ├── prompt.py       # the prompt template
│   │   │   ├── base_model.py   # a Pydantic BaseModel for structured output definition
│   │   │   ├── run.py          # a run function putting it all together
│   │   │   └── report.ipynb    # a Jupyter notebook to develop and evaluate the prompt
│   │   └── processing_step_2   # ....
│   └── workflows/
│      ├── workflow_1
│      │  ├── run.py            # imports and uses several LLM steps and helpers
│      │  └── report.ipynb      # imports and uses several LLM steps and helpers
│      └── agentic_workflow_1
│         ├── run.py            # imports and uses several LLM steps and helpers
│         └── report.ipynb      # imports and uses several LLM steps and helpers
├── notebooks/
│   └── playground.ipynb        # from here you can run whole transformations
├── streamlit_app/
│   ├── main.py
│   └── helpers.py              # might contain data access to pre-processed document tables
├── README.md
└── pyproject.toml

The llm_ops steps and workflows contain run files which contain run() functions. For LLM steps, these receive the prompt parameters as inputs and yield the raw LLM outputs. For LLM workflows, these can call several LLM steps and implement business logic to combine them.
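
To make the split between steps and workflows concrete, here is a minimal sketch. The template text, the fake_llm stand-in, and the skip-empty-documents rule are all assumptions for illustration; in a real project the template would live in prompt.py and the LLM call would go to your provider.

```python
from typing import Callable

# Would normally live in the step's prompt.py
PROMPT_TEMPLATE = "Summarize the following text in one sentence:\n{text}"


def run(prompt_parameters: dict[str, str], call_llm: Callable[[str], str]) -> str:
    """LLM step: fill the template and return the raw LLM output."""
    prompt = PROMPT_TEMPLATE.format(**prompt_parameters)
    return call_llm(prompt)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real provider call
    return f"(LLM answer to {len(prompt)} characters of prompt)"


def run_workflow(texts: list[str]) -> list[str]:
    """LLM workflow: call one or more steps and add business logic."""
    # Example business rule: skip empty documents
    return [run({"text": t}, fake_llm) for t in texts if t.strip()]


outputs = run_workflow(["First document.", "   ", "Second document."])
print(outputs)
```

Passing the LLM call in as a function keeps the step easy to test without network access.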

To develop and improve prompts and workflows iteratively, you perform prompt engineering in the report.ipynb file of the respective step or workflow directory. You can leave the prompt history in an archive subfolder. You may start experimenting by creating a dataset inside of the notebook. When finished, you may consider moving the prompt development dataset to the data/datasets directory as a JSON file.

The data_ops directory contains transformation scripts. These run llm_ops functions on whole tables, resulting in corresponding LLM output tables saved in the data directory.

The data directory contains large data files, which should be added to Git using Git Large File Storage (Git LFS).

To initialize the whole project with a managed virtual environment and a pyproject.toml containing all package dependencies, it is advised to use uv:

Install uv.
Run uv init.
Add packages via uv add polars duckdb ...

Logging & Tracing

Logs are used to keep track of prompt engineering, i.e., to calculate metrics such as accuracy or LLM-as-a-judge scores. Traces are used for debugging LLM workflows, especially to see in which order functions were called, with which arguments, and how long each call took.

An LLM log entry and a function trace both record the raw LLM inputs and outputs, but they differ in what else they track:

A trace records the parent-child relationships of function calls, so you know which function called which other functions.
A trace records all raw function parameters.
A log records more specialized LLM parameters, such as the model name.
A log also contains the target label, so you can calculate classification metrics.

Logs

For logs, it is advised to save the following properties for each LLM call, so that you can calculate metrics such as accuracy, F1, or coherence afterwards.

from typing import Any

import pydantic

class LlmLogEntry(pydantic.BaseModel):
    id: str              # unique UUID for this entry
    experiment_id: str   # unique id for the experiment
    start_time: str
    end_time: str
    duration: float
    prompt_template: str # the raw prompt template without parameters replaced
    prompt_parameters: dict[str, str]  # the prompt parameters to be inserted into the prompt
    llm_output: dict | str
    provider_name: str   # the LLM provider name (Mistral/OpenAI/...)
    model_name: str      # the LLM used
    temperature: float   # temperature used for the LLM
    top_p: float         # top_p param used for the LLM
    json_schema: str     # the structured output schema for this LLM call
    labels: dict[str, Any] # the gold standard for the output JSON fields, in case classification was performed
    dataset_name: str    # the name of the dataset from which the sample has been drawn
    dataset_version: str # the version of the dataset

A log entry is saved each time the LLM is called with the respective prompt during an experiment run.
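
As a minimal sketch of such append-only logging, the helper below wraps one LLM call and writes one JSON line per call. It uses a plain dict instead of the Pydantic model to stay dependency-free, and the function name log_llm_call, the lambda standing in for the LLM, and the temporary file path are assumptions for illustration.

```python
import json
import tempfile
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path


def log_llm_call(log_file: Path, experiment_id: str, prompt_template: str,
                 prompt_parameters: dict[str, str], call_llm, **llm_params) -> str:
    """Call the LLM once and append one log entry as a JSON line (hypothetical helper)."""
    start = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    output = call_llm(prompt_template.format(**prompt_parameters))
    entry = {
        "id": str(uuid.uuid4()),
        "experiment_id": experiment_id,
        "start_time": start.isoformat(),
        "end_time": datetime.now(timezone.utc).isoformat(),
        "duration": time.perf_counter() - t0,
        "prompt_template": prompt_template,
        "prompt_parameters": prompt_parameters,
        "llm_output": output,
        **llm_params,  # e.g. model_name, temperature, top_p
    }
    log_file.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" keeps the file append-only
    with log_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return output


# Demo with a temporary file instead of data/logs/{username}.jsonl
log_file = Path(tempfile.mkdtemp()) / "logs" / "demo_user.jsonl"
for text in ["What a happy day!", "What a sad day!"]:
    log_llm_call(log_file, "exp-1", "Sentiment of: {some_text}", {"some_text": text},
                 call_llm=lambda prompt: "positive", model_name="demo-model", temperature=0.0)

entries = [json.loads(line) for line in log_file.read_text(encoding="utf-8").splitlines()]
print(len(entries))
```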

You can convert the LLM experiment logs to a report with custom metrics, such that you see the prompt accuracy:

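import polars as pl

# Load the append-only log of one user (here: "jhr")
df = pl.read_ndjson("data/logs/jhr.jsonl")

# One result row per experiment configuration
df = df.group_by([
    "experiment_id",
    "prompt_template",
    "temperature",
    "top_p",
    "provider_name",
    "model_name",
    "json_schema",
    "dataset_name",
    "dataset_version"
]).agg(
    # Assumes the log also stores the rendered prompt in a "prompt" column,
    # which LlmLogEntry above does not define
    sum_input_chars = pl.col("prompt").str.len_chars().sum(),
    # Share of calls whose output equals the gold labels exactly
    accuracy = (pl.col("labels") == pl.col("llm_output")).mean(),
)

df.write_ndjson("data/experiments/jhr.jsonl")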

Traces

For traces, it is advised to save the following properties for each function call along the stack, such that you can perform runtime analyses and error tracking.

Traces are an advanced concept that is mainly helpful for tool calling. They can be deprioritized in favor of logs.

import pydantic

class TraceEntry(pydantic.BaseModel):
    id: str                   # unique UUID for this entry
    run_id: str               # unique UUID for this whole run
    start_time: str | None
    end_time: str | None
    duration: float | None
    parent_id: str | None     # the id of the calling function call
    func_name: str | None
    output_dict: dict | None
    exception_stacktrace: str | None
    kwargs: str | None        # keyword arguments passed to this function call
    args: str | None          # positional arguments passed to this function call
    json_schema: str | None   # the structured output schema for this LLM call
    username: str | None
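
One possible way to collect such entries is a decorator that tracks the current call in a context variable, so nested calls know their parent. This is a sketch under assumptions: the names traced, TRACES, and the two demo functions are hypothetical, only a subset of the fields above is filled, and entries are kept in memory rather than written to data/traces/{username}.jsonl.

```python
import contextvars
import functools
import time
import traceback
import uuid

TRACES: list[dict] = []  # in-memory stand-in for the traces file
_current_call_id = contextvars.ContextVar("current_call_id", default=None)


def traced(func):
    """Record one trace entry per call, including the id of the calling function call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = {
            "id": str(uuid.uuid4()),
            "parent_id": _current_call_id.get(),  # None for the outermost call
            "func_name": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "exception_stacktrace": None,
        }
        token = _current_call_id.set(entry["id"])
        t0 = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            entry["exception_stacktrace"] = traceback.format_exc()
            raise
        finally:
            entry["duration"] = time.perf_counter() - t0
            _current_call_id.reset(token)  # restore the caller as current call
            TRACES.append(entry)
    return wrapper


@traced
def lookup(key: str) -> str:
    return key.upper()


@traced
def workflow(key: str) -> str:
    return lookup(key) + "!"


result = workflow("abc")
print([(t["func_name"], t["parent_id"] is not None) for t in TRACES])
```

Entries are appended when a call finishes, so inner calls appear before their callers; sort by start time if you prefer call order.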

Experiment Tracking

First, you define a generic ExperimentDefinition base class that encapsulates everything needed to run and evaluate an LLM experiment: the prompt template and its version, typed input/output models, LLM configuration, and the dataset with its annotations. By making it generic over InputType and OutputType, you can reuse this structure across different tasks while keeping inputs, predictions, and annotations fully typed. The predict method iterates over the inputs and collects structured outputs, while calc_metrics computes classification metrics for the fields you specify. The result is a self-contained, reproducible experiment object that can be serialized in one call.

from abc import ABC
from datetime import datetime
from typing import Generic, TypeVar

from pydantic import BaseModel, Field

InputType = TypeVar("InputType", bound=BaseModel)
OutputType = TypeVar("OutputType", bound=BaseModel)

class ExperimentDefinition(BaseModel, ABC, Generic[InputType, OutputType]):
    """Base class for experiments."""

    task_name: str
    prompt_template_str: str
    prompt_template_version: str
    input_model: type[InputType]
    output_model: type[OutputType]
    llm_model: str = "claude-4-sonnet"
    temperature: float = 0.0
    max_tokens: int = 20_000
    inputs: list[InputType]
    predictions: list[OutputType] = []
    annotations: list[OutputType] = []
    dataset_version: str
    classification_tasks: list[str] = []
    run_id: str = Field(
        default_factory=lambda: datetime.now().strftime("%y%m%d%H%M%S")
    )
    metrics: dict[str, float] = {}  # empty until calc_metrics has run

    def predict(self) -> list[OutputType]:
        # Run the LLM once per input and collect the structured outputs
        for x in self.inputs:
            self.predictions.append(self._predict(x))
        return self.predictions

    def _predict(self, input: InputType) -> OutputType:
        pass # TODO: your own predict LLM function call

    def calc_metrics(self) -> None:
        # One metric per output field listed in classification_tasks
        for task in self.classification_tasks:
            pass # TODO: your own metric, stored in self.metrics
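
The body of calc_metrics is left open above. As one simple option, per-field accuracy can be computed by comparing predictions and annotations field by field. The sketch below uses a dataclass as a stand-in for the Pydantic output model to stay dependency-free; the names SentimentOutput and accuracy_per_field and the metric key format are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SentimentOutput:
    """Stand-in for a Pydantic output model."""
    sentiment: str
    reason: str


def accuracy_per_field(predictions: list, annotations: list, fields: list[str]) -> dict[str, float]:
    """Share of samples whose predicted field equals the annotated field."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must have the same length")
    metrics = {}
    for field in fields:
        hits = sum(
            getattr(p, field) == getattr(a, field)
            for p, a in zip(predictions, annotations)
        )
        metrics[f"{field}_accuracy"] = hits / len(predictions) if predictions else 0.0
    return metrics


predictions = [SentimentOutput("positive", "cheerful wording"), SentimentOutput("positive", "unclear")]
annotations = [SentimentOutput("positive", ""), SentimentOutput("negative", "")]
metrics = accuracy_per_field(predictions, annotations, ["sentiment"])
print(metrics)  # {'sentiment_accuracy': 0.5}
```

Free-text fields such as reason are deliberately left out of the field list, since exact string comparison says little about their quality.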

Then, you put this into practice for a concrete task. You define your Input and Output models (here for sentiment analysis) and instantiate the experiment with your prompt, dataset samples, gold-standard annotations, and the list of output fields to evaluate. After calling predict and calc_metrics, the entire experiment, including configuration, predictions, and metrics, is appended as a single JSON line to a user-specific experiments file. This makes it straightforward to compare runs across prompt versions, models, or dataset revisions.


from typing import Literal

class Input(BaseModel):
    some_text: str

class Output(BaseModel):
    sentiment: Literal["positive", "negative"]
    reason: str

exp_def = ExperimentDefinition(
    task_name="sentiment_analysis",
    prompt_template_str="Tell the sentiment of the following text and give a reason: \n{some_text}",
    prompt_template_version="v16",
    input_model=Input,
    output_model=Output,
    inputs=[
        Input(some_text="What a happy day!"),
        Input(some_text="What a sad day!"),
    ],
    annotations=[
        Output(sentiment="positive", reason=""),
        Output(sentiment="negative", reason=""),
    ],
    dataset_version="v11",
    classification_tasks=["sentiment"], # maps to Output field names
    llm_model="llama3.1-8b",
)
# execute your experiment...
...

exp_def.predict()
exp_def.calc_metrics()

username = "jhr"  # replace with your own user name

# Append the run as one JSON line; the model classes themselves are left out
with open(f"./data/experiments/{username}.jsonl", "a") as f:
    f.write(exp_def.model_dump_json(exclude={"input_model", "output_model"}) + "\n")
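
To compare runs afterwards, you can read the experiments file back and rank the runs by a metric. This sketch assumes each run was written as one JSON object per line and that the metrics dict contains a key such as sentiment_accuracy; the helper name rank_runs, the metric key, and the two demo records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def rank_runs(experiments_file: Path, metric: str) -> list[tuple[str, str, float]]:
    """Return (run_id, prompt_template_version, metric value), best run first."""
    rows = []
    for line in experiments_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        exp = json.loads(line)
        rows.append((exp["run_id"], exp["prompt_template_version"], exp["metrics"].get(metric, 0.0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Demo with two made-up runs in a temporary file
demo_file = Path(tempfile.mkdtemp()) / "demo_user.jsonl"
demo_runs = [
    {"run_id": "260101120000", "prompt_template_version": "v15", "metrics": {"sentiment_accuracy": 0.5}},
    {"run_id": "260102120000", "prompt_template_version": "v16", "metrics": {"sentiment_accuracy": 1.0}},
]
demo_file.write_text("".join(json.dumps(run) + "\n" for run in demo_runs), encoding="utf-8")

ranking = rank_runs(demo_file, "sentiment_accuracy")
print(ranking[0])
```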
