<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: kasi viswanath vandanapu</title>
    <description>The latest articles on Forem by kasi viswanath vandanapu (@kasi_viswanath).</description>
    <link>https://forem.com/kasi_viswanath</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3776421%2Ff779b863-7e59-4087-b9a0-bf58c25bb092.jpg</url>
      <title>Forem: kasi viswanath vandanapu</title>
      <link>https://forem.com/kasi_viswanath</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kasi_viswanath"/>
    <language>en</language>
    <item>
      <title>SQL Comparison Library Architecture</title>
      <dc:creator>kasi viswanath vandanapu</dc:creator>
      <pubDate>Wed, 01 Apr 2026 03:58:10 +0000</pubDate>
      <link>https://forem.com/kasi_viswanath/sql-comparison-library-architecture-3mff</link>
      <guid>https://forem.com/kasi_viswanath/sql-comparison-library-architecture-3mff</guid>
      <description>&lt;h2&gt;
  
  
  Purpose
&lt;/h2&gt;

&lt;p&gt;Design a deterministic-first library that compares SQL and results on one database instance, explains mismatches, and optionally adds AI judgment.&lt;/p&gt;

&lt;p&gt;The library evaluates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user_query&lt;/li&gt;
&lt;li&gt;actual_sql&lt;/li&gt;
&lt;li&gt;expected_sql&lt;/li&gt;
&lt;li&gt;actual_result&lt;/li&gt;
&lt;li&gt;expected_result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The library does not claim universal semantic equivalence. It reports behavior on the evaluated database context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Principles
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic before AI: deterministic metrics are the source of truth; AI is advisory.&lt;/li&gt;
&lt;li&gt;Structure and result are separate: SQL form and output correctness are scored independently.&lt;/li&gt;
&lt;li&gt;Diagnostics over raw score: every important score must include a reason and evidence.&lt;/li&gt;
&lt;li&gt;Configurable semantics: set, multiset, ordered, and numeric tolerance modes are first-class.&lt;/li&gt;
&lt;li&gt;Production-safe outputs: return machine-friendly schema and human-readable explanations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;The system is organized into five layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validation&lt;/li&gt;
&lt;li&gt;Structural comparison&lt;/li&gt;
&lt;li&gt;Result comparison&lt;/li&gt;
&lt;li&gt;Diagnostic attribution&lt;/li&gt;
&lt;li&gt;Optional AI judge
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  INPUT                                                              │
│  user_query · actual_sql · expected_sql                             │
│  actual_result · expected_result                                    │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1 · Validation                                               │
│  ├── Parse actual_sql → AST                                         │
│  ├── Parse expected_sql → AST                                       │
│  ├── Canonicalize SQL strings                                       │
│  ├── Execute both queries in a read-only sandbox                    │
│  ├── Capture DML errors · wall-clock timings · schemas              │
│  └── Materialize result sets                                        │
└───────────┬─────────────────────────────┬───────────────────────────┘
            │ parse + exec ok                 │ parse or exec fail
            ▼                                 │
┌───────────────────────────────────────────┬─────────────────────────┐
│  LAYER 2 · Structural Comparison          │                         │
│  ├── normalized_sql_match  [boolean gate] │                         │
│  ├── clause_match          [diagnostic]   │                         │
│  └── clause_weighted_distance  [score]    │                         │
└───────────┬───────────────────────────────┘                         │
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 3 · Result Comparison                                         │
│  ├── result_equality_family  (exact / order-insensitive / ordered)   │
│  ├── schema_match                                                    │
│  ├── row_overlap · cell_overlap  [graded quality]                    │
│  ├── cardinality_match                                               │
│  ├── numeric_tolerance_match  (atol / rtol)                          │
│  └── null_handling_match  (3VL)                                      │
└───────────┬─────────────────────────────────────────────────────────┤
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 4 · Diagnostic Attribution                                    │
│  ├── projection / filter / join error scores                         │
│  ├── grouping / aggregate function error scores                      │
│  ├── ordering flag · top-k score · cardinality explosion flag        │
│  ├── per_column_mismatch_map  [explanation layer]                    │
│  └── confidence · evidence_strength · ambiguous_case                │
└───────────┬─────────────────────────────────────────────────────────┤
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 5 · Optional AI Judge                                         │
│  ├── intent_adequacy_judgment                                        │
│  ├── database_scoped_equivalence_judgment                            │
│  ├── acceptable_deviation_judgment                                   │
│  ├── short_rationale · evidence_bullets · decision_summary           │
│  ├── severity_classification                                         │
│  └── confidence · evidence_strength · ambiguous_case                │
└───────────┬─────────────────────────────────────────────────────────┘
            │◄──────────────────────────── parse or exec fail path ───┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  FINAL OUTPUT                                                        │
│  deterministic_verdict · ai_advisory_verdict                        │
│  verdict_consistency_flag · overall_score · severity                │
│  structure_score · result_score · diagnostic_scores                 │
│  explanations · mismatch_map · warnings                             │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1) Validation Layer
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate the request payload and runtime config.&lt;/li&gt;
&lt;li&gt;Parse actual_sql and expected_sql into ASTs using a dialect-aware SQL parser.&lt;/li&gt;
&lt;li&gt;Canonicalize SQL strings for structural comparison.&lt;/li&gt;
&lt;li&gt;Execute both queries in a read-only execution sandbox.&lt;/li&gt;
&lt;li&gt;Capture DML runtime errors, wall-clock timings, materialized result sets, and output relation schemas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output type: booleans + errors&lt;/p&gt;

&lt;p&gt;Internal intermediates vs external outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ast_actual, ast_expected, actual_result_table, expected_result_table, actual_schema, and expected_schema are internal intermediates consumed by downstream layers. They are not included in the final report unless explicitly requested via config.&lt;/li&gt;
&lt;li&gt;All other Validation outputs (parse_success, execution_success, execution_error, execution_time, normalized_sql) are external-facing and appear in the final report.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Primary outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parse_success_actual: True if actual_sql produces a valid AST with no syntax errors.&lt;/li&gt;
&lt;li&gt;parse_success_expected: True if expected_sql produces a valid AST with no syntax errors.&lt;/li&gt;
&lt;li&gt;normalized_sql_actual: Canonicalized SQL string of actual_sql after normalization.&lt;/li&gt;
&lt;li&gt;normalized_sql_expected: Canonicalized SQL string of expected_sql after normalization.&lt;/li&gt;
&lt;li&gt;ast_actual: Abstract Syntax Tree (AST) of actual_sql.&lt;/li&gt;
&lt;li&gt;ast_expected: Abstract Syntax Tree (AST) of expected_sql.&lt;/li&gt;
&lt;li&gt;execution_success_actual: True if actual_sql executes without a DML runtime error.&lt;/li&gt;
&lt;li&gt;execution_success_expected: True if expected_sql executes without a DML runtime error.&lt;/li&gt;
&lt;li&gt;execution_error_actual: Error taxonomy category + raw DB error message for actual_sql, if any.&lt;/li&gt;
&lt;li&gt;execution_error_expected: Error taxonomy category + raw DB error message for expected_sql, if any.&lt;/li&gt;
&lt;li&gt;execution_time_actual_ms: Wall-clock execution time of actual_sql in milliseconds.&lt;/li&gt;
&lt;li&gt;execution_time_expected_ms: Wall-clock execution time of expected_sql in milliseconds.&lt;/li&gt;
&lt;li&gt;actual_result_table: Materialized result set (tuples) returned by actual_sql.&lt;/li&gt;
&lt;li&gt;expected_result_table: Materialized result set (tuples) returned by expected_sql.&lt;/li&gt;
&lt;li&gt;actual_schema: Output relation schema of actual_sql — attribute names and data types.&lt;/li&gt;
&lt;li&gt;expected_schema: Output relation schema of expected_sql — attribute names and data types.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2) Structural Comparison Layer
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare canonicalized SQL strings as a fast boolean gate.&lt;/li&gt;
&lt;li&gt;Perform per-clause AST comparison across all SQL clauses.&lt;/li&gt;
&lt;li&gt;Compute a clause-weighted structural divergence score.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output type: scores + clause diffs&lt;/p&gt;

&lt;p&gt;Metric role hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;normalized_sql_match: fast boolean gate — if true, deeper structural comparison can be skipped.&lt;/li&gt;
&lt;li&gt;clause_match: diagnostic detail — reveals per-clause divergence regardless of string match outcome.&lt;/li&gt;
&lt;li&gt;clause_weighted_distance: structural score — aggregated from per-clause weighted comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Primary outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;normalized_sql_match: Boolean gate — True if canonicalized SQL strings are identical after normalization.&lt;/li&gt;
&lt;li&gt;clause_match: Per-clause AST comparison across SELECT (projection), WHERE (selection predicate), HAVING (post-aggregation predicate), JOIN (join graph and join type), GROUP BY (aggregation grain), ORDER BY, LIMIT/OFFSET (top-k), DISTINCT, WINDOW/OVER, and set operators (UNION/INTERSECT/EXCEPT).&lt;/li&gt;
&lt;li&gt;clause_weighted_distance: Clause-weighted structural divergence score; 0 = identical, higher = more divergent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3) Result Comparison Layer
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare materialized result sets under configurable set, bag (multiset), or ordered semantics.&lt;/li&gt;
&lt;li&gt;Evaluate relation schema alignment, graded tuple-level and cell-level overlap, cardinality delta, three-valued logic (3VL) NULL consistency, and numeric tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output type: scores + mismatch maps&lt;/p&gt;

&lt;p&gt;Result equality family:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use one metric family, result_equality_family, with an explicit comparison_mode.&lt;/li&gt;
&lt;li&gt;comparison_mode selects evaluation semantics: exact / order-insensitive / order-sensitive.&lt;/li&gt;
&lt;li&gt;result_equality_family.mode_pass is the single boolean pass/fail for the selected mode.&lt;/li&gt;
&lt;li&gt;Optional mode_details can include non-primary mode outcomes for debugging, but they do not act as independent peer metrics.&lt;/li&gt;
&lt;li&gt;row_overlap and cell_overlap are graded quality metrics measuring partial match degree.&lt;/li&gt;
&lt;li&gt;per_column_mismatch_map is the explanation layer and lives in Layer 4 Diagnostic Attribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Primary outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;result_equality_family: Mode-driven result comparison object:

&lt;ul&gt;
&lt;li&gt;comparison_mode: exact | order-insensitive | order-sensitive.&lt;/li&gt;
&lt;li&gt;mode_pass: Boolean verdict for the selected comparison_mode.&lt;/li&gt;
&lt;li&gt;mode_details (optional): non-primary mode outcomes retained for diagnostics only.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;schema_match: Relation schema comparison — attribute count, attribute names, data types, and optional attribute ordering.&lt;/li&gt;

&lt;li&gt;row_overlap: Graded quality — tuple-level overlap expressed as Jaccard similarity or F1 score (precision + recall over shared tuples).&lt;/li&gt;

&lt;li&gt;cell_overlap: Graded quality — value-level (attribute × tuple) overlap after alignment; measures localized mismatch granularity.&lt;/li&gt;

&lt;li&gt;cardinality_match: Cardinality comparison — absolute row-count delta and relative cardinality ratio between result sets.&lt;/li&gt;

&lt;li&gt;numeric_tolerance_match: Floating-point and decimal comparison using atol and rtol thresholds.&lt;/li&gt;

&lt;li&gt;null_handling_match: True if NULL placement and three-valued logic (3VL) behavior is consistent across result sets.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  4) Diagnostic Attribution Layer
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attribute result mismatch to likely root SQL clauses using heuristic diagnostic scoring.&lt;/li&gt;
&lt;li&gt;Emit clause-level error scores with confidence and evidence strength.&lt;/li&gt;
&lt;li&gt;Detect ordering-only mismatch and cardinality fanout or compression anomalies.&lt;/li&gt;
&lt;li&gt;Produce the per-attribute mismatch explanation map.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output type: error categories + evidence&lt;/p&gt;

&lt;p&gt;Every diagnostic output includes: likely source clause, confidence (0.0–1.0), evidence_strength (low / medium / high), and ambiguous_case flag when multiple root clauses are plausible.&lt;/p&gt;

&lt;p&gt;Primary outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;projection_error_score: Estimated impact of differences in the SELECT projection list — missing, extra, or transformed attributes.&lt;/li&gt;
&lt;li&gt;filter_error_score: Estimated impact of predicate differences in WHERE (selection) and HAVING (post-aggregation filter).&lt;/li&gt;
&lt;li&gt;join_error_score: Estimated impact of join graph topology, join predicate (ON condition), or join type differences (INNER / LEFT / RIGHT / FULL / CROSS / ANTI / SEMI).&lt;/li&gt;
&lt;li&gt;grouping_error_score: Estimated impact of incorrect aggregation grain in the GROUP BY clause.&lt;/li&gt;
&lt;li&gt;aggregate_function_error_score: Estimated impact of wrong aggregate function (SUM vs COUNT vs AVG vs MIN vs MAX vs STDDEV).&lt;/li&gt;
&lt;li&gt;ordering_error_flag: True when mismatch is caused solely by tuple ordering; bag-semantics result sets are equal.&lt;/li&gt;
&lt;li&gt;limit_topk_error_score: Estimated impact of LIMIT, TOP, FETCH FIRST, OFFSET, or ROW_NUMBER top-k differences.&lt;/li&gt;
&lt;li&gt;cardinality_explosion_flag: True when result set cardinality deviates significantly from expected due to join fanout or over-filtering.&lt;/li&gt;
&lt;li&gt;per_column_mismatch_map: Explanation layer — attribute-level mismatch rate and severity breakdown explaining row_overlap and cell_overlap.&lt;/li&gt;
&lt;li&gt;confidence: 0.0–1.0 confidence in the primary diagnostic attribution.&lt;/li&gt;
&lt;li&gt;evidence_strength: low / medium / high — based on corroborating signal count and quality.&lt;/li&gt;
&lt;li&gt;ambiguous_case: True when multiple clauses are plausible root causes with similar evidence weight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5) Optional AI Judge Layer
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply LLM-as-judge evaluation grounded strictly in deterministic metric outputs.&lt;/li&gt;
&lt;li&gt;Assess natural-language intent adequacy relative to the user query.&lt;/li&gt;
&lt;li&gt;Judge database-scoped equivalence as observed on this database instance — not universal semantic equivalence.&lt;/li&gt;
&lt;li&gt;Classify deviation acceptability using domain and business context.&lt;/li&gt;
&lt;li&gt;Never issue a verdict that contradicts the deterministic verdict from Layers 1–4.&lt;/li&gt;
&lt;li&gt;Consume the optional &lt;code&gt;context&lt;/code&gt; field from the Request Model (schema metadata and business constraints) to ground judgment in domain-specific rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output type: advisory label + rationale&lt;/p&gt;

&lt;p&gt;Verdict structure — the final report separates three fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic_verdict: authoritative pass/fail derived from Layers 1–4.&lt;/li&gt;
&lt;li&gt;ai_advisory_verdict: the AI judge's advisory classification.&lt;/li&gt;
&lt;li&gt;verdict_consistency_flag: True if both verdicts agree; False when they diverge, which triggers a consistency warning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI output format — internal reasoning traces are never exposed; instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;short_rationale: one-to-two sentence summary of the AI judge's reasoning.&lt;/li&gt;
&lt;li&gt;evidence_bullets: ordered list of key deterministic signals the AI judge used.&lt;/li&gt;
&lt;li&gt;decision_summary: final label and justification only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Primary outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intent_adequacy_judgment: LLM-as-judge assessment of whether actual_sql satisfies user_query intent as well as expected_sql.&lt;/li&gt;
&lt;li&gt;database_scoped_equivalence_judgment: LLM assessment of observed equivalence on this database instance for the given task — does not claim universal semantic equivalence.&lt;/li&gt;
&lt;li&gt;acceptable_deviation_judgment: Grounded classification of whether mismatch is harmless (e.g., ordering-only), acceptable (minor rounding), or critical (wrong data).&lt;/li&gt;
&lt;li&gt;short_rationale: One-to-two sentence summary of the AI judge's reasoning.&lt;/li&gt;
&lt;li&gt;evidence_bullets: Ordered list of key deterministic signals used in the AI judgment.&lt;/li&gt;
&lt;li&gt;decision_summary: Final label and justification without internal reasoning traces.&lt;/li&gt;
&lt;li&gt;severity_classification: Severity triage label: pass / minor issue / moderate issue / major issue / critical failure.&lt;/li&gt;
&lt;li&gt;confidence: 0.0–1.0 confidence in the AI judgment.&lt;/li&gt;
&lt;li&gt;evidence_strength: low / medium / high.&lt;/li&gt;
&lt;li&gt;ambiguous_case: True when the AI judge cannot distinguish between two plausible classifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core Components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Request Model
&lt;/h3&gt;

&lt;p&gt;Input contract for one evaluation run.&lt;/p&gt;

&lt;p&gt;Suggested fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user_query: string&lt;/li&gt;
&lt;li&gt;actual_sql: string&lt;/li&gt;
&lt;li&gt;expected_sql: string&lt;/li&gt;
&lt;li&gt;config: evaluation options&lt;/li&gt;
&lt;li&gt;context: optional schema metadata and business constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  B. Execution Adapter
&lt;/h3&gt;

&lt;p&gt;DB adapter abstraction to support one configured database engine per run.&lt;/p&gt;

&lt;p&gt;Interface shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;execute(sql) -&amp;gt; execution metadata + materialized result&lt;/li&gt;
&lt;li&gt;explain(sql) -&amp;gt; optional plan metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sandbox isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries run inside a read-only transaction (or equivalent read-only session mode) to prevent side-effects.&lt;/li&gt;
&lt;li&gt;A configurable query timeout aborts long-running executions.&lt;/li&gt;
&lt;li&gt;On completion or timeout, the transaction is rolled back to ensure no persistent state changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C. SQL Parser and Normalizer
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parse SQL into an Abstract Syntax Tree (AST) using a dialect-aware parser (e.g., sqlglot).&lt;/li&gt;
&lt;li&gt;Canonicalize SQL string: lowercase keywords, collapse whitespace, normalize alias formatting.&lt;/li&gt;
&lt;li&gt;Normalize commutative expressions (e.g., AND operand ordering) where semantics are preserved.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  D. Structural Comparator
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-clause AST matcher across projection (SELECT), selection predicate (WHERE), join graph, aggregation grain (GROUP BY), post-aggregation predicate (HAVING), ordering (ORDER BY), top-k (LIMIT/OFFSET), and set operators.&lt;/li&gt;
&lt;li&gt;Clause-weighted distance calculator producing a normalized structural divergence score.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  E. Result Comparator
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relation schema comparator (attribute names, data types, attribute order).&lt;/li&gt;
&lt;li&gt;Ordered and unordered tuple comparators for set and bag (multiset) semantics.&lt;/li&gt;
&lt;li&gt;Cell-level (attribute × tuple) alignment and value matching.&lt;/li&gt;
&lt;li&gt;Floating-point comparator using atol/rtol tolerance thresholds.&lt;/li&gt;
&lt;li&gt;NULL-aware comparator respecting three-valued logic (3VL) semantics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  F. Diagnostic Engine
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convert raw structural and result divergence signals into clause-attributed error scores.&lt;/li&gt;
&lt;li&gt;Build per-attribute mismatch profiles from the output relation.&lt;/li&gt;
&lt;li&gt;Generate deterministic, evidence-backed explanations without LLM involvement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  G. Scoring and Severity Engine
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute category scores.&lt;/li&gt;
&lt;li&gt;Aggregate weighted overall score.&lt;/li&gt;
&lt;li&gt;Map to severity labels.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  H. Report Builder
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emit final output schema.&lt;/li&gt;
&lt;li&gt;Include evidence, warnings, and plain-language explanations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  I. Optional AI Judge Adapter
&lt;/h3&gt;

&lt;p&gt;Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Construct grounded LLM-as-judge prompts that embed deterministic metric evidence.&lt;/li&gt;
&lt;li&gt;Return structured AI judgments with confidence scores.&lt;/li&gt;
&lt;li&gt;Record model identifier, prompt version, and token usage for reproducibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Metric Processing Pipeline
&lt;/h2&gt;

&lt;p&gt;Recommended sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validation (parse checks + execution)&lt;/li&gt;
&lt;li&gt;Structural comparison metrics&lt;/li&gt;
&lt;li&gt;Result comparison metrics&lt;/li&gt;
&lt;li&gt;Diagnostic attribution metrics (computed after structural and result evidence is available)&lt;/li&gt;
&lt;li&gt;Optional performance metrics&lt;/li&gt;
&lt;li&gt;Optional AI judge metrics&lt;/li&gt;
&lt;li&gt;Final synthesis — compute deterministic_verdict, run AI judge, emit verdict_consistency_flag&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If either query fails parse or execution, continue producing partial diagnostics where possible and return explicit blocked-reason fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure-State Precedence Rules
&lt;/h3&gt;

&lt;p&gt;Apply deterministic precedence in this exact order during final synthesis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse failure precedence: if parse_success_actual = false or parse_success_expected = false, set deterministic_verdict = fail, set severity = critical failure, and mark downstream structural/result comparisons as blocked.&lt;/li&gt;
&lt;li&gt;Execution failure precedence: if both parse checks pass but execution_success_actual = false or execution_success_expected = false, set deterministic_verdict = fail, set severity = critical failure, and mark result comparison fields as blocked.&lt;/li&gt;
&lt;li&gt;Result-family precedence: if validation passes, evaluate result_equality_family.mode_pass for the configured comparison_mode. A false mode_pass sets deterministic_verdict = fail regardless of advisory AI output.&lt;/li&gt;
&lt;li&gt;Structural/diagnostic precedence: when result equality fails, use structure and diagnostic signals only for attribution and severity refinement; they never override deterministic fail to pass.&lt;/li&gt;
&lt;li&gt;Advisory precedence: ai_advisory_verdict can only annotate. It cannot overturn deterministic_verdict in any failure state.&lt;/li&gt;
&lt;li&gt;Performance metric precedence: if execution_success_actual = false or execution_success_expected = false, performance metrics (execution_time_ratio, resource_cost_score, plan_complexity_score) are blocked and set to null.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Configuration Model
&lt;/h2&gt;

&lt;p&gt;Key runtime options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;comparison_mode: exact / order-insensitive / order-sensitive&lt;/li&gt;
&lt;li&gt;numeric_abs_tolerance: float&lt;/li&gt;
&lt;li&gt;numeric_rel_tolerance: float&lt;/li&gt;
&lt;li&gt;null_equality_mode: strict/sql_semantic&lt;/li&gt;
&lt;li&gt;require_column_name_match: true/false&lt;/li&gt;
&lt;li&gt;require_column_order_match: true/false&lt;/li&gt;
&lt;li&gt;max_rows_for_cell_alignment: integer&lt;/li&gt;
&lt;li&gt;query_timeout_ms: integer (maximum wall-clock time per query execution before abort)&lt;/li&gt;
&lt;li&gt;enable_ai_judge: true/false&lt;/li&gt;
&lt;li&gt;enable_performance_metrics: true/false&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comparison mode mapping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact: schema, values, and row order must all match.&lt;/li&gt;
&lt;li&gt;order-insensitive: bag (multiset) semantics — duplicates count, row order ignored.&lt;/li&gt;
&lt;li&gt;order-sensitive: full ordered comparison including tuple sequence; required for top-k and pagination queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Suggested Output Contract
&lt;/h2&gt;

&lt;p&gt;Top-level fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic_verdict: authoritative pass/fail derived from Layers 1–4.&lt;/li&gt;
&lt;li&gt;ai_advisory_verdict: advisory classification from the AI judge (optional).&lt;/li&gt;
&lt;li&gt;verdict_consistency_flag: True if both verdicts agree; False triggers a consistency warning.&lt;/li&gt;
&lt;li&gt;overall_score: 0.0 when deterministic_verdict is fail due to parse or execution failure; weighted aggregate otherwise.&lt;/li&gt;
&lt;li&gt;structure_score&lt;/li&gt;
&lt;li&gt;result_score&lt;/li&gt;
&lt;li&gt;diagnostic_scores&lt;/li&gt;
&lt;li&gt;performance_scores&lt;/li&gt;
&lt;li&gt;severity&lt;/li&gt;
&lt;li&gt;explanations&lt;/li&gt;
&lt;li&gt;mismatch_map&lt;/li&gt;
&lt;li&gt;error_types: list of Error Taxonomy categories triggered during execution (maps to the DML error taxonomy).&lt;/li&gt;
&lt;li&gt;blocked_reason: present when downstream layers are blocked by an upstream failure; indicates which precedence rule triggered the block (e.g., "parse_failure" or "execution_failure").&lt;/li&gt;
&lt;li&gt;warnings&lt;/li&gt;
&lt;li&gt;validity: parse and execution outcomes&lt;/li&gt;
&lt;li&gt;evidence: clause-level and attribute-level details&lt;/li&gt;
&lt;li&gt;run_metadata: timings, db identifier, version, config hash&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scoring Strategy
&lt;/h2&gt;

&lt;p&gt;Use weighted aggregation with explicit defaults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validity gate: parse or execution failure sets deterministic_verdict to fail and overall_score to 0.0 before any weighted scoring runs.&lt;/li&gt;
&lt;li&gt;result gate: result_equality_family.mode_pass acts as a boolean gate — a false mode_pass sets deterministic_verdict to fail. It does not contribute a numeric weight.&lt;/li&gt;
&lt;li&gt;structure_score: weighted blend of normalized match, clause_match, and derived clause-weighted similarity (computed from clause_weighted_distance at scoring time).&lt;/li&gt;
&lt;li&gt;result_score: weighted blend of schema, overlap, cardinality, tolerance, and NULL handling (mode_pass is the gate, not a blend input).&lt;/li&gt;
&lt;li&gt;diagnostic_scores: not just penalties; include confidence and evidence count.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Derived metric note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clause_weighted_similarity is not stored as a primary output; it is derived when needed from clause_weighted_distance.&lt;/li&gt;
&lt;li&gt;execution_time_ratio is derived from execution_time_actual_ms / execution_time_expected_ms and is not independently stored in Validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Severity mapping example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pass&lt;/li&gt;
&lt;li&gt;minor issue&lt;/li&gt;
&lt;li&gt;moderate issue&lt;/li&gt;
&lt;li&gt;major issue&lt;/li&gt;
&lt;li&gt;critical failure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Error Taxonomy
&lt;/h2&gt;

&lt;p&gt;DML runtime execution error categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing_table: Referenced relation does not exist in the database catalog.&lt;/li&gt;
&lt;li&gt;missing_column: Referenced attribute does not exist in the relation schema.&lt;/li&gt;
&lt;li&gt;type_mismatch: Operand data types are incompatible for the operation or predicate.&lt;/li&gt;
&lt;li&gt;division_by_zero: Arithmetic expression produces a division-by-zero runtime error.&lt;/li&gt;
&lt;li&gt;invalid_aggregation: Aggregate function used in an invalid context (e.g., non-grouped attribute in SELECT without GROUP BY).&lt;/li&gt;
&lt;li&gt;ambiguous_reference: Attribute reference is ambiguous across joined relations without qualification.&lt;/li&gt;
&lt;li&gt;permission_error: Executing principal lacks read permission on the referenced relation or schema.&lt;/li&gt;
&lt;li&gt;unknown_error: Uncategorized runtime error; preserve raw DB error message for inspection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always store both the mapped taxonomy category and the raw database error message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic Diagnostics Strategy
&lt;/h2&gt;

&lt;p&gt;Detailed heuristic rules for the Diagnostic Engine (Component F). Layer 4 defines the output schema; this section defines the attribution logic.&lt;/p&gt;

&lt;p&gt;Heuristic clause-attribution signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Projection mismatch: output relation schema or SELECT expression differs while tuple cardinality is close.&lt;/li&gt;
&lt;li&gt;Predicate (filter) mismatch: high tuple overlap asymmetry correlated with WHERE/HAVING predicate divergence in the AST.&lt;/li&gt;
&lt;li&gt;Join mismatch: cardinality fanout or compression combined with join graph topology or join predicate divergence.&lt;/li&gt;
&lt;li&gt;Aggregation grain mismatch: aggregate score drift paired with incorrect or missing GROUP BY attributes.&lt;/li&gt;
&lt;li&gt;Ordering-only mismatch: bag-semantics result sets are equal but ordered result sets differ — ordering_error_flag is set.&lt;/li&gt;
&lt;li&gt;Top-k mismatch: result set divergence concentrated near the LIMIT/OFFSET truncation boundary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every diagnostic output must include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what differed (clause and attribute level)&lt;/li&gt;
&lt;li&gt;likely source clause (projection / selection predicate / join / aggregation grain / ordering / top-k)&lt;/li&gt;
&lt;li&gt;confidence score&lt;/li&gt;
&lt;li&gt;supporting evidence (cardinality delta, shared attribute overlap, predicate diff summary)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance and Explainability
&lt;/h2&gt;

&lt;p&gt;Optional performance metrics should be collected only when stable and comparable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;execution_time_ratio&lt;/li&gt;
&lt;li&gt;resource_cost_score&lt;/li&gt;
&lt;li&gt;plan_complexity_score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Include a warning when noisy conditions reduce comparability (cold cache, variable load, missing plan stats).&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Judge Guardrails
&lt;/h2&gt;

&lt;p&gt;Normative rules for AI judge behavior. Verdict structure and output format are defined in Layer 5.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI judge is optional and additive.&lt;/li&gt;
&lt;li&gt;AI output is always advisory — deterministic_verdict from Layers 1–4 takes precedence.&lt;/li&gt;
&lt;li&gt;AI cannot flip deterministic pass/fail flags.&lt;/li&gt;
&lt;li&gt;When ai_advisory_verdict disagrees with deterministic_verdict, verdict_consistency_flag is set to False and a consistency warning is included in the report.&lt;/li&gt;
&lt;li&gt;AI responses expose only: short_rationale, evidence_bullets, decision_summary. No internal reasoning trace is returned.&lt;/li&gt;
&lt;li&gt;Include model identifier, prompt version, and confidence score.&lt;/li&gt;
&lt;li&gt;Support policy-based disablement for production pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Core MVP
&lt;/h3&gt;

&lt;p&gt;Build first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parse success&lt;/li&gt;
&lt;li&gt;execution success and error type&lt;/li&gt;
&lt;li&gt;normalized SQL match&lt;/li&gt;
&lt;li&gt;clause match&lt;/li&gt;
&lt;li&gt;result_equality_family (mode-driven)&lt;/li&gt;
&lt;li&gt;schema match&lt;/li&gt;
&lt;li&gt;row overlap&lt;/li&gt;
&lt;li&gt;cardinality match&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Debugging Expansion
&lt;/h3&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;projection/filter/join/grouping/aggregate diagnostic scores&lt;/li&gt;
&lt;li&gt;ordering-only mismatch flag&lt;/li&gt;
&lt;li&gt;per-column mismatch map&lt;/li&gt;
&lt;li&gt;numeric tolerance match&lt;/li&gt;
&lt;li&gt;NULL handling match&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Production and Research Features
&lt;/h3&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performance metrics&lt;/li&gt;
&lt;li&gt;AI judge metrics&lt;/li&gt;
&lt;li&gt;richer severity policy profiles&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Metrics Glossary
&lt;/h2&gt;

&lt;p&gt;The Metrics Glossary is the normative reference for metric names, types, and descriptions. Layer sections provide architectural context and processing logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validation Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;parse_success_actual&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if actual_sql produces a valid AST with no syntax errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;parse_success_expected&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if expected_sql produces a valid AST with no syntax errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;normalized_sql_actual&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Canonicalized SQL string of actual_sql after normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;normalized_sql_expected&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Canonicalized SQL string of expected_sql after normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ast_actual&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;AST&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree (AST) of actual_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ast_expected&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;AST&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree (AST) of expected_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution_success_actual&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if actual_sql executes without a DML runtime error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution_success_expected&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if expected_sql executes without a DML runtime error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution_error_actual&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Error taxonomy category + raw DB error message for actual_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution_error_expected&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Error taxonomy category + raw DB error message for expected_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution_time_actual_ms&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Wall-clock runtime of actual_sql in milliseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution_time_expected_ms&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Wall-clock runtime of expected_sql in milliseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;actual_result_table&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;relation&lt;/td&gt;
&lt;td&gt;Materialized result set (tuples) returned by actual_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;expected_result_table&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;relation&lt;/td&gt;
&lt;td&gt;Materialized result set (tuples) returned by expected_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;actual_schema&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Output relation schema of actual_sql - attribute names and data types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;expected_schema&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Output relation schema of expected_sql - attribute names and data types&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Structural Comparison Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;normalized_sql_match&lt;/td&gt;
&lt;td&gt;Boolean gate&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if canonicalized SQL strings are identical; if true, deeper structural comparison can be skipped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clause_match&lt;/td&gt;
&lt;td&gt;Diagnostic detail&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Per-clause AST comparison: projection (SELECT), selection predicate (WHERE), post-aggregation predicate (HAVING), join graph and join type, aggregation grain (GROUP BY), ordering (ORDER BY), top-k (LIMIT/OFFSET), set operators (UNION/INTERSECT/EXCEPT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clause_weighted_distance&lt;/td&gt;
&lt;td&gt;Structural score&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Clause-weighted structural divergence; 0 = identical, higher = more divergent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Diagnostic Attribution Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;projection_error_score&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Estimated impact from differences in the SELECT projection list - missing, extra, or transformed attributes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;filter_error_score&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Estimated impact from predicate differences in WHERE (selection) and HAVING (post-aggregation filter)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;join_error_score&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Estimated impact from join graph topology, join predicate (ON condition), or join type (INNER / LEFT / RIGHT / FULL / CROSS / ANTI / SEMI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;grouping_error_score&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Estimated impact from incorrect aggregation grain in the GROUP BY clause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;aggregate_function_error_score&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Estimated impact from wrong aggregate function (SUM vs COUNT vs AVG vs MIN vs MAX vs STDDEV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ordering_error_flag&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True when mismatch is caused solely by tuple ordering; bag-semantics result sets are equal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;limit_topk_error_score&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Estimated impact from LIMIT, TOP, FETCH FIRST, OFFSET, or ROW_NUMBER top-k differences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cardinality_explosion_flag&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True when result set cardinality deviates significantly from expected due to join fanout or over-filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;per_column_mismatch_map&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Explanation layer - attribute-level mismatch rate and severity breakdown explaining row_overlap and cell_overlap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;confidence&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;0.0-1.0 confidence in the primary diagnostic attribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;evidence_strength&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;low / medium / high - based on corroborating signal count and quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ambiguous_case&lt;/td&gt;
&lt;td&gt;Diagnostic&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True when multiple clauses are plausible root causes with similar evidence weight&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Result Comparison Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;result_equality_family&lt;/td&gt;
&lt;td&gt;Equality family&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Single mode-driven equality family with comparison_mode (exact / order-insensitive / order-sensitive), mode_pass, and optional mode_details for diagnostics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema_match&lt;/td&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Relation schema comparison: attribute count, attribute names, data types, and optional attribute ordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;row_overlap&lt;/td&gt;
&lt;td&gt;Graded quality&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Tuple-level overlap expressed as Jaccard similarity or F1 score (precision + recall over shared tuples)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cell_overlap&lt;/td&gt;
&lt;td&gt;Graded quality&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Value-level (attribute x tuple) overlap after alignment; measures localized mismatch granularity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cardinality_match&lt;/td&gt;
&lt;td&gt;Cardinality&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Cardinality comparison - absolute row-count delta and relative cardinality ratio between result sets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;numeric_tolerance_match&lt;/td&gt;
&lt;td&gt;Tolerance&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Floating-point comparison using atol and rtol: match if |actual - expected| &amp;lt;= atol + rtol * |expected| for every aligned numeric cell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;null_handling_match&lt;/td&gt;
&lt;td&gt;NULL semantics&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if NULL placement and three-valued logic (3VL) behavior is consistent across result sets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Optional Performance Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;execution_time_ratio&lt;/td&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Derived: ratio of execution_time_actual_ms to execution_time_expected_ms; not independently stored in Validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;resource_cost_score&lt;/td&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Relative cost estimate from the query optimizer execution plan (EXPLAIN) if available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;plan_complexity_score&lt;/td&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Weighted count of expensive query plan operators: full scans, hash joins, sorts, nested loops&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Final Output Fields
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;deterministic_verdict&lt;/td&gt;
&lt;td&gt;Final output&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;Authoritative pass/fail derived from Layers 1-4 deterministic metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ai_advisory_verdict&lt;/td&gt;
&lt;td&gt;Final output&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;Advisory classification from the AI judge; never overrides deterministic_verdict&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;verdict_consistency_flag&lt;/td&gt;
&lt;td&gt;Final output&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True if both verdicts agree; False triggers a consistency warning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;blocked_reason&lt;/td&gt;
&lt;td&gt;Final output&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Present when downstream layers are blocked by an upstream failure; indicates the triggering precedence rule&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  AI Judge Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;intent_adequacy_judgment&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;LLM-as-judge assessment of whether actual_sql satisfies user_query intent as well as expected_sql&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;database_scoped_equivalence_judgment&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;LLM assessment of observed equivalence on this database instance - does not claim universal semantic equivalence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;acceptable_deviation_judgment&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;Grounded classification: harmless (ordering-only), acceptable (minor rounding), or critical (wrong data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;short_rationale&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;One-to-two sentence summary of the AI judge's reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;evidence_bullets&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;list&lt;/td&gt;
&lt;td&gt;Ordered list of key deterministic signals used in the AI judgment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decision_summary&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Final label and justification without internal reasoning traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;severity_classification&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;Severity triage label: pass / minor issue / moderate issue / major issue / critical failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;confidence&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;0.0-1.0 confidence in the AI judgment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;evidence_strength&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;label&lt;/td&gt;
&lt;td&gt;low / medium / high - based on quality of deterministic evidence available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ambiguous_case&lt;/td&gt;
&lt;td&gt;AI Judge&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;True when the AI judge cannot distinguish between two plausible classifications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>sql</category>
      <category>ai</category>
      <category>evaluation</category>
      <category>llm</category>
    </item>
    <item>
      <title>Build a Production‑Ready SQL Evaluation Engine for LLMs</title>
      <dc:creator>kasi viswanath vandanapu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:35:23 +0000</pubDate>
      <link>https://forem.com/kasi_viswanath/build-a-production-ready-sql-evaluation-engine-for-llms-3e9k</link>
      <guid>https://forem.com/kasi_viswanath/build-a-production-ready-sql-evaluation-engine-for-llms-3e9k</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;When I first started building a text‑to‑SQL system, the obvious thing was to run the generated query against a database and compare the result with a ground truth. That worked for a handful of examples, but as soon as we hit hundreds of user queries, the naive approach broke down: it was slow, brittle, and offered no insight into &lt;em&gt;why&lt;/em&gt; a query failed.&lt;/p&gt;

&lt;p&gt;What I needed was a two‑layer engine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fast deterministic checks&lt;/strong&gt; that catch the most common mistakes in under a second.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;AI judge&lt;/strong&gt; that digs deeper when those checks fail, tells you exactly what’s missing or wrong, and even spits out a corrected SQL snippet.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below is my complete, production‑ready framework (no storage, no UI). I’ll walk through the architecture, show you the core code, and explain how to plug it into your own pipeline. By the end, you’ll have a reusable tool that turns every LLM‑generated query into actionable feedback—perfect for continuous model improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why Two Layers?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Typical Cost&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deterministic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quick sanity checks (row count, column coverage, AST structure)&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;td&gt;&amp;lt; 0.5 s per query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Judge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep semantic review + actionable suggestions&lt;/td&gt;
&lt;td&gt;Medium (model token cost)&lt;/td&gt;
&lt;td&gt;1–3 s per query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The deterministic layer filters out the &lt;em&gt;obvious&lt;/em&gt; failures—missing joins, wrong aggregates, mismatched row counts—so we only pay for the expensive AI pass when something truly needs human‑level reasoning. This keeps overall costs low while still giving you rich diagnostics.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. High‑Level Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query + Expected SQL
          │
          ▼
[Deterministic Evaluator] → scores, verdict
          │
          ├─ If score ≥ 0.92 → return quick summary
          │
          ▼
[AI Judge] (LLM) → detailed JSON: missing elements, root cause, suggested fix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The evaluator returns a dictionary with metrics like &lt;code&gt;row_count_match&lt;/code&gt;, &lt;code&gt;column_coverage&lt;/code&gt;, and an overall weighted score. If the score is high enough we skip the AI step; otherwise we call the LLM with a carefully crafted prompt that forces it to output structured JSON.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Core Code
&lt;/h2&gt;

&lt;p&gt;Save this as &lt;strong&gt;&lt;code&gt;sql_eval_framework.py&lt;/code&gt;&lt;/strong&gt;. It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deterministic_evaluate()&lt;/code&gt; – the fast checks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build_judge_prompt()&lt;/code&gt; – template for the AI step.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluate_batch()&lt;/code&gt; – async batch runner with concurrency control.&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
# sql_eval_framework.py
import asyncio, json
from typing import Dict, Any, List
import pandas as pd
from sqlglot import parse_one, exp
from litellm import acompletion  # pip install litellm
from pydantic import BaseModel, Field

# ------------------------------------------------------------------
# 1️⃣ Deterministic Evaluator (80/20 + extras)
# ------------------------------------------------------------------
def deterministic_evaluate(
    expected_sql: str,
    llm_sql: str,
    expected_df: pd.DataFrame,
    llm_df: pd.DataFrame
) -&amp;gt; Dict[str, Any]:
    """
    Fast checks that run in &amp;lt; 0.5 s per query.
    Returns a dict with scores and a weighted overall score.
    """

    # --- Row count -------------------------------------------------
    row_match = len(expected_df) == len(llm_df)
    row_delta_pct = abs(len(expected_df) - len(llm_df)) / max(len(expected_df), 1)

    # --- Column coverage --------------------------------------------
    exp_cols, llm_cols = set(expected_df.columns), set(llm_df.columns)
    col_coverage = len(exp_cols &amp;amp; llm_cols) / len(exp_cols) if exp_cols else 0
    missing_cols = list(exp_cols - llm_cols)
    extra_cols   = list(llm_cols - exp_cols)

    # --- Result set match (order‑insensitive) -----------------------
    try:
        exp_sorted = expected_df.sort_values(by=list(expected_df.columns)).reset_index(drop=True)
        llm_sorted  = llm_df.sort_values(by=list(llm_df.columns)).reset_index(drop=True)
        result_exact = exp_sorted.equals(llm_sorted)
    except Exception:
        result_exact = False

    # --- Basic AST structural checks --------------------------------
    try:
        ast_exp = parse_one(expected_sql, read="postgres")
        ast_llm = parse_one(llm_sql, read="postgres")

        tables_match   = {t.this for t in ast_exp.find_all(exp.Table)} == \
                         {t.this for t in ast_llm.find_all(exp.Table)}
        where_exists   = bool(ast_exp.args.get("where")) == bool(ast_llm.args.get("where"))
        group_by_match = str(ast_exp.args.get("group")) == str(ast_llm.args.get("group"))
    except Exception:
        tables_match, where_exists, group_by_match = False, False, False

    # --- Weighted score ---------------------------------------------
    weights = {
        "row": 0.15,
        "col": 0.20,
        "result": 0.35,   # weights must sum to 1.0 so the 0.92 skip threshold is reachable
        "tables": 0.10,
        "where": 0.10,
        "group_by": 0.10
    }
    overall_score = (
        (1 if row_match else 0) * weights["row"] +
        col_coverage * weights["col"] +
        (1 if result_exact else 0) * weights["result"] +
        (1 if tables_match else 0) * weights["tables"] +
        (1 if where_exists else 0) * weights["where"] +
        (1 if group_by_match else 0) * weights["group_by"]
    )

    return {
        "row_count_match": row_match,
        "row_delta_pct": round(row_delta_pct, 3),
        "column_coverage": round(col_coverage, 2),
        "missing_columns": missing_cols,
        "extra_columns": extra_cols,
        "result_exact_match": result_exact,
        "tables_match": tables_match,
        "where_exists_match": where_exists,
        "group_by_match": group_by_match,
        "deterministic_score": round(overall_score, 2),
    }

# ------------------------------------------------------------------
# 2️⃣ AI Judge Prompt Builder
# ------------------------------------------------------------------
def build_judge_prompt(
    user_query: str,
    schema_ddl: str,
    expected_sql: str,
    llm_sql: str,
    exp_sample_csv: str,
    llm_sample_csv: str,
    exp_rows: int,
    llm_rows: int,
    exp_cols: List[str],
    llm_cols: List[str]
) -&amp;gt; str:
    """
    Returns a prompt that forces the LLM to output *only* JSON.
    The prompt includes enough context (schema, samples, row counts)
    so the model can reason about missing or extra elements.
    """

    return f"""
You are an expert SQL analyst and debugger with 15+ years experience.

Task: Deeply compare the **Expected SQL** (ground truth) with the **LLM-generated SQL**.
User natural‑language query: {user_query}

Database Schema (full DDL):
{schema_ddl}

Expected SQL (ground truth):
{expected_sql}

LLM-generated SQL:
{llm_sql}

Expected Result (first 10 rows as CSV):
{exp_sample_csv}

LLM Result (first 10 rows as CSV):
{llm_sample_csv}

Row count — Expected: {exp_rows} | LLM: {llm_rows}
Columns — Expected: {', '.join(exp_cols)} | LLM: {', '.join(llm_cols)}

Evaluate using this exact rubric. Think step‑by‑step.

1. **Semantic Correctness** (0-100): How well does the LLM SQL answer the user query compared to the expected SQL?
2. **What is MISSING?** List every important element present in Expected but absent/mis‑implemented in LLM SQL (filters, joins, aggregations, computed columns, groupings, ordering, etc.).
3. **What is EXTRA or WRONG?** List anything in LLM SQL that should not be there or is incorrect.
4. **Root Cause Hypothesis**: Why might the LLM have made these mistakes given the original prompt?
5. **Suggested Fix**: Provide a complete, corrected SQL that would pass all checks. Make minimal changes possible while fully matching the expected semantics.

Return ONLY valid JSON (no markdown):
{{
  "semantic_score": integer 0-100,
  "missing_elements": ["filter: status='ACTIVE'", "join on customer_id", ...],
  "extra_or_wrong_elements": ["unnecessary ORDER BY", ...],
  "root_cause": "string explanation",
  "suggested_improved_sql": "full corrected SQL here",
  "overall_verdict": "PASS | PARTIAL | FAIL",
  "explanation": "concise paragraph"
}}
"""

# ------------------------------------------------------------------
# 3️⃣ Batch Evaluator (async, concurrency control)
# ------------------------------------------------------------------
async def evaluate_batch(
    test_cases: List[Dict[str, Any]],
    max_parallel: int = 10,
    judge_model: str = "claude-3-5-haiku-20241022"
) -&amp;gt; List[Dict[str, Any]]:
    """
    test_cases is a list of dicts with keys:
        user_query, schema_ddl, expected_sql, llm_sql,
        expected_df, llm_df
    Returns each case enriched with deterministic scores and, if needed,
    the AI judge JSON.
    """

    semaphore = asyncio.Semaphore(max_parallel)

    async def process_one(case: Dict[str, Any]) -&amp;gt; Dict[str, Any]:
        # 1️⃣ Deterministic pass
        det_scores = deterministic_evaluate(
            case["expected_sql"],
            case["llm_sql"],
            case["expected_df"],
            case["llm_df"]
        )
        result = {**case, **det_scores}

        # 2️⃣ If high enough, skip AI judge
        if det_scores["deterministic_score"] &amp;gt;= 0.92:
            result["judge_review"] = None
            return result

        # 3️⃣ Build prompt and call LLM
        exp_sample_csv = case["expected_df"].head(10).to_csv(index=False)
        llm_sample_csv = case["llm_df"].head(10).to_csv(index=False)

        prompt = build_judge_prompt(
            user_query=case["user_query"],
            schema_ddl=case["schema_ddl"],
            expected_sql=case["expected_sql"],
            llm_sql=case["llm_sql"],
            exp_sample_csv=exp_sample_csv,
            llm_sample_csv=llm_sample_csv,
            exp_rows=len(case["expected_df"]),
            llm_rows=len(case["llm_df"]),
            exp_cols=list(case["expected_df"].columns),
            llm_cols=list(case["llm_df"].columns)
        )

        response = await acompletion(
            model=judge_model,
            messages=[{"role": "user", "content": prompt}]
        )
        # Parse the JSON safely
        try:
            review = json.loads(response.choices[0].message.content)
        except Exception as e:
            review = {"error": str(e), "raw_response": response.choices[0].message.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>sql</category>
      <category>llm</category>
      <category>evaluation</category>
      <category>python</category>
    </item>
    <item>
      <title>Streaming AI Agent with FastAPI &amp; LangGraph (2025‑26 Guide)</title>
      <dc:creator>kasi viswanath vandanapu</dc:creator>
      <pubDate>Mon, 16 Feb 2026 21:05:38 +0000</pubDate>
      <link>https://forem.com/kasi_viswanath/streaming-ai-agent-with-fastapi-langgraph-2025-26-guide-1nkn</link>
      <guid>https://forem.com/kasi_viswanath/streaming-ai-agent-with-fastapi-langgraph-2025-26-guide-1nkn</guid>
      <description>&lt;h1&gt;
  
  
  Streaming AI Agent with FastAPI &amp;amp; LangGraph (2025‑26 Guide)
&lt;/h1&gt;

&lt;p&gt;Real‑world AI workflows—data pipelines, planning bots, ETL orchestration—often leave users staring at a loading spinner for 10–60 seconds. In 2025‑26 the expectation is instant feedback: step transitions, progress bars, intermediate results, and especially human‑in‑the‑loop (HITL) pauses.&lt;br&gt;&lt;br&gt;
In this post I’ll walk you through a production‑ready stack that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts a user message &lt;strong&gt;plus an optional file upload&lt;/strong&gt; via POST&lt;/li&gt;
&lt;li&gt;Runs a LangGraph agent with planning → data‑model resolution → HITL review&lt;/li&gt;
&lt;li&gt;Streams every meaningful step and progress update to the frontend using Server‑Sent Events (SSE)&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;get_stream_writer()&lt;/code&gt; so any node can emit clean, structured progress messages without manual callbacks&lt;/li&gt;
&lt;li&gt;Persists sessions via &lt;code&gt;thread_id / session_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Comes with a minimal React front‑end that consumes the stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Version / Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;3.11+ (required for reliable async &lt;code&gt;get_stream_writer()&lt;/code&gt; propagation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;td&gt;0.115+ (built‑in streaming &amp;amp; file upload support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;0.2.70+ (latest as of Feb 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain ecosystem&lt;/td&gt;
&lt;td&gt;0.1.x+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;python-multipart&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install python-multipart&lt;/code&gt; (for &lt;code&gt;UploadFile&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;React&lt;/td&gt;
&lt;td&gt;18+ (TypeScript optional, shown with plain JS for brevity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uvicorn&lt;/td&gt;
&lt;td&gt;For running the FastAPI app locally&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Gotcha:&lt;/strong&gt; Use Python ≥ 3.11 to avoid &lt;code&gt;ContextVar&lt;/code&gt; issues when propagating the stream writer across async tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streaming-ai-agent/
├── app/
│   ├── __init__.py
│   ├── main.py               # FastAPI app entry point
│   ├── api/
│   │   └── router.py         # /api/stream endpoint
│   ├── graph/
│   │   ├── __init__.py
│   │   ├── state.py          # TypedDict for AgentState
│   │   ├── nodes/
│   │   │   ├── __init__.py
│   │   │   ├── plan_data_model.py
│   │   │   └── hitl_review.py
│   │   └── service.py        # Graph builder + streaming logic
│   └── dependencies.py       # (optional) shared checkpointer
├── frontend/                 # React app (Vite / CRA)
│   └── src/
│       └── App.jsx           # Streaming consumer example
├── requirements.txt
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 1: Define the State
&lt;/h2&gt;

&lt;p&gt;We’ll use a &lt;code&gt;TypedDict&lt;/code&gt; with &lt;code&gt;Annotated&lt;/code&gt; to let LangGraph know how to merge message lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/graph/state.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;data_model_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;resolved_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;human_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Create Nodes with Progress Emission
&lt;/h2&gt;

&lt;p&gt;Each node pulls the stream writer, emits a structured event (JSON‑serialisable &lt;code&gt;dict&lt;/code&gt;), and updates state. No real LLM calls are needed for this demo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/graph/nodes/plan_data_model.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_stream_writer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;..state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plan_data_model_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_stream_writer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_data_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyzing user request and file (if any)...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Simulated planning logic
&lt;/span&gt;    &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relationships&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user → order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order → product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CSV detected → adding transaction fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_data_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;progress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated initial data model plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_model_plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planning_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/graph/nodes/hitl_review.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_stream_writer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interrupt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;..state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hitl_review_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_stream_writer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;waiting_human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please review the proposed data model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_model_plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Raise an interrupt so the agent pauses until we resume with feedback
&lt;/span&gt;    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Code after this line runs only when resumed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Build &amp;amp; Stream the Graph (&lt;code&gt;service.py&lt;/code&gt;)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/graph/service.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemorySaver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.nodes.plan_data_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plan_data_model_node&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.nodes.hitl_review&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hitl_review_node&lt;/span&gt;

&lt;span class="n"&gt;checkpointer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemorySaver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Register nodes
&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_data_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan_data_model_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hitl_review_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define edges
&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_data_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_data_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# or loop back if resumed
&lt;/span&gt;
&lt;span class="n"&gt;compiled_graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_agent_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

    &lt;span class="n"&gt;initial_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;compiled_graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;stream_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;subgraphs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# When `subgraphs=True`, events are tuples (namespace, payload)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;

        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important gotchas&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;subgraphs=True&lt;/code&gt; → events become &lt;code&gt;(namespace, data)&lt;/code&gt; tuples.&lt;/li&gt;
&lt;li&gt;Combine modes (&lt;code&gt;["custom","updates","messages"]&lt;/code&gt;) to get progress events &lt;em&gt;and&lt;/em&gt; state updates.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;json.dumps()&lt;/code&gt; so the SSE payload is safe and parseable on the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: FastAPI Streaming Endpoint with File Upload
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/api/router.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;APIRouter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;..graph.service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stream_agent_process&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;APIRouter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Form&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;stream_agent_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;
            &lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep-alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Accel-Buffering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Important for nginx proxy
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add &lt;code&gt;router&lt;/code&gt; to your FastAPI app in &lt;code&gt;main.py&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do not set&lt;/strong&gt; a &lt;code&gt;Content-Type&lt;/code&gt; header manually on the frontend fetch: the browser must generate the &lt;code&gt;multipart/form-data&lt;/code&gt; boundary for the &lt;code&gt;FormData&lt;/code&gt; body. The &lt;em&gt;response&lt;/em&gt; is what carries &lt;code&gt;text/event-stream&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 5: React Frontend – Consume the Stream
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
// frontend/src/App.jsx
import { useState } from "react";

function App() {
  const [logs, setLogs] = useState([]);
  const [status, setStatus] = useState("Ready");
  const sessionId = crypto.randomUUID();

  const handleSubmit = async (e) =&amp;gt; {
    e.preventDefault();
    const form = e.target;
    const message = form.message.value;
    const fileInput = form.file.files?.[0];

    const formData = new FormData();
    formData.append("user_message", message);
    formData.append("session_id", sessionId);
    if (fileInput) formData.append("file", fileInput);

    const response = await fetch("/api/stream", {
      method: "POST",
      body: formData
    });

    if (!response.body) return;

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      const chunk = decoder.decode(value);
      const lines = chunk.split("\n");

      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        try {
          const data = JSON.parse(line.slice(6));
          const { mode, payload } = data;

          if (mode === "custom") {
            switch (payload.type) {
              case "step_start":
                setLogs((l) =&amp;gt; [...l, `→ Starting: ${payload.step}`]);
                break;
              case "step_update":
                setLogs((l) =&amp;gt; [
                  ...l,
                  `  Update: ${payload.message} (${Math.round(payload.progress * 100)}%)`
                ]);
                break;
              case "waiting_human":
                setStatus("Waiting for your review…");
                setLogs((l) =&amp;gt; [...l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>fastapi</category>
      <category>langgraph</category>
      <category>streaming</category>
      <category>react</category>
    </item>
  </channel>
</rss>
