TL;DR
We built a system that combines JSON Schema validation with LLMs to automatically fix malformed data. It's more powerful than regex-based fixes, more reliable than pure AI approaches, and saves our team countless hours of manual data cleanup.
The Problem: Data is Messy
At Swiftgum, we process large volumes of real estate contract data. Much of this comes from OCR or third-party integrations, and it's rarely perfect:
- Number fields containing text: `"annual_rent": "25,000€ excl. tax"`
- Dates in various formats: `"effective_date": "first of January, 2023"`
- Domain-specific notation: `"duration": "3+6+9"` (a common commercial lease structure)
Standard validation would reject these values outright. Manual correction takes hours. So we built something better.
The Architecture: Validation + AI
Here's our solution in pseudocode:
```typescript
async function fixValueWithAI<T>(
  schema: JSONSchema,
  value: unknown,
  options?: {
    maxAttempts?: number;
    model?: string;
  }
): Promise<T> {
  // Try validating first - maybe it's already valid
  let validationResult = safeValidate(schema, value);
  if (validationResult.valid) {
    return validationResult.value as T;
  }

  // Not valid, so try AI correction
  const attempts = options?.maxAttempts ?? 2;
  let currentValue = value;

  for (let i = 0; i < attempts; i++) {
    try {
      // Prepare instructions for the LLM, using the errors
      // from the most recent validation pass
      const prompt = buildCorrectionPrompt({
        schema: stripInternalProperties(schema),
        originalValue: currentValue,
        errors: validationResult.error,
      });

      // Get the AI's proposed correction
      const aiOutput = await askAI(prompt, options?.model ?? "gpt-4o-mini");

      // Check whether the AI signals it cannot fix the value
      const escapeHatchCheck = EscapeHatchSchema.safeParse(aiOutput);
      if (escapeHatchCheck.success) {
        throw new Error("AI indicates it cannot fix the value");
      }

      // Validate the AI's proposed fix
      validationResult = safeValidate(schema, aiOutput);
      if (validationResult.valid) {
        return validationResult.value as T;
      }

      // Still not valid; use this output as the starting point
      // for the next attempt
      currentValue = aiOutput;
    } catch (err) {
      // Log the error but continue to the next attempt
      logger.warn(`AI correction attempt ${i + 1} failed`, { error: err });
    }
  }

  // All attempts failed
  throw new ValidationError("Could not correct value after multiple attempts");
}
```
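To make the retry flow concrete, here is a toy, self-contained version of the same validate-correct-revalidate loop. The LLM call is replaced by a deterministic stub: `fixNumber` stands in for `askAI`, and the "schema" is simply "must be a number" (all names here are illustrative, not part of our production code):

```typescript
// A toy demo of the validate -> correct -> re-validate loop above,
// with the LLM call stubbed out by a deterministic function.
type Result = { valid: true; value: number } | { valid: false; error: string };

function validateNumber(value: unknown): Result {
  return typeof value === "number"
    ? { valid: true, value }
    : { valid: false, error: "must be number" };
}

// Stub "correction": strip everything that isn't a digit and parse
function fixNumber(value: unknown): unknown {
  const digits = String(value).replace(/[^\d]/g, "");
  return digits === "" ? value : Number(digits);
}

function fixWithRetries(value: unknown, maxAttempts = 2): number {
  let current = value;
  for (let i = 0; i < maxAttempts; i++) {
    const result = validateNumber(current);
    if (result.valid) return result.value;
    current = fixNumber(current); // use the "fix" as the next starting point
  }
  const final = validateNumber(current);
  if (final.valid) return final.value;
  throw new Error("Could not correct value after multiple attempts");
}

fixWithRetries("25 000€ excl. tax"); // → 25000
```

The same shape carries over to the real implementation: validate, attempt a correction, re-validate, and only throw once all attempts are exhausted.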
The Secret Sauce: Crafting Effective Prompts
The prompt we send to the LLM is crucial. Here's the template we use:
```typescript
function buildCorrectionPrompt({ schema, originalValue, errors }) {
  return `
You are a data correction expert. Your task is to fix a JSON value that fails validation.

THE JSON SCHEMA:
${JSON.stringify(schema, null, 2)}

THE ORIGINAL VALUE:
${JSON.stringify(originalValue, null, 2)}

VALIDATION ERRORS:
${JSON.stringify(errors, null, 2)}

INSTRUCTIONS:
1. Analyze the schema requirements and validation errors
2. Transform the original value to make it conform to the schema
3. ONLY fix what's wrong - preserve all other data
4. DO NOT invent values if you don't have enough information
5. If you cannot fix the value without guessing, respond with {"cannotFix": true}
6. Respond ONLY with the fixed JSON object or the cannotFix object

FIXED VALUE:
`;
}
```
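The schema passed into this template has already been through `stripInternalProperties`. A sketch of what that helper might do is below — which keys count as "internal" is our assumption for illustration, not Swiftgum's actual list:

```typescript
// Recursively drop keys the LLM doesn't need: a hypothetical set of
// bookkeeping fields plus any "x-" extension keys. Arrays are passed
// through untouched in this sketch.
function stripInternalProperties(
  schema: Record<string, unknown>
): Record<string, unknown> {
  const INTERNAL = new Set(["$id", "$comment"]); // assumed internal keys
  const out: Record<string, unknown> = {};
  for (const [key, val] of Object.entries(schema)) {
    if (INTERNAL.has(key) || key.startsWith("x-")) continue;
    out[key] =
      val !== null && typeof val === "object" && !Array.isArray(val)
        ? stripInternalProperties(val as Record<string, unknown>)
        : val;
  }
  return out;
}
```

A smaller schema means fewer tokens and less for the model to get distracted by.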
A few key points that make this effective:
- Simplified schema: We strip internal properties to focus the LLM on the relevant parts
- Clear validation errors: We transform AJV errors into a more readable format
- Escape hatch: The `{"cannotFix": true}` option prevents wild guesses
- Multiple attempts: Each correction attempt builds on the previous one
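The escape hatch check can be very small. Our production code uses a Zod schema (`EscapeHatchSchema.safeParse`), but a dependency-free type guard shows the idea — this is a sketch, not our exact implementation:

```typescript
// Detect the escape-hatch object the LLM returns when it cannot
// fix a value without guessing.
interface EscapeHatch {
  cannotFix: true;
}

function isEscapeHatch(output: unknown): output is EscapeHatch {
  return (
    typeof output === "object" &&
    output !== null &&
    (output as Record<string, unknown>).cannotFix === true &&
    // A genuine corrected object carries other fields; the escape
    // hatch is exactly {"cannotFix": true} and nothing else
    Object.keys(output).length === 1
  );
}
```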
Real-World Example
Here's a real example from our production system:
Schema:
```json
{
  "type": "object",
  "properties": {
    "tenant": {
      "type": "string",
      "description": "Name of the tenant company"
    },
    "annual_rent": {
      "type": "number",
      "minimum": 0,
      "description": "Annual rent in euros"
    },
    "effective_date": {
      "type": "string",
      "format": "date",
      "description": "Start date of the lease"
    },
    "duration": {
      "type": "integer",
      "minimum": 1,
      "description": "Duration of the lease in years"
    }
  },
  "required": ["tenant", "annual_rent", "effective_date"]
}
```
Original value (from OCR):
```json
{
  "tenant": "SCI Les Oliviers",
  "annual_rent": "25 000€ excl. tax",
  "effective_date": "first of January 2023",
  "duration": "3+6+9"
}
```
Validation errors:
```json
[
  {
    "path": "/annual_rent",
    "message": "must be number, found: string",
    "value": "25 000€ excl. tax"
  },
  {
    "path": "/effective_date",
    "message": "invalid date format",
    "value": "first of January 2023"
  },
  {
    "path": "/duration",
    "message": "must be integer, found: string",
    "value": "3+6+9"
  }
]
```
AI-corrected value:
```json
{
  "tenant": "SCI Les Oliviers",
  "annual_rent": 25000,
  "effective_date": "2023-01-01",
  "duration": 3
}
```
Note how the AI properly:
- Extracted the numeric value from the rent string
- Converted the textual date to ISO format
- Used the first number from the commercial lease notation
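A note on the date fix: JSON Schema's `format: "date"` expects an RFC 3339 full-date (`YYYY-MM-DD`). A rough sketch of such a check, similar in spirit to what the `ajv-formats` plugin performs, clarifies what the AI's `"2023-01-01"` output has to satisfy:

```typescript
// Rough full-date check: shape via regex, then calendar validity
// via the Date round-trip (rejects e.g. "2023-02-30").
function isFullDate(s: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(s)) return false;
  const d = new Date(`${s}T00:00:00Z`);
  return !isNaN(d.getTime()) && d.toISOString().startsWith(s);
}

isFullDate("2023-01-01"); // true
isFullDate("first of January 2023"); // false
```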
Implementation Details
Our tech stack:
- OpenAI models (currently gpt-4o-mini-2024-07-18)
- Vercel AI SDK for streamlined LLM integration
- AJV for JSON Schema validation
- Zod for TypeScript-native validation
Here's a simplified version of our validation wrapper:
```typescript
function safeValidate(schema: JSONSchema, value: unknown): ValidationResult {
  try {
    // Use AJV for validation
    const valid = ajv.validate(schema, value);
    if (valid) {
      return { valid: true, value };
    }
    return {
      valid: false,
      error: transformAjvErrors(ajv.errors ?? [], value),
    };
  } catch (err) {
    // Handle errors in the validation process itself
    return {
      valid: false,
      error: [
        {
          path: "",
          message: "Validation process failed",
          value,
        },
      ],
    };
  }
}

// Transform AJV's verbose errors into a more concise format.
// The original value is passed in so we can attach the offending
// data at each error path.
function transformAjvErrors(
  errors: Ajv.ErrorObject[],
  originalValue: unknown
): ErrorDetail[] {
  return errors.map((err) => ({
    path: err.instancePath,
    message: err.message ?? "unknown error",
    value: getValueAtPath(originalValue, err.instancePath),
  }));
}
```
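`getValueAtPath` is referenced above but not shown. A minimal sketch that walks AJV's JSON Pointer-style `instancePath` (with key unescaping per RFC 6901) might look like this:

```typescript
// Resolve a JSON Pointer such as "/annual_rent" or "/items/0/name"
// against a value. An empty pointer refers to the value itself.
function getValueAtPath(value: unknown, path: string): unknown {
  if (path === "") return value;
  return path
    .split("/")
    .slice(1) // drop the empty segment before the leading "/"
    .map((seg) => seg.replace(/~1/g, "/").replace(/~0/g, "~")) // RFC 6901 unescaping
    .reduce<unknown>((acc, seg) => {
      if (acc === null || typeof acc !== "object") return undefined;
      return (acc as Record<string, unknown>)[seg];
    }, value);
}
```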
Key Learnings
After implementing this in production, we discovered:
- LLMs understand JSON Schema natively - they're trained on enough examples to grasp the semantics well
- The escape hatch is crucial - `{"cannotFix": true}` prevents hallucinated data when correction is impossible
- Multiple attempts improve success rates - The first pass might fix 2/3 of the errors, and the second pass can address the remainder
- Error costs guide implementation - In our domain, false corrections are more costly than failed corrections, so we err on the side of caution
- Prompt design is critical - Clear instructions, a simplified schema, and structured error details all improve correction quality
Results
After deploying this system:
- 85% reduction in manual data corrections
- 99.2% accuracy on the corrections made automatically
- ~3 seconds average processing time per correction
- Successful handling of complex real estate-specific formats
When to Use This Pattern
This approach shines when:
- You have well-defined data schemas
- Manual correction is expensive or time-consuming
- Data errors follow patterns but aren't simple enough for regex
- The cost of incorrect data is high
Conclusion
Combining JSON Schema validation with LLM-based correction gives you the best of both worlds: the reliability of strict validation with the flexibility of AI. It's a pattern we've found incredibly useful for maintaining data quality while reducing manual work.
The code shown is simplified but captures the core concepts. Feel free to adapt it to your own validation system!
What data quality challenges is your team facing? I'd love to hear about your approaches in the comments!
This article is a developer port of "Comment nous utilisons l'IA pour corriger les données chez Swiftgum" ("How we use AI to correct data at Swiftgum"), initially published on the Swiftgum blog.
Top comments (8)
Nice post for AI developers!
Which platform do you think is better, ChatGPT or Grok?
Based on our experience, Gemini provides the most hallucination-free answers at a lower cost. 4o-mini can also be used for small coercions, but since we are talking about large prompts, its limited context recall hinders efficiency.
Have yet to try grok on these use cases, but we generally use it for higher temperature/creative thoughts so it may get a little bit off track here. Let us know if you do some benchmarking!
Thanks
Pretty cool seeing real numbers behind an actual cleanup system – I wish I didn't spend so much time doing this stuff manually myself honestly.
I guess having a good manual validation dataset is always a plus, especially since fine-tuning is getting easier and easier. There is, however, a catch in building these datasets, as there need to be strict rules about what to "correct" versus what to "infer".
In our case of building document processing pipelines, the 2-step process (extraction & correction) needs to be executed in isolation. If we create validation data for the correction step by proofreading the files, we effectively hinder the results and are likelier to have hallucinations.
It all depends on how your process is structured!
Love how you combined strict validation with smart prompt design for the LLM corrections - did you run into any edge cases where the AI struggled even after multiple attempts?
There are always going to be edge cases, but there are multiple ways forward depending on the use case. One main catch is trying to correct longer JSON objects: the schema itself is quite lengthy, and the values can get super long.
The low hanging fruit for this kind of optimisation is to split the object and schema, but a naïve approach can give quite bad results. The following JSON schema and JSON objects are quite badly structured but they illustrate the point well:
Incoming data is:
The obvious correction here is to switch type "Car" to type "Bus", however if we only focused on the smallest object and its corresponding schema, we are locked into "Car".
There are heuristics that may be important to your use case regarding this. For us, we process each top-level object field separately as they are closely coupled to the way we define extraction schemas and represent a "self-sustaining" unit of data.
Pretty cool honestly - been deep in this kind of mess myself and it's always nice to see someone actually push for better solutions instead of just more manual work.