DEV Community

Titouan Launay
Practical AI: Building a Robust Data Correction System with JSON Schema and LLMs

TL;DR

We built a system that combines JSON Schema validation with LLMs to automatically fix malformed data. It's more powerful than regex-based fixes, more reliable than pure AI approaches, and saves our team countless hours of manual data cleanup.

The Problem: Data is Messy

At Swiftgum, we process large volumes of real estate contract data. Much of this comes from OCR or third-party integrations, and it's rarely perfect:

  • Number fields containing text: "annual_rent": "25,000€ excl. tax"
  • Dates in various formats: "effective_date": "first of January, 2023"
  • Domain-specific notation: "duration": "3+6+9" (a common commercial lease structure)

Standard validation would reject these values outright. Manual correction takes hours. So we built something better.
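Before reaching for an LLM, it's worth seeing why naive coercion falls short on values like these. This is a quick illustration (not from the original pipeline):

```typescript
// Naive coercion of the OCR values above: none of it recovers clean data.
console.log(Number("25 000€ excl. tax")); // NaN — no number recovered
console.log(Number("3+6+9"));             // NaN — the lease notation is opaque
console.log(parseInt("3+6+9", 10));       // 3 — correct here, but only by accident
// And "first of January 2023" matches no fixed date format a regex could target.
```

A regex could be written for each individual pattern, but the long tail of OCR variations makes that a losing battle.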

The Architecture: Validation + AI

Here's our solution in pseudocode:

async function fixValueWithAI<T>(
  schema: JSONSchema,
  value: unknown,
  options?: {
    maxAttempts?: number;
    model?: string;
  }
): Promise<T> {
  // Try validating first - maybe it's already valid
  const validationResult = safeValidate(schema, value);
  if (validationResult.valid) {
    return validationResult.value as T;
  }

  // Not valid, let's try AI correction
  const attempts = options?.maxAttempts ?? 2;
  let currentValue = value;
  let currentErrors = validationResult.error;

  for (let i = 0; i < attempts; i++) {
    try {
      // Prepare instructions for the LLM
      const prompt = buildCorrectionPrompt({
        schema: stripInternalProperties(schema),
        originalValue: currentValue,
        errors: currentErrors,
      });

      // Get AI correction
      const aiOutput = await askAI(prompt, options?.model ?? "gpt-4o-mini");

      // Check if AI signals it cannot fix
      const escapeHatchCheck = EscapeHatchSchema.safeParse(aiOutput);
      if (escapeHatchCheck.success) {
        throw new Error("AI indicates it cannot fix the value");
      }

      // Validate AI's proposed fix
      const newValidationResult = safeValidate(schema, aiOutput);
      if (newValidationResult.valid) {
        return newValidationResult.value as T;
      }

      // Still not valid: feed the partially corrected value and its
      // remaining errors into the next attempt
      currentValue = aiOutput;
      currentErrors = newValidationResult.error;
    } catch (err) {
      // Log the error but continue to next attempt
      logger.warn(`AI correction attempt ${i + 1} failed`, { error: err });
    }
  }

  // All attempts failed
  throw new ValidationError("Could not correct value after multiple attempts");
}
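The `EscapeHatchSchema` referenced above is a Zod schema in our stack; as a dependency-free sketch, the same check can be written as a plain type guard (the exact production definition may differ):

```typescript
// Matches only the sentinel shape { cannotFix: true } the prompt asks for.
type EscapeHatch = { cannotFix: true };

function isEscapeHatch(value: unknown): value is EscapeHatch {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as Record<string, unknown>).cannotFix === true
  );
}

console.log(isEscapeHatch({ cannotFix: true })); // true
console.log(isEscapeHatch({ tenant: "ACME" }));  // false — a real correction attempt
```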

The Secret Sauce: Crafting Effective Prompts

The prompt we send to the LLM is crucial. Here's the template we use:

function buildCorrectionPrompt({ schema, originalValue, errors }) {
  return `
You are a data correction expert. Your task is to fix a JSON value that fails validation.

THE JSON SCHEMA:
${JSON.stringify(schema, null, 2)}

THE ORIGINAL VALUE:
${JSON.stringify(originalValue, null, 2)}

VALIDATION ERRORS:
${JSON.stringify(errors, null, 2)}

INSTRUCTIONS:
1. Analyze the schema requirements and validation errors
2. Transform the original value to make it conform to the schema
3. ONLY fix what's wrong - preserve all other data
4. DO NOT invent values if you don't have enough information
5. If you cannot fix the value without guessing, respond with {"cannotFix": true}
6. Respond ONLY with the fixed JSON object or the cannotFix object

FIXED VALUE:
`;
}

A few key points that make this effective:

  1. Simplified schema: We strip internal properties to focus the LLM on the relevant parts
  2. Clear validation errors: We transform AJV errors into a more readable format
  3. Escape hatch: The {"cannotFix": true} option prevents wild guesses
  4. Multiple attempts: Each correction attempt builds on the previous one
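To make point 1 concrete, here is a plausible sketch of `stripInternalProperties` — assuming internal annotations are marked with an `x-` prefix, which is a convention we picked for illustration, not necessarily the production rule:

```typescript
// Recursively drop keys we treat as internal (here, anything prefixed "x-")
// so the prompt contains only schema parts relevant to the LLM.
function stripInternalProperties(schema: unknown): unknown {
  if (Array.isArray(schema)) return schema.map(stripInternalProperties);
  if (typeof schema === "object" && schema !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(schema as Record<string, unknown>)) {
      if (key.startsWith("x-")) continue; // internal annotation, not for the LLM
      out[key] = stripInternalProperties(value);
    }
    return out;
  }
  return schema;
}

const cleaned = stripInternalProperties({
  type: "number",
  "x-db-column": "rent_cents", // hypothetical internal annotation
  description: "Annual rent in euros",
});
console.log(cleaned); // { type: "number", description: "Annual rent in euros" }
```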

Real-World Example

Here's a real example from our production system:

Schema:

{
  "type": "object",
  "properties": {
    "tenant": {
      "type": "string",
      "description": "Name of the tenant company"
    },
    "annual_rent": {
      "type": "number",
      "minimum": 0,
      "description": "Annual rent in euros"
    },
    "effective_date": {
      "type": "string",
      "format": "date",
      "description": "Start date of the lease"
    },
    "duration": {
      "type": "integer",
      "minimum": 1,
      "description": "Duration of the lease in years"
    }
  },
  "required": ["tenant", "annual_rent", "effective_date"]
}

Original value (from OCR):

{
  "tenant": "SCI Les Oliviers",
  "annual_rent": "25 000€ excl. tax",
  "effective_date": "first of January 2023",
  "duration": "3+6+9"
}

Validation errors:

[
  {
    "path": "/annual_rent",
    "message": "must be number, found: string",
    "value": "25 000€ excl. tax"
  },
  {
    "path": "/effective_date",
    "message": "invalid date format",
    "value": "first of January 2023"
  },
  {
    "path": "/duration",
    "message": "must be integer, found: string",
    "value": "3+6+9"
  }
]

AI-corrected value:

{
  "tenant": "SCI Les Oliviers",
  "annual_rent": 25000,
  "effective_date": "2023-01-01",
  "duration": 3
}

Note how the AI properly:

  • Extracted the numeric value from the rent string
  • Converted the textual date to ISO format
  • Used the first number from the commercial lease notation

Implementation Details

Our tech stack:

  • OpenAI models (currently gpt-4o-mini-2024-07-18)
  • Vercel AI SDK for streamlined LLM integration
  • AJV for JSON Schema validation
  • Zod for TypeScript-native validation

Here's a simplified version of our validation wrapper:

import type { ErrorObject } from "ajv";

function safeValidate(schema: JSONSchema, value: unknown): ValidationResult {
  try {
    // Use AJV for validation
    const valid = ajv.validate(schema, value);

    if (valid) {
      return { valid: true, value };
    } else {
      return {
        valid: false,
        error: transformAjvErrors(ajv.errors ?? [], value),
      };
    }
  } catch (err) {
    // Handle errors in the validation process itself
    return {
      valid: false,
      error: [
        {
          path: "",
          message: "Validation process failed",
          value,
        },
      ],
    };
  }
}

// Transform AJV's verbose errors into a more concise format. The original
// value is passed in so each error can report the offending data at its path.
function transformAjvErrors(
  errors: ErrorObject[],
  originalValue: unknown
): ErrorDetail[] {
  return errors.map((err) => ({
    path: err.instancePath,
    message: err.message ?? "validation failed",
    value: getValueAtPath(originalValue, err.instancePath),
  }));
}
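The `getValueAtPath` helper used above isn't shown in the post; here is a plausible implementation, assuming AJV's `instancePath` follows JSON Pointer syntax (RFC 6901), which it does by default:

```typescript
// Walks a JSON Pointer path like "/annual_rent" or "/terms/duration"
// and returns the value it addresses, or undefined if the path is absent.
function getValueAtPath(root: unknown, pointer: string): unknown {
  if (pointer === "") return root; // "" addresses the whole document
  let current: any = root;
  for (const rawSegment of pointer.split("/").slice(1)) {
    // JSON Pointer escapes: "~1" means "/", "~0" means "~"
    const segment = rawSegment.replace(/~1/g, "/").replace(/~0/g, "~");
    if (current == null) return undefined;
    current = current[segment];
  }
  return current;
}

const doc = { annual_rent: "25 000€ excl. tax", terms: { duration: "3+6+9" } };
console.log(getValueAtPath(doc, "/annual_rent"));    // "25 000€ excl. tax"
console.log(getValueAtPath(doc, "/terms/duration")); // "3+6+9"
```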

Key Learnings

After implementing this in production, we discovered:

  1. LLMs understand JSON Schema natively - they're trained on enough examples to grasp the semantics well

  2. The escape hatch is crucial - {"cannotFix": true} prevents hallucinated data when correction is impossible

  3. Multiple attempts improve success rates - The first pass might fix 2/3 errors, then the second pass can address the remainder

  4. Error costs guide implementation - In our domain, false corrections are more costly than failed corrections, so we err on the side of caution

  5. Prompt design is critical - Clear instructions, simplified schema, and structured error details all improve correction quality

Results

After deploying this system:

  • 85% reduction in manual data corrections
  • 99.2% accuracy on the corrections made automatically
  • ~3 seconds average processing time per correction
  • Successful handling of complex real estate-specific formats

When to Use This Pattern

This approach shines when:

  1. You have well-defined data schemas
  2. Manual correction is expensive or time-consuming
  3. Data errors follow patterns but aren't simple enough for regex
  4. The cost of incorrect data is high

Conclusion

Combining JSON Schema validation with LLM-based correction gives you the best of both worlds: the reliability of strict validation with the flexibility of AI. It's a pattern we've found incredibly useful for maintaining data quality while reducing manual work.

The code shown is simplified but captures the core concepts. Feel free to adapt it to your own validation system!


What data quality challenges is your team facing? I'd love to hear about your approaches in the comments!


This article is a developer port of "How we use AI to correct data at Swiftgum" ("comment nous utilisons l'IA pour corriger les données chez Swiftgum"), originally published in French on the Swiftgum blog.


Top comments (8)

Michael Liang

Nice post for AI developers!
Which platform do you think is better, chatgpt or grok?

Titouan Launay

Based on our experience, Gemini provides the most hallucination-free answers at the lowest cost. 4o-mini can also be used for small coercions, but since we are talking about large prompts, its limited context recall hinders efficiency.

Have yet to try grok on these use cases, but we generally use it for higher temperature/creative thoughts so it may get a little bit off track here. Let us know if you do some benchmarking!

Michael Liang

Thanks

Nevo David

Pretty cool seeing real numbers behind an actual cleanup system – I wish I didn't spend so much time doing this stuff manually myself honestly.

Titouan Launay

I guess having a good manual validation dataset is always a plus, especially since fine-tuning is getting easier and easier. There is, however, a catch in building these datasets: there need to be strict rules about what to "correct" versus what to "infer".

In our case of building document processing pipelines, the 2-step process (extraction & correction) needs to be executed in isolation. If we create validation data for the correction step by proofreading the files, we effectively hinder the results and are likelier to have hallucinations.

It all depends on how your process is structured!

Dotallio

Love how you combined strict validation with smart prompt design for the LLM corrections - did you run into any edge cases where the AI struggled even after multiple attempts?

Titouan Launay

There are always going to be edge cases, but there are multiple ways forward depending on the use case. One main catch is trying to correct longer JSON objects: the schema itself is quite lengthy, and the values can get super long.

The low-hanging fruit for this kind of optimisation is to split the object and schema, but a naïve approach can give quite bad results. The following JSON Schema and JSON object are badly structured, but they illustrate the point well:

"vehicle": {
  "anyOf": [
    { "type": "object", "properties": { "type": { "const": "Car" }, "numberOfSeats": { "maximum": 5 } } },
    { "type": "object", "properties": { "type": { "const": "Bus" }, "numberOfSeats": { "minimum": 6 } } }
  ]
}

Incoming data is:

{ "type": "Car", "numberOfSeats": 12 }

The obvious correction here is to switch type "Car" to type "Bus"; however, if we only focus on the smallest object and its corresponding schema, we are locked into "Car".

There are heuristics that may be important to your use case regarding this. For us, we process each top-level object field separately as they are closely coupled to the way we define extraction schemas and represent a "self-sustaining" unit of data.
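The pitfall above can be sketched without a validator, writing each simplified anyOf branch as a predicate (this is an illustration of the reply's point, not production code):

```typescript
type Vehicle = { type: string; numberOfSeats: number };

// Each predicate mirrors one branch of the anyOf above.
const matchesCar = (v: Vehicle) => v.type === "Car" && v.numberOfSeats <= 5;
const matchesBus = (v: Vehicle) => v.type === "Bus" && v.numberOfSeats >= 6;

const incoming: Vehicle = { type: "Car", numberOfSeats: 12 };

// Neither branch matches the incoming value as-is.
console.log(matchesCar(incoming) || matchesBus(incoming)); // false

// The right fix changes the discriminator, not the seat count —
// visible only when the whole anyOf is considered together.
console.log(matchesBus({ ...incoming, type: "Bus" })); // true
```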

Nathan Tarbert

Pretty cool honestly - been deep in this kind of mess myself and it's always nice to see someone actually push for better solutions instead of just more manual work.
