TL;DR
We built a system that combines JSON Schema validation with LLMs to automatically fix malformed data. It's more powerful than regex-based fixes, more reliable than pure AI approaches, and saves our team countless hours of manual data cleanup.
The Problem: Data is Messy
At Swiftgum, we process large volumes of real estate contract data. Much of this comes from OCR or third-party integrations, and it's rarely perfect:
- Number fields containing text: `"annual_rent": "25,000€ excl. tax"`
- Dates in various formats: `"effective_date": "first of January, 2023"`
- Domain-specific notation: `"duration": "3+6+9"` (a common commercial lease structure)
Standard validation would reject these values outright. Manual correction takes hours. So we built something better.
The Architecture: Validation + AI
Here's our solution in pseudocode:
```typescript
async function fixValueWithAI<T>(
  schema: JSONSchema,
  value: unknown,
  options?: {
    maxAttempts?: number;
    model?: string;
  }
): Promise<T> {
  // Try validating first - maybe it's already valid
  let validationResult = safeValidate(schema, value);
  if (validationResult.valid) {
    return validationResult.value as T;
  }

  // Not valid, so try AI correction
  const attempts = options?.maxAttempts ?? 2;
  let currentValue = value;

  for (let i = 0; i < attempts; i++) {
    try {
      // Prepare instructions for the LLM, using the errors
      // from the most recent validation pass
      const prompt = buildCorrectionPrompt({
        schema: stripInternalProperties(schema),
        originalValue: currentValue,
        errors: validationResult.error,
      });

      // Get the AI's proposed correction
      const aiOutput = await askAI(prompt, options?.model ?? "gpt-4o-mini");

      // Check whether the AI signals it cannot fix the value
      const escapeHatchCheck = EscapeHatchSchema.safeParse(aiOutput);
      if (escapeHatchCheck.success) {
        throw new Error("AI indicates it cannot fix the value");
      }

      // Validate the AI's proposed fix
      validationResult = safeValidate(schema, aiOutput);
      if (validationResult.valid) {
        return validationResult.value as T;
      }

      // Still not valid; use this output as the starting point
      // for the next attempt
      currentValue = aiOutput;
    } catch (err) {
      // Log the error but continue to the next attempt
      logger.warn(`AI correction attempt ${i + 1} failed`, { error: err });
    }
  }

  // All attempts failed
  throw new ValidationError("Could not correct value after multiple attempts");
}
```
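To make the retry flow concrete, here is a toy, self-contained version of the same validate-correct-revalidate loop. The LLM call is replaced by a deterministic stub: `fixNumber` stands in for `askAI`, and the "schema" is simply "must be a number" (all names here are illustrative, not part of our production code):

```typescript
// A toy demo of the validate -> correct -> re-validate loop above,
// with the LLM call stubbed out by a deterministic function.
type Result = { valid: true; value: number } | { valid: false; error: string };

function validateNumber(value: unknown): Result {
  return typeof value === "number"
    ? { valid: true, value }
    : { valid: false, error: "must be number" };
}

// Stub "correction": strip everything that isn't a digit and parse
function fixNumber(value: unknown): unknown {
  const digits = String(value).replace(/[^\d]/g, "");
  return digits === "" ? value : Number(digits);
}

function fixWithRetries(value: unknown, maxAttempts = 2): number {
  let current = value;
  for (let i = 0; i < maxAttempts; i++) {
    const result = validateNumber(current);
    if (result.valid) return result.value;
    current = fixNumber(current); // use the "fix" as the next starting point
  }
  const final = validateNumber(current);
  if (final.valid) return final.value;
  throw new Error("Could not correct value after multiple attempts");
}

fixWithRetries("25 000€ excl. tax"); // → 25000
```

The same shape carries over to the real implementation: validate, attempt a correction, re-validate, and only throw once all attempts are exhausted.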
The Secret Sauce: Crafting Effective Prompts
The prompt we send to the LLM is crucial. Here's the template we use:
```typescript
function buildCorrectionPrompt({ schema, originalValue, errors }) {
  return `
You are a data correction expert. Your task is to fix a JSON value that fails validation.

THE JSON SCHEMA:
${JSON.stringify(schema, null, 2)}

THE ORIGINAL VALUE:
${JSON.stringify(originalValue, null, 2)}

VALIDATION ERRORS:
${JSON.stringify(errors, null, 2)}

INSTRUCTIONS:
1. Analyze the schema requirements and validation errors
2. Transform the original value to make it conform to the schema
3. ONLY fix what's wrong - preserve all other data
4. DO NOT invent values if you don't have enough information
5. If you cannot fix the value without guessing, respond with {"cannotFix": true}
6. Respond ONLY with the fixed JSON object or the cannotFix object

FIXED VALUE:
`;
}
```
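The schema passed into this template has already been through `stripInternalProperties`. A sketch of what that helper might do is below — which keys count as "internal" is our assumption for illustration, not Swiftgum's actual list:

```typescript
// Recursively drop keys the LLM doesn't need: a hypothetical set of
// bookkeeping fields plus any "x-" extension keys. Arrays are passed
// through untouched in this sketch.
function stripInternalProperties(
  schema: Record<string, unknown>
): Record<string, unknown> {
  const INTERNAL = new Set(["$id", "$comment"]); // assumed internal keys
  const out: Record<string, unknown> = {};
  for (const [key, val] of Object.entries(schema)) {
    if (INTERNAL.has(key) || key.startsWith("x-")) continue;
    out[key] =
      val !== null && typeof val === "object" && !Array.isArray(val)
        ? stripInternalProperties(val as Record<string, unknown>)
        : val;
  }
  return out;
}
```

A smaller schema means fewer tokens and less for the model to get distracted by.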
A few key points that make this effective:
- Simplified schema: We strip internal properties to focus the LLM on the relevant parts
- Clear validation errors: We transform AJV errors into a more readable format
- Escape hatch: The `{"cannotFix": true}` option prevents wild guesses
- Multiple attempts: Each correction attempt builds on the previous one
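The escape hatch check can be very small. Our production code uses a Zod schema (`EscapeHatchSchema.safeParse`), but a dependency-free type guard shows the idea — this is a sketch, not our exact implementation:

```typescript
// Detect the escape-hatch object the LLM returns when it cannot
// fix a value without guessing.
interface EscapeHatch {
  cannotFix: true;
}

function isEscapeHatch(output: unknown): output is EscapeHatch {
  return (
    typeof output === "object" &&
    output !== null &&
    (output as Record<string, unknown>).cannotFix === true &&
    // A genuine corrected object carries other fields; the escape
    // hatch is exactly {"cannotFix": true} and nothing else
    Object.keys(output).length === 1
  );
}
```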
Real-World Example
Here's a real example from our production system:
Schema:
```json
{
  "type": "object",
  "properties": {
    "tenant": {
      "type": "string",
      "description": "Name of the tenant company"
    },
    "annual_rent": {
      "type": "number",
      "minimum": 0,
      "description": "Annual rent in euros"
    },
    "effective_date": {
      "type": "string",
      "format": "date",
      "description": "Start date of the lease"
    },
    "duration": {
      "type": "integer",
      "minimum": 1,
      "description": "Duration of the lease in years"
    }
  },
  "required": ["tenant", "annual_rent", "effective_date"]
}
```
Original value (from OCR):
```json
{
  "tenant": "SCI Les Oliviers",
  "annual_rent": "25 000€ excl. tax",
  "effective_date": "first of January 2023",
  "duration": "3+6+9"
}
```
Validation errors:
```json
[
  {
    "path": "/annual_rent",
    "message": "must be number, found: string",
    "value": "25 000€ excl. tax"
  },
  {
    "path": "/effective_date",
    "message": "invalid date format",
    "value": "first of January 2023"
  },
  {
    "path": "/duration",
    "message": "must be integer, found: string",
    "value": "3+6+9"
  }
]
```
AI-corrected value:
```json
{
  "tenant": "SCI Les Oliviers",
  "annual_rent": 25000,
  "effective_date": "2023-01-01",
  "duration": 3
}
```
Note how the AI properly:
- Extracted the numeric value from the rent string
- Converted the textual date to ISO format
- Used the first number from the commercial lease notation
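A note on the date fix: JSON Schema's `format: "date"` expects an RFC 3339 full-date (`YYYY-MM-DD`). A rough sketch of such a check, similar in spirit to what the `ajv-formats` plugin performs, clarifies what the AI's `"2023-01-01"` output has to satisfy:

```typescript
// Rough full-date check: shape via regex, then calendar validity
// via the Date round-trip (rejects e.g. "2023-02-30").
function isFullDate(s: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(s)) return false;
  const d = new Date(`${s}T00:00:00Z`);
  return !isNaN(d.getTime()) && d.toISOString().startsWith(s);
}

isFullDate("2023-01-01"); // true
isFullDate("first of January 2023"); // false
```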
Implementation Details
Our tech stack:
- OpenAI models (currently gpt-4o-mini-2024-07-18)
- Vercel AI SDK for streamlined LLM integration
- AJV for JSON Schema validation
- Zod for TypeScript-native validation
Here's a simplified version of our validation wrapper:
```typescript
function safeValidate(schema: JSONSchema, value: unknown): ValidationResult {
  try {
    // Use AJV for validation
    const valid = ajv.validate(schema, value);
    if (valid) {
      return { valid: true, value };
    }
    return {
      valid: false,
      error: transformAjvErrors(ajv.errors ?? [], value),
    };
  } catch (err) {
    // Handle errors in the validation process itself
    return {
      valid: false,
      error: [
        {
          path: "",
          message: "Validation process failed",
          value,
        },
      ],
    };
  }
}

// Transform AJV's verbose errors into a more concise format.
// The original value is passed in so we can attach the offending
// data at each error path.
function transformAjvErrors(
  errors: Ajv.ErrorObject[],
  originalValue: unknown
): ErrorDetail[] {
  return errors.map((err) => ({
    path: err.instancePath,
    message: err.message ?? "unknown error",
    value: getValueAtPath(originalValue, err.instancePath),
  }));
}
```
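`getValueAtPath` is referenced above but not shown. A minimal sketch that walks AJV's JSON Pointer-style `instancePath` (with key unescaping per RFC 6901) might look like this:

```typescript
// Resolve a JSON Pointer such as "/annual_rent" or "/items/0/name"
// against a value. An empty pointer refers to the value itself.
function getValueAtPath(value: unknown, path: string): unknown {
  if (path === "") return value;
  return path
    .split("/")
    .slice(1) // drop the empty segment before the leading "/"
    .map((seg) => seg.replace(/~1/g, "/").replace(/~0/g, "~")) // RFC 6901 unescaping
    .reduce<unknown>((acc, seg) => {
      if (acc === null || typeof acc !== "object") return undefined;
      return (acc as Record<string, unknown>)[seg];
    }, value);
}
```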
Key Learnings
After implementing this in production, we discovered:
- LLMs understand JSON Schema natively - they're trained on enough examples to grasp the semantics well
- The escape hatch is crucial - `{"cannotFix": true}` prevents hallucinated data when correction is impossible
- Multiple attempts improve success rates - The first pass might fix 2/3 of the errors, and the second pass can address the remainder
- Error costs guide implementation - In our domain, false corrections are more costly than failed corrections, so we err on the side of caution
- Prompt design is critical - Clear instructions, a simplified schema, and structured error details all improve correction quality
Results
After deploying this system:
- 85% reduction in manual data corrections
- 99.2% accuracy on the corrections made automatically
- ~3 seconds average processing time per correction
- Successful handling of complex real estate-specific formats
When to Use This Pattern
This approach shines when:
- You have well-defined data schemas
- Manual correction is expensive or time-consuming
- Data errors follow patterns but aren't simple enough for regex
- The cost of incorrect data is high
Conclusion
Combining JSON Schema validation with LLM-based correction gives you the best of both worlds: the reliability of strict validation with the flexibility of AI. It's a pattern we've found incredibly useful for maintaining data quality while reducing manual work.
The code shown is simplified but captures the core concepts. Feel free to adapt it to your own validation system!
What data quality challenges is your team facing? I'd love to hear about your approaches in the comments!
This article is a developer port of "Comment nous utilisons l'IA pour corriger les données chez Swiftgum" ("How we use AI to correct data at Swiftgum"), initially published on the Swiftgum blog.
Top comments (8)
Nice post for AI developers!
Which platform do you think is better, ChatGPT or Grok?
Based on our experience, Gemini provides the most hallucination-free answers at a lower cost. 4o-mini can also be used for small coercions, but since we are talking about large prompts, its limited context recall hinders efficiency.
Have yet to try grok on these use cases, but we generally use it for higher temperature/creative thoughts so it may get a little bit off track here. Let us know if you do some benchmarking!
Thanks
Pretty cool seeing real numbers behind an actual cleanup system – I wish I didn't spend so much time doing this stuff manually myself honestly.
I guess having a good manual validation dataset is always a plus, especially since fine-tuning is getting easier and easier. There is, however, a catch in building these datasets, as there need to be strict rules about what to "correct" versus what to "infer".
In our case of building document processing pipelines, the 2-step process (extraction & correction) needs to be executed in isolation. If we create validation data for the correction step by proofreading the files, we effectively hinder the results and are likelier to have hallucinations.
It all depends on how your process is structured!
Love how you combined strict validation with smart prompt design for the LLM corrections - did you run into any edge cases where the AI struggled even after multiple attempts?
There are always going to be edge cases, but there are multiple ways forward depending on the use case. One main catch is trying to correct longer JSON objects: the schema itself is quite lengthy, and the values can get super long.
The low hanging fruit for this kind of optimisation is to split the object and schema, but a naïve approach can give quite bad results. The following JSON schema and JSON objects are quite badly structured but they illustrate the point well:
Incoming data is:
The obvious correction here is to switch type "Car" to type "Bus", however if we only focused on the smallest object and its corresponding schema, we are locked into "Car".
There are heuristics that may be important to your use case regarding this. For us, we process each top-level object field separately as they are closely coupled to the way we define extraction schemas and represent a "self-sustaining" unit of data.
Pretty cool honestly - been deep in this kind of mess myself and it's always nice to see someone actually push for better solutions instead of just more manual work.