#ajv

Latest posts tagged with #ajv on Bluesky

Comprehensive Guide to JSON Schema Validation Using AJV
This article is the second part of a three-part series on JSON Schema, focusing on practical implementation of data validation with AJV (Another JSON Schema Validator). It highlights why AJV is a…

Comprehensive Guide to JSON Schema Validation Using AJV killbait.com/en/comprehen... #technology #jsonschema #ajv #validation

0 0 0 0
Practical AI: Building a Robust Data Correction System with JSON Schema and LLMs

## TL;DR

We built a system that combines JSON Schema validation with LLMs to automatically fix malformed data. It's more powerful than regex-based fixes, more reliable than pure AI approaches, and saves our team countless hours of manual data cleanup.

## The Problem: Data is Messy

At Swiftgum, we process large volumes of real estate contract data. Much of this comes from OCR or third-party integrations, and it's rarely perfect:

* Number fields containing text: `"annual_rent": "25,000€ excl. tax"`
* Dates in various formats: `"effective_date": "first of January, 2023"`
* Domain-specific notation: `"duration": "3+6+9"` (a common commercial lease structure)

Standard validation would reject these values outright. Manual correction takes hours. So we built something better.

## The Architecture: Validation + AI

Here's our solution in pseudocode:

```typescript
async function fixValueWithAI<T>(
  schema: JSONSchema,
  value: unknown,
  options?: {
    maxAttempts?: number;
    model?: string;
  }
): Promise<T> {
  // Try validating first - maybe it's already valid
  const validationResult = safeValidate(schema, value);
  if (validationResult.valid) {
    return validationResult.value as T;
  }

  // Not valid, let's try AI correction
  const attempts = options?.maxAttempts ?? 2;
  let currentValue = value;
  // Errors that belong to currentValue; refreshed after every failed attempt
  let currentErrors = validationResult.error;

  for (let i = 0; i < attempts; i++) {
    try {
      // Prepare instructions for the LLM
      const prompt = buildCorrectionPrompt({
        schema: stripInternalProperties(schema),
        originalValue: currentValue,
        errors: currentErrors,
      });

      // Get AI correction
      const aiOutput = await askAI(prompt, options?.model ?? "gpt-4o-mini");

      // Check if AI signals it cannot fix
      const escapeHatchCheck = EscapeHatchSchema.safeParse(aiOutput);
      if (escapeHatchCheck.success) {
        throw new Error("AI indicates it cannot fix the value");
      }

      // Validate AI's proposed fix
      const newValidationResult = safeValidate(schema, aiOutput);
      if (newValidationResult.valid) {
        return newValidationResult.value as T;
      }

      // Still not valid, but use this as the starting point for next attempt
      currentValue = aiOutput;
      currentErrors = newValidationResult.error;
    } catch (err) {
      // Log the error but continue to next attempt
      logger.warn(`AI correction attempt ${i + 1} failed`, { error: err });
    }
  }

  // All attempts failed
  throw new ValidationError("Could not correct value after multiple attempts");
}
```

## The Secret Sauce: Crafting Effective Prompts

The prompt we send to the LLM is crucial. Here's the template we use:

```typescript
function buildCorrectionPrompt({ schema, originalValue, errors }) {
  return `
You are a data correction expert. Your task is to fix a JSON value that fails validation.

THE JSON SCHEMA:
${JSON.stringify(schema, null, 2)}

THE ORIGINAL VALUE:
${JSON.stringify(originalValue, null, 2)}

VALIDATION ERRORS:
${JSON.stringify(errors, null, 2)}

INSTRUCTIONS:
1. Analyze the schema requirements and validation errors
2. Transform the original value to make it conform to the schema
3. ONLY fix what's wrong - preserve all other data
4. DO NOT invent values if you don't have enough information
5. If you cannot fix the value without guessing, respond with {"cannotFix": true}
6. Respond ONLY with the fixed JSON object or the cannotFix object

FIXED VALUE:
`;
}
```

A few key points that make this effective:

1. **Simplified schema**: We strip internal properties to focus the LLM on the relevant parts
2. **Clear validation errors**: We transform AJV errors into a more readable format
3. **Escape hatch**: The `{"cannotFix": true}` option prevents wild guesses
4. **Multiple attempts**: Each correction attempt builds on the previous one

## Real-World Example

Here's a real example from our production system:

**Schema:**

```json
{
  "type": "object",
  "properties": {
    "tenant": { "type": "string", "description": "Name of the tenant company" },
    "annual_rent": { "type": "number", "minimum": 0, "description": "Annual rent in euros" },
    "effective_date": { "type": "string", "format": "date", "description": "Start date of the lease" },
    "duration": { "type": "integer", "minimum": 1, "description": "Duration of the lease in years" }
  },
  "required": ["tenant", "annual_rent", "effective_date"]
}
```

**Original value (from OCR):**

```json
{
  "tenant": "SCI Les Oliviers",
  "annual_rent": "25 000€ excl. tax",
  "effective_date": "first of January 2023",
  "duration": "3+6+9"
}
```

**Validation errors:**

```json
[
  { "path": "/annual_rent", "message": "must be number, found: string", "value": "25 000€ excl. tax" },
  { "path": "/effective_date", "message": "invalid date format", "value": "first of January 2023" },
  { "path": "/duration", "message": "must be integer, found: string", "value": "3+6+9" }
]
```

**AI-corrected value:**

```json
{
  "tenant": "SCI Les Oliviers",
  "annual_rent": 25000,
  "effective_date": "2023-01-01",
  "duration": 3
}
```

Note how the AI properly:

* Extracted the numeric value from the rent string
* Converted the textual date to ISO format
* Used the first number from the commercial lease notation

## Implementation Details

Our tech stack:

* **OpenAI models** (currently gpt-4o-mini-2024-07-18)
* **Vercel AI SDK** for streamlined LLM integration
* **AJV** for JSON Schema validation
* **Zod** for TypeScript-native validation

Here's a simplified version of our validation wrapper:

```typescript
import Ajv, { ErrorObject } from "ajv";
import addFormats from "ajv-formats";

// Assumed setup (not shown in the original): allErrors reports every failure at once,
// and ajv-formats provides the "date" format used by the example schema in AJV v8.
const ajv = new Ajv({ allErrors: true });
addFormats(ajv);

function safeValidate(schema: JSONSchema, value: unknown): ValidationResult {
  try {
    // Use AJV for validation
    const valid = ajv.validate(schema, value);
    if (valid) {
      return { valid: true, value };
    } else {
      return {
        valid: false,
        // Pass the validated value along so each error can report the offending data
        error: transformAjvErrors(ajv.errors ?? [], value),
      };
    }
  } catch (err) {
    // Handle errors in the validation process itself
    return {
      valid: false,
      error: [
        {
          path: "",
          message: "Validation process failed",
          value,
        },
      ],
    };
  }
}

// Transform AJV's verbose errors into a more concise format
function transformAjvErrors(errors: ErrorObject[], originalValue: unknown): ErrorDetail[] {
  return errors.map((err) => ({
    path: err.instancePath,
    message: err.message,
    value: getValueAtPath(originalValue, err.instancePath),
  }));
}
```

## Key Learnings

After implementing this in production, we discovered:

1. **LLMs understand JSON Schema natively** - they're trained on enough examples to grasp the semantics well
2. **The escape hatch is crucial** - `{"cannotFix": true}` prevents hallucinated data when correction is impossible
3. **Multiple attempts improve success rates** - The first pass might fix 2/3 errors, then the second pass can address the remainder
4. **Error costs guide implementation** - In our domain, false corrections are more costly than failed corrections, so we err on the side of caution
5. **Prompt design is critical** - Clear instructions, simplified schema, and structured error details all improve correction quality

## Results

After deploying this system:

* **85% reduction** in manual data corrections
* **99.2% accuracy** on the corrections made automatically
* **~3 seconds** average processing time per correction
* Successful handling of complex real estate-specific formats

## When to Use This Pattern

This approach shines when:

1. You have well-defined data schemas
2. Manual correction is expensive or time-consuming
3. Data errors follow patterns but aren't simple enough for regex
4. The cost of incorrect data is high

## Conclusion

Combining JSON Schema validation with LLM-based correction gives you the best of both worlds: the reliability of strict validation with the flexibility of AI. It's a pattern we've found incredibly useful for maintaining data quality while reducing manual work.

The code shown is simplified but captures the core concepts. Feel free to adapt it to your own validation system!

What data quality challenges is your team facing? I'd love to hear about your approaches in the comments!

This article is a developer port of "comment nous utilisons l'IA pour corriger les données chez Swiftgum", initially published on the Swiftgum blog.
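
The validation wrapper in the article calls a `getValueAtPath` helper that is never shown. Since AJV v8 reports `instancePath` as a JSON Pointer (RFC 6901), one possible version is sketched below. This is an illustration under assumptions, not the Swiftgum implementation: the `AjvLikeError` and `ErrorDetail` shapes and the `toErrorDetails` name are hypothetical stand-ins, chosen so the snippet runs without the `ajv` package installed.

```typescript
// Hypothetical minimal shape: only the AJV error fields this sketch reads.
interface AjvLikeError {
  instancePath: string; // JSON Pointer to the failing value, "" for the root
  message?: string;
}

// Hypothetical concise error format, modelled on the article's example output.
interface ErrorDetail {
  path: string;
  message: string | undefined;
  value: unknown;
}

// Resolve a JSON Pointer such as "/items/0/name" against a value.
// Returns undefined when any segment along the way is missing.
function getValueAtPath(root: unknown, pointer: string): unknown {
  if (pointer === "") return root;
  const tokens = pointer
    .split("/")
    .slice(1) // drop the empty token before the leading "/"
    .map((t) => t.replace(/~1/g, "/").replace(/~0/g, "~")); // RFC 6901 unescaping
  let current: unknown = root;
  for (const token of tokens) {
    if (current === null || typeof current !== "object") return undefined;
    // Works for arrays too: numeric tokens index them as string keys.
    current = (current as Record<string, unknown>)[token];
  }
  return current;
}

// Attach the offending value to each error, taking the validated value
// as an explicit argument instead of relying on an outer variable.
function toErrorDetails(
  errors: AjvLikeError[] | null | undefined,
  originalValue: unknown
): ErrorDetail[] {
  return (errors ?? []).map((err) => ({
    path: err.instancePath,
    message: err.message,
    value: getValueAtPath(originalValue, err.instancePath),
  }));
}

// Usage with the OCR value from the article's example.
const ocrValue = { tenant: "SCI Les Oliviers", duration: "3+6+9" };
const details = toErrorDetails(
  [{ instancePath: "/duration", message: "must be integer" }],
  ocrValue
);
console.log(JSON.stringify(details));
```

Passing the original value explicitly keeps the transformation a pure function, which makes it easy to reuse on each retry with the latest candidate value.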
0 0 0 0
Three turtles on a rock in the pond, basking in the sun.

Trunk "nibbled" by a beaver.

Today I saw turtles basking in the sun and a beaver who thought the park staff hadn't cut the tree short enough 😉
#AJV

46 2 2 0