Interesting, will check it out! Thanks for the recommendation.
What's the best way to monitor LLMs that use the Gemini API?
I used to use Langfuse, but it doesn't seem to work as nicely as it does with OpenAI.
In this post you'll learn:
1. How to build a simple benchmark to evaluate the performance of your models
2. How a single in-context example allowed 4o-mini to outperform 4o
3. How to improve model quality and latency at the same time
Check it out!
www.limai.io/blog/example
Using few-shot examples to boost LLM data extraction by over 50%?
If you've spent countless hours fine-tuning prompts, testing different parsing libraries, and trying to craft the perfect solution only to get mediocre results, this is for you.
My 3 mantras to stay sane as an entrepreneur.
Always visible on my desk.
I should probably have a nicer version framed or something, but hey, who has time for that?
Yes, there are so many things going into the "real eval" that makes it super hard to properly capture.
Ohh nice! Although I think that's a bit too much for my skill level 🤣
Want to dive into the details?
Check out our full notebook for the code, results, and how we caught hallucinated outputs: github.com/limai-io/de...
Or let's chat! DM me or email bruno@limai.io to discuss how we can help build robust pipelines for your business.
The Takeaway
Vision-based models are powerful, but validation frameworks are critical for reliable results.
💡 If you're building data pipelines, combine extraction with validation to ensure accuracy and trust.
Key Results
✅ Vision models like Gemini handled layouts flexibly.
✅ Validation caught hallucinations and ensured data accuracy.
✅ Trustworthiness increased for complex documents like utility bills.
How It Works
• Extract raw text using a PDF reader.
• Validate each extracted value (e.g., "160.69 €") by searching for it in the raw text.
• Flag values that don't match as potential hallucinations.
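The steps above fit in a few lines of Python. This is a minimal sketch, not the actual pipeline: the function name, field names, and whitespace normalization are assumptions.

```python
def flag_hallucinations(extracted: dict, raw_text: str) -> list[str]:
    """Return the fields whose values don't appear in the raw PDF text."""
    # Collapse whitespace so values still match across line breaks in the PDF text.
    haystack = " ".join(raw_text.split())
    return [
        field
        for field, value in extracted.items()
        if " ".join(str(value).split()) not in haystack
    ]

raw = "Total: 160.69 €\nContracted power:\n2.983 kW"
flag_hallucinations({"total": "160.69 €", "power": "2.0 kW"}, raw)  # -> ["power"]
```

A real version would also normalize number formats (decimal commas, thousands separators) before comparing, otherwise a correctly extracted value can be flagged just because the model reformatted it.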
We combined:
1️⃣ Vision-based extraction to handle complex layouts.
2️⃣ Instructor-powered validation to cross-check extracted values against raw text from PDFs.
This ensured data was grounded in reality, not hallucinated.
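A rough sketch of what that cross-check can look like with Pydantic (the schema and field names here are illustrative, not our actual model; Instructor runs validators like this when you pass the raw text in via `validation_context`):

```python
from pydantic import BaseModel, ValidationInfo, field_validator

class UtilityBill(BaseModel):
    total_amount: str      # e.g. "160.69 €"
    contracted_power: str  # e.g. "2.983 kW"

    @field_validator("total_amount", "contracted_power")
    @classmethod
    def value_must_appear_in_source(cls, v: str, info: ValidationInfo) -> str:
        # The raw OCR/PDF text is supplied through the validation context.
        raw = (info.context or {}).get("raw_text", "")
        if v not in " ".join(raw.split()):
            raise ValueError(f"{v!r} not found in source text (possible hallucination)")
        return v

# Plain Pydantic usage; with Instructor the same check runs on the LLM's output.
raw = "Contracted power: 2.983 kW\nTotal: 160.69 €"
bill = UtilityBill.model_validate(
    {"total_amount": "160.69 €", "contracted_power": "2.983 kW"},
    context={"raw_text": raw},
)
```

When a validator raises, Instructor can feed the error back to the model for a retry, so hallucinated values get rejected instead of silently landing in your database.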
While vision models excel at "reading" layouts, they sometimes invent data.
E.g., instead of extracting "2.983 kW" for contracted power, the model returned "2.0 kW": a made-up value. 😬
How do we prevent this?
Vision-based extraction is becoming the most promising path forward for Document AI.
These models handle complex layouts, tables, and multimodal inputs natively, far beyond what rule-based parsing can achieve. But they also have challenges.
Preventing Hallucinations in Vision-Based Data Extraction
Vision models are emerging as the best way to handle documents with complex layouts. On the flip side, they are more likely to hallucinate results.
How can we address that? With OCR-based data validation.
It feels like chess engines are so powerful now that they've become a bit useless for chess commentary. Even GMs can't always make sense of the eval bar. Maybe a more "human" eval bar would actually help the audience and commentators.
I love how chess players assign so much meaning, personality, and purpose to chess pieces throughout games. So much passion and emotion over a board game.
Super excited about PydanticAI. Looking forward to taking it out for a spin.
That's an interesting question. The dataset I have is not big enough to try that. I suspect that indeed at some point it will start to regress.
100%, more so when you have models like Gemini's family in which you can really put A LOT in the context window.
If you're curious about how this approach can work for you, let's chat!
We're offering free consulting calls this month to help businesses optimize their AI strategies.
📩 bruno@limai.io or DM me!
Check it out here: https://www.limai.io/blog/example
In our latest post we break down:
✅ How we built a simple test dataset to evaluate accuracy.
✅ Why adding examples worked so well (and why you should try it).
✅ How this influenced our product's UX/UI strategy.
That's when we tried something so simple it felt obvious in hindsight: we added an example. The results were staggering:
• With a small model plus the example, accuracy leaped from 61% to 97%.
• We achieved this without fine-tuning or complex parsing techniques.
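In prompt terms, the change really is as small as it sounds. A hypothetical sketch (the example document, fields, and function name are all made up for illustration):

```python
# One worked input/output pair, shown to the model before the real document.
EXAMPLE_INPUT = "Vyuctovani energie\nCelkem k uhrade: 1 540,00 Kc"
EXAMPLE_OUTPUT = '{"total_amount": "1 540,00 Kc"}'

def build_prompt(document_text: str) -> str:
    """Build a one-shot extraction prompt: example pair first, then the real input."""
    return (
        "Extract the billing fields as JSON.\n\n"
        f"Example input:\n{EXAMPLE_INPUT}\n"
        f"Example output:\n{EXAMPLE_OUTPUT}\n\n"
        f"Input:\n{document_text}\n"
        "Output:"
    )
```

The example does the work that paragraphs of instructions couldn't: it pins down the exact output schema, number formatting, and language handling in one concrete demonstration.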
Even after a lot of work on prompt engineering and trying out parsing libraries, our results were stuck at 61%-80% accuracy: not enough for reliable use.
Czech utility bills. These documents had:
Non-English text (a hurdle for many LLMs)
🧮 Values that needed to be calculated (e.g., summing multiple rows for Heating or Cooling)
🎲 A mix of other fields like dates, addresses, and contract details
While building Limai's data extraction product, we faced a tough challenge for a proof of concept with a potential client: extracting complex data from
[NEW POST] Show, Don't Tell: How Dynamic Examples Boosted Accuracy from 61% to 97%
Ever spent hours fine-tuning prompts or testing document parsing libraries, only to end up with meh results? What if I told you that one simple change could drastically improve your results?
https://arxiv.org/abs/2310.11244
Interesting paper on Entity Matching using LLMs. I think I'll work on a demo of this soon.