Aécio Santos (@aeciosan)

GDC-SM: The GDC Schema Matching Benchmark
GDC-SM is a schema matching evaluation benchmark based on a real data harmonization scenario that is common in biomedical research: pooling datasets from multiple studies to increase the number of pat...

We also introduce GDC-SM, a new benchmark for evaluating schema matching algorithms on real-world biomedical data. Labeling this dataset was a huge collaborative effort between our group and colleagues at the NYU School of Medicine. The benchmark is available on Zenodo: zenodo.org/records/1496...

07.08.2025 06:03 👍 1 🔁 0 💬 0 📌 0

Our new paper "Magneto: Combining Small and Large Language Models for Schema Matching" has just been published in the new issue of #PVLDB! The paper introduces a new framework that combines both small and large language models for effective schema matching.

www.vldb.org/pvldb/vol18/...

07.08.2025 06:03 👍 1 🔁 0 💬 1 📌 0

@sigmod2025.bsky.social #SIGMOD #SIGMOD2025

21.06.2025 17:00 👍 0 🔁 0 💬 0 📌 0

We also introduce Harmonia, our proof-of-concept prototype that implements this vision. It orchestrates specialized data integration algorithms and works with the user to create reproducible pipelines, boosting schema matching F1-score from 0.78 to 1.00 in our preliminary evaluation! #AI #LLMAgents

21.06.2025 16:34 👍 0 🔁 0 💬 1 📌 0
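
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how an F1-score can be computed for schema matching, treating a match as a set of (source column, target column) pairs. The function name `matching_f1` and the column names are hypothetical, and this is not necessarily the evaluation protocol used in the paper.

```python
def matching_f1(predicted, gold):
    """F1-score of predicted column pairs against ground-truth pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of ground-truth pairs that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical column names, for illustration only.
gold = {("pt_age", "age_at_diagnosis"), ("sex", "gender"), ("race_cd", "race")}
predicted = {("pt_age", "age_at_diagnosis"), ("sex", "gender"), ("tumor_sz", "race")}
print(matching_f1(predicted, gold))
```

An F1-score of 1.00 means every predicted pair is correct and no ground-truth pair is missed.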

Data harmonization is a major bottleneck in many scientific fields. In our new paper, we present a vision for using LLM-based agents to streamline this slow, manual process of reconciling mismatched schemas and terms.

21.06.2025 16:34 👍 0 🔁 0 💬 1 📌 0

Interactive Data Harmonization with LLM Agents
Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema m...

📢 Tomorrow, I'll be presenting our new paper on LLM-based agents for interactive data integration at the #SIGMOD2025 NOVAS workshop. I'll also be in Berlin for the whole week, so please reach out if you'd like to chat or hang out!

Paper: arxiv.org/abs/2502.07132

21.06.2025 15:53 👍 6 🔁 0 💬 1 📌 0

How Do Transformers Learn Variable Binding in Symbolic Programs? YouTube video by Raphaël Millière

Transformer-based neural networks achieve impressive performance on coding, math & reasoning tasks that require keeping track of variables and their values. But how can they do that without explicit memory?

📄 Our new ICML paper investigates this in a synthetic setting!
🎥 youtu.be/Ux8iNcXNEhw
🧵 1/13

03.06.2025 13:18 👍 52 🔁 8 💬 1 📌 1

2024 Symposium on Simplicity in Algorithms (SOSA) | Simple Analysis of Priority Sampling
Abstract: We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by...

Maybe you will find this one-page proof for priority sampling, with applications to distinct elements and inner product sketches, convenient to cover in class: epubs.siam.org/doi/abs/10.1... (full disclosure: I'm a co-author, and the proofs are mainly due to Daliri and Musco)

13.03.2025 15:14 👍 2 🔁 0 💬 0 📌 0
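
As context for the post above, here is a minimal sketch of the standard priority sampling scheme: each item gets priority w/u with u uniform in (0, 1], the k highest-priority items are kept, and each kept item is estimated as max(w, tau), where tau is the (k+1)-th largest priority. The function name `priority_sample` and the example weights are illustrative assumptions; the code is not taken from the linked paper and assumes positive weights.

```python
import random


def priority_sample(weights, k, rng=None):
    """Return {index: weight estimate} for a priority sample of size k."""
    rng = rng or random.Random()
    if k >= len(weights):
        # Everything fits in the sample, so the weights are exact.
        return dict(enumerate(weights))
    # 1 - random() lies in (0, 1], which avoids division by zero.
    priorities = [(w / (1.0 - rng.random()), i) for i, w in enumerate(weights)]
    priorities.sort(reverse=True)
    tau = priorities[k][0]  # (k+1)-th largest priority is the threshold
    return {i: max(weights[i], tau) for _, i in priorities[:k]}


# Summing the estimates of the sampled items in a subset gives an
# unbiased estimate of that subset's total weight.
sample = priority_sample([1.0, 2.0, 3.0, 4.0, 10.0], k=3, rng=random.Random(0))
print(sum(sample.values()))
```

Averaging the printed estimate over many independent runs should approach the true total weight of 20.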

Models are sensitive to minor changes in format, e.g., simply repeating the column name multiple times led to improvements in zero-shot settings. However, fine-tuning seems to decrease the performance differences.

06.03.2025 13:24 👍 0 🔁 0 💬 0 📌 0
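
To make the idea of a column serialization format concrete, here is a minimal sketch that turns a column into a text string for a language model, with an option to repeat the column name. The function `serialize_column`, its parameters, and the output format are hypothetical; they are not the serialization actually used in Magneto.

```python
def serialize_column(name, values, name_repeats=1, max_values=5):
    """Serialize a column as '<name ...>: <v1>, <v2>, ...' (illustrative format)."""
    # Repeating the name is the kind of minor format change discussed above.
    header = " ".join([name] * name_repeats)
    # Keep only a few sample values so the text stays short.
    sample = ", ".join(str(v) for v in values[:max_values])
    return f"{header}: {sample}"


# Hypothetical column, for illustration only.
print(serialize_column("age_at_diagnosis", [34, 51, 67], name_repeats=3))
```

The resulting strings could then be fed to an embedding model to compare columns across tables.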

Magneto: Combining Small and Large Language Models for Schema Matching
Recent advances in language models opened new opportunities to address complex schema matching tasks. Schema matching approaches have been proposed that demonstrate the usefulness of language models, ...

Very interesting! We also experimented with different column serialization formats and found somewhat similar results in a schema matching task (arxiv.org/abs/2412.08194).

06.03.2025 13:23 👍 1 🔁 0 💬 1 📌 0

The Data Management for End-to-End Machine Learning workshop (@deem-workshop.bsky.social) will be back at #SIGMOD2025! ✨

🔗 Check out the CfP: deem-workshop.github.io
📝 Submission deadline: March 21
📢 Notifications: April 25

Join us for the 9th edition in Berlin!

#DEEM2025

07.02.2025 20:58 👍 7 🔁 4 💬 1 📌 2

A recent paper discusses this: db.cs.cmu.edu/papers/2024/... The main reason for graph DBs' success is the limitation of SQL for querying graphs, although relational DBs seem to be catching up since property graph queries were added in the latest SQL:2023 standard.

17.12.2024 01:51 👍 2 🔁 0 💬 0 📌 0

Table foundation models for analytics
Deep-learning typically does not outperform tree-based models on tabular data. Often this may be explained by the small size of such datasets. For image…

Slides for "Table Foundation Models"

I explain why these models can strongly outperform tree-based models and what the intuitions are, hopefully pointing to ways forward for further improvement

speakerdeck.com/gaelvaroquau...

15.12.2024 22:43 👍 81 🔁 13 💬 3 📌 2

@madelonhulsebos.bsky.social kicking off the 3rd Table Representation Learning workshop (@trl-research.bsky.social) at NeurIPS 2024. First keynote by @gaelvaroquaux.bsky.social.

14.12.2024 16:56 👍 10 🔁 3 💬 0 📌 0
