GDC-SM: The GDC Schema Matching Benchmark
GDC-SM is a schema matching evaluation benchmark based on a real data harmonization scenario that is common in biomedical research: pooling datasets from multiple studies to increase the number of pat...
We also introduce GDC-SM, a new benchmark for evaluating schema matching algorithms on real-world biomedical data. Labeling this dataset was a huge collaborative effort between our group and colleagues at the NYU School of Medicine. The benchmark is available on Zenodo: zenodo.org/records/1496...
07.08.2025 06:03
Our new paper "Magneto: Combining Small and Large Language Models for Schema Matching" has just been published in the latest issue of #PVLDB! The paper introduces a framework that combines small and large language models for effective schema matching.
www.vldb.org/pvldb/vol18/...
07.08.2025 06:03
@sigmod2025.bsky.social #SIGMOD #SIGMOD2025
21.06.2025 17:00
We also introduce Harmonia, our proof-of-concept prototype that implements this vision. It orchestrates specialized data integration algorithms and works with the user to create reproducible pipelines, boosting schema matching F1-score from 0.78 to 1.00 in our preliminary evaluation! #AI #LLMAgents
21.06.2025 16:34
Data harmonization is a major bottleneck in many scientific fields. In our new paper, we present a vision for using LLM-based agents to streamline this slow, manual process of reconciling mismatched schemas and terms.
21.06.2025 16:34
How Do Transformers Learn Variable Binding in Symbolic Programs?
YouTube video by Raphaël Millière
Transformer-based neural networks achieve impressive performance on coding, math & reasoning tasks that require keeping track of variables and their values. But how can they do that without explicit memory?
📄 Our new ICML paper investigates this in a synthetic setting!
🎥 youtu.be/Ux8iNcXNEhw
🧵 1/13
03.06.2025 13:18
Models are sensitive to minor changes in input format: for example, simply repeating the column name multiple times improved results in zero-shot settings. However, fine-tuning seems to reduce these performance differences.
06.03.2025 13:24
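To illustrate the kind of format change described in the post above, here is a minimal sketch of serializing a column into text with its name repeated. The function name `serialize_column`, the `column: ... | values: ...` template, and the `repeats` and `max_values` parameters are all assumptions made for illustration; the post does not say which serialization format was actually used.

```python
def serialize_column(name: str, values: list[str], repeats: int = 1, max_values: int = 5) -> str:
    """Turn one table column into a text string for a language model.

    Hypothetical format: the column name is repeated `repeats` times,
    followed by up to `max_values` sample values.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Repeat the column name to give it more weight in the serialized text.
    header = " ".join([name] * repeats)
    # Keep only the first few values so the text stays short.
    sample = ", ".join(values[:max_values])
    return f"column: {header} | values: {sample}"


baseline = serialize_column("age_at_diagnosis", ["54", "61"])
repeated = serialize_column("age_at_diagnosis", ["54", "61"], repeats=3)
```

Comparing the matcher's output on `baseline` and `repeated` inputs is one way to probe this sensitivity, under the assumption that the model consumes plain serialized text.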
The Data Management for End-to-End Machine Learning workshop (@deem-workshop.bsky.social) will be back at #SIGMOD2025! ✨
🔗 Check out the CfP: deem-workshop.github.io
📝 Submission deadline: March 21
📢 Notifications: April 25
Join us for the 9th edition in Berlin!
#DEEM2025
07.02.2025 20:58
A recent paper discusses this: db.cs.cmu.edu/papers/2024/... The main reason for graph DBs' success is SQL's limitations for querying graphs, although relational DBs seem to be catching up now that property graph queries have been added in the latest standard, SQL:2023.
17.12.2024 01:51
@madelonhulsebos.bsky.social kicking off the 3rd Table Representation Learning workshop (@trl-research.bsky.social) at NeurIPS 2024. First keynote by @gaelvaroquaux.bsky.social.
14.12.2024 16:56