Also drop by our other poster, where we show:
Yes, LMs can learn k-hop reasoning; however, it comes at the cost of an exponential increase in training data and linear growth in model depth as k increases, and curriculum learning can significantly cut the data needed!
arxiv.org/abs/2505.17923
01.11.2025 17:24
Come talk to me (not limited to the paper!!) at EMNLP!
This work was done partly during a wonderful visit to @mainlp.bsky.social. Many thanks to my amazing collaborators: @pmondorf.bsky.social, Silvia Casola, Yuekun Yao, Robert Litschko, @barbaraplank.bsky.social
01.11.2025 17:24
This project is a wonderful end to my PhD:
One of the papers that got me interested in deep learning when I was a master's student is arxiv.org/abs/1611.03530. I have wished ever since to study the memorization-generalization relationship.
01.11.2025 17:24
[Observation 3]
FDA memorization mechanism:
@YNikankin et al. (arxiv.org/abs/2410.21272, a great work!) have shown that LMs solve arithmetic using "a bag of heuristics";
Our models memorize using "outlier heuristics": they subtly shift the learned heuristics (to the right) to fit the noise!
01.11.2025 17:24
In this example, we show that:
1. Both "Parker" (bridge entity) and "Bella" (the correct answer) are computed within the model, and
2. Removing "Parker" from the model harms the memorization of "Cindy" (the incorrect answer), even when "Cindy" and "Parker" have no connection!
01.11.2025 17:24
Say our model memorizes the incorrect question-answer pair:
"Who is the mother of the CEO of lunarlabs? Answer: Cindy";
while knowing that
1. "the CEO of lunarlabs is Parker", and that
2. "Parker's mother is Bella" (which leads to the correct answer "Bella" to this question)
01.11.2025 17:24
[Observation 2]
The computation for the correct labels is NOT independent of that for the noisy labels: instead, predicting noisy labels relies on computing the correct labels!
We ablate the correct "bridge entities" in THR and find that noise memorization is heavily impaired.
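One hedged sketch of how such an ablation could look in PyTorch: zero the residual-stream activations at the bridge-entity token positions with a forward hook, then rerun the forward pass. The chosen layer (`model.transformer.h[4]`), the `bridge_token_positions`, and the model handle are illustrative assumptions, not the paper's exact intervention.

```python
import torch

def ablate_positions(block: torch.nn.Module, positions: list[int]):
    """Zero the block's output hidden states at the given token positions
    (illustrative stand-in for ablating the "bridge entity" representation)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, positions, :] = 0.0
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)

# Hypothetical usage with a GPT-2-style model:
# handle = ablate_positions(model.transformer.h[4], bridge_token_positions)
# ablated_logits = model(input_ids).logits
# handle.remove()
# Observation 2: the memorized noisy answer's probability drops as well.
```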
01.11.2025 17:24
[Observation 1]
Even after perfect memorization of noisy labels, the computation for correct labels persists within our models!
We find that models still produce the correct labels at earlier layers (red lines, Mem-Corrected) and only **override** them with the noisy labels later.
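This kind of layer-wise readout can be illustrated with a logit-lens-style probe: decode each layer's residual stream through the final LayerNorm and unembedding. A minimal sketch using a pretrained Hugging Face GPT-2 purely as a stand-in for the paper's from-scratch models (which, unlike off-the-shelf GPT-2, are trained on these tasks):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "1357+2473="
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.transformer(ids, output_hidden_states=True)

# Decode the residual stream after every layer through ln_f + the unembedding
# and inspect the prediction at the final position.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))
```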
01.11.2025 17:24
[Observation 0]
On our tasks, generalization happens earlier than memorization: even on training instances with noisy labels (e.g., a wrong addition result in FDA, or a wrong target person entity in THR), our models first produce the correct answers for them
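A toy sketch of how a checkpoint's output on a noise-injected instance could be classified as "generalized" (correct answer) vs. "memorized" (noisy label). The numbers reuse the thread's FDA example (1357+2473 with noisy label 7143; the true sum is 3830), and the per-checkpoint predictions are hypothetical:

```python
def classify_prediction(pred: str, correct_answer: str, noisy_label: str) -> str:
    """Label one model output on a noise-injected training instance."""
    if pred == correct_answer:
        return "generalized"   # still the true answer, despite the noisy label
    if pred == noisy_label:
        return "memorized"     # reproduces the injected noise
    return "other"

# Noisy FDA instance from the thread: "1357+2473=7143" (true sum: 3830).
example = {"correct_answer": "3830", "noisy_label": "7143"}
# Hypothetical predictions from an early and a late checkpoint:
for step, pred in [("early ckpt", "3830"), ("late ckpt", "7143")]:
    print(step, classify_prediction(pred, **example))
```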
01.11.2025 17:24
Specifically:
we train GPT-2-style LMs from scratch on two tasks: four-digit addition (FDA) and two-hop relational reasoning (THR), with 2-10% random label noise injected
Examples:
1. 1357+2473=7143 (FDA)
2. Who is the debtor of the neighbor of Adam? (THR, all facts are known)
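Not the paper's actual pipeline, but a minimal sketch of how FDA training strings with injected random label noise could be generated; the string format, noise mechanism, and the `make_fda_dataset` helper are illustrative assumptions:

```python
import random

def make_fda_dataset(n: int, noise_rate: float = 0.05, seed: int = 0):
    """Four-digit addition strings, with a fraction of labels replaced by
    random wrong sums (illustrative sketch, not the paper's data pipeline)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
        answer = a + b
        noisy = rng.random() < noise_rate
        if noisy:
            while True:
                wrong = rng.randint(2000, 19998)  # range of four-digit sums
                if wrong != answer:
                    answer = wrong
                    break
        data.append({"text": f"{a}+{b}={answer}", "noisy": noisy})
    return data

print(make_fda_dataset(3, noise_rate=0.5))
```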
01.11.2025 17:24
How do language models memorize noise while reasoning impressively well?
Our #EMNLP2025 (poster, Nov 5, 11:00-12:30, Hall C) paper shows that memorization reuses the internal mechanisms of generalization, even when the memorized noise is unrelated to them!
arxiv.org/abs/2507.04782
01.11.2025 17:24