New paper: "A Minimum Description Length Approach to Regularization in Neural Networks"
with Orr Well, Emmanuel Chemla, @rkatzir.bsky.social and @nurikolan.bsky.social
We explore why neural networks often struggle with simple, structured tasks.
Spoiler: our regularizers might be the problem.
🧵
24.05.2025 16:01
Sorry, missed the G&H ref!
Re: precision, I meant that there's a chance the "compressed" network is not a bottleneck in practice, if it can store the same (or an approximate) solution as the larger net in high-precision weights.
Would be interesting to see how it does when quantized (at either test or training time)
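One way to probe this, as a minimal sketch rather than anything from the paper: uniformly quantize the trained weights to b bits and check how much the stored solution degrades. The `quantize` helper and the bit-widths below are illustrative assumptions.

```python
def quantize(weights, bits):
    """Uniformly quantize a list of floats to 2**bits levels
    spanning the observed weight range (illustrative only)."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if levels else 1.0
    return [lo + round((w - lo) / step) * step for w in weights]

weights = [0.13, -0.92, 0.55, 0.01]
for b in (8, 4, 2):
    q = quantize(weights, b)
    err = max(abs(w, ) if False else abs(w - v) for w, v in zip(weights, q))
    print(f"{b}-bit max error: {err:.4f}")
```

If accuracy on the task holds up at low bit-widths, the small net really is informationally cheap; if it collapses, the parameter count was hiding capacity in the precision.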
29.11.2024 23:16
Hi, interesting work. Did you try limiting the precision of the "compressed" network? Number of params is a very crude proxy for the actual information capacity.
See e.g. aclanthology.org/2024.acl-lon... and doi.org/10.1162/tacl...
and very similar work by Gaier & Ha 2019 arxiv.org/abs/1906.04358
29.11.2024 17:56
Accepted to ACL 2024 main conference! #ACL2024NLP
Neural nets can in theory learn formal languages such as aⁿbⁿ & Dyck. Yet no one ever finds such nets using standard techniques. Why?
We suggest that the culprit might have been the objective function all along.
arxiv.org/abs/2402.10013
17.06.2024 18:54
Minimum Description Length Recurrent Neural Networks
Abstract. We train neural networks to optimize a Minimum Description Length score, that is, to balance between the complexity of the network and its accuracy at a task. We show that networks optimizin...
Our findings are in line with work such as El-Naggar et al. (2023), which found similar shortcomings of common objectives for other architectures:
proceedings.mlr.press/v217/el-nagg...
As well as with our MDL RNNs, which achieve perfect generalization on aⁿbⁿ, Dyck-1, etc.:
direct.mit.edu/tacl/article...
3/3
17.02.2024 18:19
Training an LSTM for aⁿbⁿ using the cross-entropy loss consistently leads to imperfect counting, while using Minimum Description Length (MDL) leads to a provably perfect counting net.
We build an optimal aⁿbⁿ LSTM based on @gail_w et al. (2018) and find that it is not an optimum of the standard cross-entropy loss, even with regularization terms that are expected to lead to good generalization (L1/L2).
Meta-heuristics (early stopping, dropout) don't help either.
2/3
17.02.2024 18:14
We build an optimal aⁿbⁿ LSTM based on Weiss et al. (2018), and find that it does not lie at optima of standard loss terms (cross-entropy with/without L1/L2).
Moving to the Minimum Description Length objective (MDL) aligns the network with an optimum of the loss.
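In spirit, the MDL score sums the bits needed to encode the network itself with the bits needed to encode the data given the network. The toy scorer below is my own illustration of that two-part trade-off, not the encoding scheme from the paper:

```python
import math

def mdl_score(weight_code_lengths, probs_of_targets):
    """Toy two-part MDL score: |encode(network)| + |encode(data | network)|.
    weight_code_lengths: bits used to encode each weight;
    probs_of_targets: probability the net assigns to each observed symbol.
    Illustrative only."""
    model_cost = sum(weight_code_lengths)                      # bits for the net
    data_cost = -sum(math.log2(p) for p in probs_of_targets)   # bits for the data
    return model_cost + data_cost

# A big, sharp net vs. a small, slightly vaguer one on the same 50 symbols:
big = mdl_score([16] * 100, [0.99] * 50)   # 1600 + ~0.7 bits
small = mdl_score([4] * 10, [0.90] * 50)   # 40 + ~7.6 bits
print(big, small)  # the smaller net wins on total description length
```

Under cross-entropy alone only the second term exists, so nothing pushes the learner toward the compact, perfectly-generalizing solution.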
New paper with Emmanuel Chemla and @rkatzir.bsky.social:
Neural nets offer good approximation but consistently fail to generalize perfectly, even when perfect solutions are proved to exist.
We check whether the culprit might be their training objective.
arxiv.org/abs/2402.10013
17.02.2024 18:13
There is in AI today a tendency toward flashy, splashy domains--that is, toward developing programs that can do such things as medical diagnosis, geological consultation (for oil prospecting), designing of experiments in molecular biology, molecular spectroscopy, configuring of large computer systems, designing of VLSI circuits, and on and on. Yet there is no program that has common sense; no program that learns things that it has not been explicitly taught how to learn; no program that can recover gracefully from its own errors. The "artificial expertise" programs that do exist are rigid, brittle, inflexible. Like chess programs, they may serve a useful intellectual or even practical purpose, but despite much fanfare, they are not shedding much light on human intelligence. Mostly, they are being developed simply because various agencies or industries fund them. This does not follow the traditional pattern of basic science. That pattern is to try to isolate a phenomenon, to reduce it to its si...
Douglas Hofstadter on toy tasks, in Waking Up from the Boolean Dream, 1982
23.11.2023 10:09
We take this to show that recent claims about LLMs undermining the argument from the poverty of the stimulus are premature.
21.11.2023 16:18
Surprisal values for the sentence "I know who John met recently and is going to annoy soon", and its ungrammatical variant "I know who John met recently and is going to annoy you soon". GPT-2 and GPT-J wrongly assign higher probabilities to the ungrammatical continuation.
We now test a much larger battery of models on important syntactic phenomena: across-the-board movement and parasitic gaps.
Using cases where humans have clear acceptability judgements, we find that all models systematically fail to assign higher probabilities to grammatical continuations.
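The test itself reduces to summing per-token surprisals, -log2 p(token | prefix), and checking which continuation gets the lower total. A self-contained sketch with a stand-in probability function (a real run would query an LM's conditional next-token distribution, e.g. GPT-2's, instead of this toy):

```python
import math

def surprisal_sum(tokens, prob):
    """Total surprisal in bits: -sum over tokens of log2 p(token).
    `prob` stands in for a language model's conditional probability;
    this toy version ignores the prefix."""
    return sum(-math.log2(prob(tok)) for tok in tokens)

# Hypothetical stand-in probabilities, for illustration only:
toy = {"met": 0.2, "annoy": 0.1, "soon": 0.3, "you": 0.05}

def p(tok):
    return toy.get(tok, 0.01)

grammatical = ["met", "annoy", "soon"]
ungrammatical = ["met", "annoy", "you", "soon"]

# A model "passes" an item when the grammatical string gets the lower total:
print(surprisal_sum(grammatical, p) < surprisal_sum(ungrammatical, p))
```

The reported failures are exactly the items where real models return False on this comparison.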
21.11.2023 16:15
Accuracy figure for large language models tested on across-the-board sentences
New up-to-date version of Large Language Models and the Argument from the Poverty of the Stimulus, work with Emmanuel Chemla and @rkatzir.bsky.social:
ling.auf.net/lingbuzz/006...
21.11.2023 16:14
We find that minimizing the algorithmic complexity of the net (with MDL) results in better generalization, using significantly less data.
The second-best net, a Memory-Augmented RNN by Suzgun et al., shows that expressive power is important for GI but isn't enough when data is scarce.
02.10.2023 09:43
Why a new benchmark?
A long line of work has tested GI in different ways.
Many showed nets generalizing to some extent beyond training, but usually did not explain why generalization stopped at arbitrary points: why would a net get a¹⁰¹⁷b¹⁰¹⁷ right but a¹⁰¹⁸b¹⁰¹⁸ wrong?
02.10.2023 09:41
We introduce BLISS - a Benchmark for Language Induction from Small Sets.
The benchmark assigns a generalization index to a model based on how much it generalizes from how little training data.
The initial release includes languages such as aⁿbⁿ, aⁿbᵐcⁿ⁺ᵐ, and Dyck 1-2.
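For concreteness, here are membership checkers for two of these languages. These are my own sketches; the benchmark's actual data format and evaluation code may differ.

```python
def is_anbn(s):
    """True iff s = a^n b^n for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_dyck1(s):
    """True iff s is a balanced string over '(' and ')' (Dyck-1)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # closed more than opened
                return False
        else:
            return False        # illegal symbol
    return depth == 0

print(is_anbn("aaabbb"), is_dyck1("(()())"))  # True True
print(is_anbn("aabbb"), is_dyck1("(()"))      # False False
```

Checkers like these make the benchmark's question sharp: a model trained on short members must classify (or predict) arbitrarily long members correctly, not just slightly longer ones.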
02.10.2023 09:40
Grammar induction (GI) involves learning a formal grammar from a finite, often small, sample of a typically infinite language. To do this, a model must be able to generalize well.
Humans do this remarkably well based on very little data. What about neural nets?
02.10.2023 09:39