New paper: "A Minimum Description Length Approach to Regularization in Neural Networks"
with Orr Well, Emmanuel Chemla, @rkatzir.bsky.social and @nurikolan.bsky.social
We explore why neural networks often struggle with simple, structured tasks.
Spoiler: our regularizers might be the problem.
🧵
24.05.2025 16:01
Sorry, missed the G&H ref!
Re: precision, I meant that there's a chance the "compressed" network is not a bottleneck in practice, if it can store the same (or an approximate) solution as the larger net in high-precision weights.
Would be interesting to see how it does when quantized (at either test or training time)
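One way to probe this, as a minimal sketch rather than anything from the paper: uniformly quantize the trained weights to b bits and check how much the stored solution degrades. The `quantize` helper and the bit-widths below are illustrative assumptions.

```python
def quantize(weights, bits):
    """Uniformly quantize a list of floats to 2**bits levels
    spanning the observed weight range (illustrative only)."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if levels else 1.0
    return [lo + round((w - lo) / step) * step for w in weights]

weights = [0.13, -0.92, 0.55, 0.01]
for b in (8, 4, 2):
    q = quantize(weights, b)
    err = max(abs(w, ) if False else abs(w - v) for w, v in zip(weights, q))
    print(f"{b}-bit max error: {err:.4f}")
```

If accuracy on the task holds up at low bit-widths, the small net really is informationally cheap; if it collapses, the parameter count was hiding capacity in the precision.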
29.11.2024 23:16
Hi, interesting work. Did you try limiting the precision of the "compressed" network? Number of params is a very crude proxy for the actual information capacity.
See e.g. aclanthology.org/2024.acl-lon... and doi.org/10.1162/tacl...
and very similar work by Gaier & Ha 2019 arxiv.org/abs/1906.04358
29.11.2024 17:56
Accepted to ACL 2024 main conference! #ACL2024NLP
Neural nets can in theory learn formal languages such as aⁿbⁿ & Dyck. Yet no one ever finds such nets using standard techniques. Why?
We suggest that the culprit might have been the objective function all along.
arxiv.org/abs/2402.10013
17.06.2024 18:54
Minimum Description Length Recurrent Neural Networks
Abstract. We train neural networks to optimize a Minimum Description Length score, that is, to balance between the complexity of the network and its accuracy at a task. We show that networks optimizin...
Our findings are in line with work such as El-Naggar et al. (2023), which found similar shortcomings of common objectives for other architectures:
proceedings.mlr.press/v217/el-nagg...
As well as with our MDL RNNs, which achieve perfect generalization on aⁿbⁿ, Dyck-1, etc.:
direct.mit.edu/tacl/article...
3/3
17.02.2024 18:19
Training an LSTM for aⁿbⁿ using the cross-entropy loss consistently leads to imperfect counting, while using Minimum Description Length (MDL) leads to a provably perfect counting net.
We build an optimal aⁿbⁿ LSTM based on @gail_w et al. (2018) and find that it is not an optimum of the standard cross-entropy loss, even with regularization terms that are expected to lead to good generalization (L1/L2).
Meta-heuristics (early stopping, dropout) don't help either.
2/3
17.02.2024 18:14
We build an optimal aⁿbⁿ LSTM based on Weiss et al. (2018), and find that it does not lie at optima of standard loss terms (cross-entropy with/without L1/L2).
Moving to the Minimum Description Length objective (MDL) aligns the network with an optimum of the loss.
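In spirit, the MDL score sums the bits needed to encode the network itself with the bits needed to encode the data given the network. The toy scorer below is my own illustration of that two-part trade-off, not the encoding scheme from the paper:

```python
import math

def mdl_score(weight_code_lengths, probs_of_targets):
    """Toy two-part MDL score: |encode(network)| + |encode(data | network)|.
    weight_code_lengths: bits used to encode each weight;
    probs_of_targets: probability the net assigns to each observed symbol.
    Illustrative only."""
    model_cost = sum(weight_code_lengths)                      # bits for the net
    data_cost = -sum(math.log2(p) for p in probs_of_targets)   # bits for the data
    return model_cost + data_cost

# A big, sharp net vs. a small, slightly vaguer one on the same 50 symbols:
big = mdl_score([16] * 100, [0.99] * 50)   # 1600 + ~0.7 bits
small = mdl_score([4] * 10, [0.90] * 50)   # 40 + ~7.6 bits
print(big, small)  # the smaller net wins on total description length
```

Under cross-entropy alone only the second term exists, so nothing pushes the learner toward the compact, perfectly-generalizing solution.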
New paper with Emmanuel Chemla and @rkatzir.bsky.social:
Neural nets offer good approximation but consistently fail to generalize perfectly, even when perfect solutions are proved to exist.
We check whether the culprit might be their training objective.
arxiv.org/abs/2402.10013
17.02.2024 18:13
There is in AI today a tendency toward flashy, splashy domains--that is, toward developing programs that can do such things as medical diagnosis, geological consultation (for oil prospecting), designing of experiments in molecular biology, molecular spectroscopy, configuring of large computer systems, designing of VLSI circuits, and on and on. Yet there is no program that has common sense; no program that learns things that it has not been explicitly taught how to learn; no program that can recover gracefully from its own errors. The "artificial expertise" programs that do exist are rigid, brittle, inflexible. Like chess programs, they may serve a useful intellectual or even practical purpose, but despite much fanfare, they are not shedding much light on human intelligence. Mostly, they are being developed simply because various agencies or industries fund them. This does not follow the traditional pattern of basic science. That pattern is to try to isolate a phenomenon, to reduce it to its si...
Douglas Hofstadter on toy tasks, in Waking Up from the Boolean Dream, 1982
23.11.2023 10:09
We take this to show that recent claims about LLMs undermining the argument from the poverty of the stimulus are premature.
21.11.2023 16:18
Surprisal values for the sentence "I know who John met recently and is going to annoy soon", and its ungrammatical variant "I know who John met recently and is going to annoy you soon". GPT-2 and GPT-J wrongly assign higher probabilities to the ungrammatical continuation.
We now test a much larger battery of models on important syntactic phenomena: across-the-board movement and parasitic gaps.
Using cases where humans have clear acceptability judgements, we find that all models systematically fail to assign higher probabilities to grammatical continuations.
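The test itself reduces to summing per-token surprisals, -log2 p(token | prefix), and checking which continuation gets the lower total. A self-contained sketch with a stand-in probability function (a real run would query an LM's conditional next-token distribution, e.g. GPT-2's, instead of this toy):

```python
import math

def surprisal_sum(tokens, prob):
    """Total surprisal in bits: -sum over tokens of log2 p(token).
    `prob` stands in for a language model's conditional probability;
    this toy version ignores the prefix."""
    return sum(-math.log2(prob(tok)) for tok in tokens)

# Hypothetical stand-in probabilities, for illustration only:
toy = {"met": 0.2, "annoy": 0.1, "soon": 0.3, "you": 0.05}

def p(tok):
    return toy.get(tok, 0.01)

grammatical = ["met", "annoy", "soon"]
ungrammatical = ["met", "annoy", "you", "soon"]

# A model "passes" an item when the grammatical string gets the lower total:
print(surprisal_sum(grammatical, p) < surprisal_sum(ungrammatical, p))
```

The reported failures are exactly the items where real models return False on this comparison.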
21.11.2023 16:15
Accuracy figure for large language models tested on across-the-board sentences
New up-to-date version of Large Language Models and the Argument from the Poverty of the Stimulus, work with Emmanuel Chemla and @rkatzir.bsky.social:
ling.auf.net/lingbuzz/006...
21.11.2023 16:14
We find that minimizing the algorithmic complexity of the net (with MDL) results in better generalization, using significantly less data.
The second-best net, a Memory-Augmented RNN by Suzgun et al., shows that expressive power is important for GI but isn't enough when data is scarce.
02.10.2023 09:43
Why a new benchmark?
A long line of work has tested GI in different ways.
Many showed nets generalizing to some extent beyond training, but usually did not explain why generalization stopped at arbitrary points: why would a net get a¹⁰¹⁷b¹⁰¹⁷ right but a¹⁰¹⁸b¹⁰¹⁸ wrong?
02.10.2023 09:41
We introduce BLISS - a Benchmark for Language Induction from Small Sets.
The benchmark assigns a generalization index to a model based on how much it generalizes from how little training data.
The initial release includes languages such as aⁿbⁿ, aⁿbᵐcⁿ⁺ᵐ, and Dyck 1-2.
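For concreteness, here are membership checkers for two of these languages. These are my own sketches; the benchmark's actual data format and evaluation code may differ.

```python
def is_anbn(s):
    """True iff s = a^n b^n for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_dyck1(s):
    """True iff s is a balanced string over '(' and ')' (Dyck-1)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # closed more than opened
                return False
        else:
            return False        # illegal symbol
    return depth == 0

print(is_anbn("aaabbb"), is_dyck1("(()())"))  # True True
print(is_anbn("aabbb"), is_dyck1("(()"))      # False False
```

Checkers like these make the benchmark's question sharp: a model trained on short members must classify (or predict) arbitrarily long members correctly, not just slightly longer ones.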
02.10.2023 09:40
Grammar induction (GI) involves learning a formal grammar from a finite, often small, sample of a typically infinite language. To do this, a model must be able to generalize well.
Humans do this remarkably well based on very little data. What about neural nets?
02.10.2023 09:39