For some reason my account got suspended after posting this. Weird moderation.
Having restored my account, I'm reposting to increase visibility.
The paper has been accepted to EACL Findings. See you in Rabat! 🇲🇦
Shoutout to @mlkukic.bsky.social (just started his MS, hire him!) @ddaviddukic.bsky.social @mtutek.bsky.social and sensei Jan Šnajder for this cute collaboration.
📜https://arxiv.org/abs/2601.17585
💻https://github.com/takelab/repetition-sl
We establish multi-fold repetition with early exiting as a viable strategy for decoder-as-encoder adaptation, one that does not require complex architectural modifications or extensive training. 💰
For Mistral-7B, we find that embeddings from layer 24 (out of 32) can even outperform those from the last layer, while matching the processing time of the unrepeated input sequence.
To counteract the computational overhead, we experiment with early exiting: using representations from the models' intermediate layers instead of the last one. 💡
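For the curious, here is a minimal sketch of what the early-exit readout could look like with a vanilla HuggingFace interface (my own illustration, not the code from the repo; the model name and layer index just mirror the Mistral-7B example above):

```python
# Minimal early-exit sketch (illustrative, not the repo's code): take token
# representations from an intermediate layer instead of the final one.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint, for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tok("Barack Obama visited Zagreb .", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states = (embeddings, layer 1, ..., layer 32), so index 24 is the
# output of layer 24 -- the "early exit" point.
early_exit_layer = 24
token_reprs = out.hidden_states[early_exit_layer]  # (batch, seq_len, hidden_dim)
```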
However, the performance gains saturate around 4 repetitions, and each extra repetition adds computational cost. 🤨
Indeed, we observe performance gains over SotA baselines, such as removing the causal mask in all layers of the model (full unmasking) or only in the middle layers (middle unmasking), as well as over SotA encoder-only models (ModernBERT and RoBERTa).
Therefore, additional repetitions bring the model closer to a balanced ratio of left- and right-context information throughout the entire input sequence. ⚖️
We demonstrate the utility of increased repetitions on sequence labeling tasks such as NER or aspect-based sentiment analysis. 📈
We focus on token-level tasks as they require bidirectional context at each token, something decoder-only models lack.
Additional repetitions increase the proportion of bidirectional blocks, and with a little bit of high school math it is easy to see that this proportion approaches 1 as the number of repetitions grows, at which point the model resembles an encoder-only model.
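To make that high-school math concrete (my own back-of-the-envelope illustration, assuming the copies are simply concatenated and the standard causal mask is kept): with c copies, the attention matrix splits into c×c blocks; c(c+1)/2 of them are visible under the causal mask, and the c(c-1)/2 strictly below the block diagonal are fully bidirectional, i.e. a fraction of (c-1)/(c+1).

```python
# Fraction of visible attention blocks that are fully bidirectional when the
# input is concatenated c times under a causal mask (back-of-the-envelope
# illustration, not code from the repo).
for c in (1, 2, 4, 8, 100):
    visible = c * (c + 1) // 2        # blocks on or below the block diagonal
    bidirectional = c * (c - 1) // 2  # blocks strictly below the block diagonal
    print(f"{c} copies: {bidirectional / visible:.2f} bidirectional")
```

So the fraction climbs from 0 (no repetition) through 1/3 (one repetition) toward 1.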
💡 Thus, we wanted to take a look at what happens when a model is fine-tuned to utilize additional repetitions. In theory, repeating a sequence once already yields a bidirectional block in the attention matrix.
Previous work found that performance gains dissipate at higher repetition counts. 🔁🔁🔁...
We found this phenomenon counterintuitive since additional repetitions effectively increase the processing capacity of the model.
We already know prompt repetition is a handy hack to improve a decoder-only LM’s performance as it allows the model to “see” bidirectionally, an ability otherwise suppressed by the causal mask.
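In code, the trick boils down to something like this rough sketch (my own illustration with a generic HuggingFace decoder-only model, not the paper's exact pipeline): concatenate several copies of the input and read per-token representations from the last copy, whose tokens have already "seen" the full sequence in the earlier copies.

```python
# Rough sketch of the repetition trick (illustrative, not the paper's pipeline).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any decoder-only LM; assumed here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)

ids = tok("The film was long but the acting was superb .",
          return_tensors="pt", add_special_tokens=False).input_ids
k = 3                        # total number of copies fed to the model
repeated = ids.repeat(1, k)  # (1, k * seq_len)

with torch.no_grad():
    hidden = model(repeated).last_hidden_state

seq_len = ids.shape[1]
last_copy = hidden[:, -seq_len:, :]  # one vector per original token
```

Those last-copy vectors are presumably what you would plug into a token-classification head for sequence labeling.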
But what happens if we increase the number of repetitions? 🤔🧵 @eaclmeeting.bsky.social #EACL2026
Very honored to be one of the seven outstanding papers at this year's EMNLP :)
Huge thanks to my amazing collaborators @fatemehc.bsky.social @anamarasovic.bsky.social @boknilev.bsky.social; this would not have been possible without them!
Back from #ICML2025 🛬, and off 🚄 to Norrköping 🇸🇪 for #ic2s2
CLAN (cs.au.dk/~clan/) members are presenting 2 papers: 1 spotlight and 1 oral. See 🧵 for posters and summaries.
👋 Reach out to chat about observational studies, causality, LLM agents, human-centered AI etc.
We did a cool group project exploring diachronic embeddings for Croatian and found that (among other things) embeddings trained on later periods are more positive when plugged into models trained on earlier time periods.
Check out the thread 🧵 & come talk to us in Vienna about this & other works 🍻