Did you know that LLMs suffer from serious mode collapse?
For example, if you ask a model to tell you a joke, it almost always tells the same joke. This holds across samples and even across model families!
Why does this happen? Can we improve it?
08.10.2025 14:22
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.
Excited to share OLMo 2!
• 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc
• better architecture and recipe for training stability
• staged training, with new data mix Dolmino added during annealing
• state-of-the-art OLMo 2 Instruct models
#nlp #mlsky
links below
26.11.2024 20:59
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.
I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.
Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park
Apply by December 15th!
19.11.2024 10:38
✨I am on the faculty job market in the 2024-2025 cycle!✨
My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.
If you have relevant positions, let me know! lasharavichander.github.io Please share/RT!
11.11.2024 14:23
Why and when do preference annotators disagree? And how do reward models + LLM-as-Judge evaluators handle disagreements?
Michael explored these questions in a new ✨preprint✨ from his @ai2.bsky.social internship with me!
07.11.2024 17:38
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Pao Siangliulue, Raymond Fok, Juho Kim, Daniel S. Weld, Joseph Chee Chang, Kyle Lo. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
This is work with Yoonjoo Lee, @arnaik19.bsky.social @paopow.bsky.social, @juhokim.bsky.social, Dan Weld, @josephc.bsky.social, and @kylelo.bsky.social
at S2 @ai2.bsky.social, UW CSE and KAIST
For more, check out our
Dataset: github.com/bnewm0609/ar...
Paper: aclanthology.org/2024.emnlp-m...
11.11.2024 17:37
Two plots of recall versus threshold for determining a match: one for GPT-3.5 Turbo and another for Mixtral 8x22B. There are five lines in each plot. Each line travels from the top left to bottom right of the plot with y-intercepts that are generally in increasing order by the following types of context: generated caption, baseline, gold caption, in-context examples, caption + in-text references.
We also find that providing more table context (captions, in-text references) to models leads to higher recall when generating columns but does not help when generating values.
11.11.2024 17:37
A plot of recall versus threshold for determining a match between column headers. Llama3 has the highest recall because it hallucinates matches, but Sentence Transformers does better.
We find that using decontextualization with SBERT leads to a better evaluator than Llama 3, which hallucinates alignments.
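For illustration, a thresholded header-alignment evaluator of this kind might look like the sketch below. The `embed` argument stands in for any text-embedding function (an SBERT model's `encode` would be one choice); all function names and the greedy matching strategy here are illustrative, not the paper's actual implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_columns(pred_headers, gold_headers, embed, threshold=0.7):
    """Greedily match predicted column headers to gold headers.

    A predicted header matches the unused gold header with the highest
    embedding similarity, provided it clears the threshold.
    """
    matches = []
    used = set()
    for p in pred_headers:
        best, best_sim = None, threshold
        for g in gold_headers:
            if g in used:
                continue
            sim = cosine(embed(p), embed(g))
            if sim >= best_sim:
                best, best_sim = g, sim
        if best is not None:
            used.add(best)
            matches.append((p, best))
    return matches

def recall(matches, gold_headers):
    """Fraction of gold headers recovered by the alignment."""
    return len(matches) / len(gold_headers) if gold_headers else 0.0
```

Raising the threshold trades recall for precision, which is what the recall-versus-threshold curves in the plot above are sweeping over.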
11.11.2024 17:37
A diagram showing two steps of table generation. There is text that says "Step 1: Schema Generation" with an arrow pointing to the column headers of a generated table. Under it, there is text that says "Step 2: Value Generation" with an arrow pointing to the body of the generated table.
We propose a two-step procedure for generating tables given the input papers:
1️⃣ Generate the schemas (sets of columns)
2️⃣ Fill in the values.
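The two steps above could be sketched roughly as follows, with `llm` standing in for any prompt-to-text model call; the prompts and function names are illustrative assumptions, not the paper's actual implementation:

```python
def generate_table(papers, llm):
    """Two-step literature-review table generation (sketch).

    papers: list of paper texts (e.g. abstracts).
    llm: callable mapping a prompt string to a completion string.
    """
    # Step 1: schema generation -- ask the model for column headers
    # that would organize a comparison of the input papers.
    schema_prompt = (
        "Propose column headers for a literature-review table comparing "
        "these papers:\n" + "\n".join(papers) +
        "\nHeaders (comma-separated):"
    )
    columns = [c.strip() for c in llm(schema_prompt).split(",") if c.strip()]

    # Step 2: value generation -- fill one cell per (paper, column).
    table = []
    for paper in papers:
        row = {col: llm(f"From this paper:\n{paper}\nExtract: {col}")
               for col in columns}
        table.append(row)
    return columns, table
```

Splitting schema and value generation means the column set is fixed before any cell is filled, so every paper is described along the same dimensions.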
11.11.2024 17:37
An example literature review table with four rows and four columns. Each row is a paper (labeled Paper 1, Paper 2, etc.). Each column is a different aspect: ("Dataset", "Size", "Task", and "Annotations").
This table generation task takes as input multiple papers, and synthesizes them into a single output table. We collect a dataset of such tables and associated papers, and augment the tables with additional context such as their captions and in-text references.
11.11.2024 17:37
A screenshot of the first page of the paper discussed in the thread. Figure 1 contains a set of three cartoon papers with related text highlighted in three different colors. To its left, there's an arrow pointing to a cartoon table with a column corresponding to each color and a row corresponding to each paper.
✨EMNLP Paper!✨
Have you ever constructed a table to organize your literature review process? Can we use LMs to generate these automatically?
We are excited to present ArxivDIGESTables, a study of collecting, generating, and evaluating scientific literature review tables!
11.11.2024 17:37