Preprint: arxiv.org/abs/2503.09701
#NLP #ActiveLearning #LLMs
Much of active learning happens in practice rather than in papers, so it often goes undocumented. We surveyed the NLP community to collect insights on active learning.
Some key takeaways:
- Data annotation is still a bottleneck.
- LLMs complement rather than replace careful annotation.
- Longstanding challenges limit wider adoption.
🎉 Somewhat late, but excited to share: our paper “Reassessing Active Learning Adoption in Contemporary NLP: A Community Survey” has been accepted to #EACL2026!
We asked: What are data annotation needs in the era of LLMs? How is active learning actually used? How does it compare to other methods?
Wow.
Twitch says they won't allow "hatred, prejudice, or intolerance" on their platform.
But their automated moderation tool (AutoMod) really sucks
It misses ≈94% of hateful messages (unless they contain slurs)
+ it blocks ≈90% of benign messages that happen to use sensitive words in positive ways.
⚖️ Measuring Scalar Constructs in Social Science with LLMs
with rising (and established) stars in Computational Social Science
@haukelicht.bsky.social
@rupak-s.bsky.social
@patrickwu.bsky.social
@pranavgoel.bsky.social
@elliottash.bsky.social
@alexanderhoyle.bsky.social
arxiv.org/abs/2509.03116
We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use.
huggingface.co/datasets/coral-nlp/german-commons
Thank you for the clarification; however, KL3M was released in February 2024. That being said, I appreciate your contribution: this is solid work.
There is really no need to exaggerate your claim; it undermines the good work you have done.
Still, congrats on the publication!
Are you sure that you are the first? What about the KL3M models?
arxiv.org/pdf/2504.07854
🏆 Thrilled to share that our HateDay paper has received an Outstanding Paper Award at #ACL2025
Big thanks to my wonderful co-authors: @deeliu97.bsky.social, Niyati, @computermacgyver.bsky.social, Sam, Victor, and @paul-rottger.bsky.social!
Thread 👇and data avail at huggingface.co/datasets/man...
Congratulations, Manuel!
Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"!
Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_...
Nice to see axiomatic IR gaining momentum.
Happy to share that our paper "The Viability of Crowdsourcing for RAG Evaluation" received the Best Paper Honourable Mention at #SIGIR2025! Very grateful to the community for recognizing our work on improving RAG evaluation.
📄 webis.de/publications...
Dory from Finding Nemo with the quote: "I remember it like it was yesterday. Of course, I don't remember yesterday."
Do not forget to participate in the #TREC2025 Tip-of-the-Tongue (ToT) Track :)
The corpus and baselines (with run files) are now available and easily accessible via the ir_datasets API and the HuggingFace Datasets API.
More details are available at: trec-tot.github.io/guidelines
Oh no, what happened to Argilla? @hf.co Could you explain what's going on? It has barely been a year since you bought it.
#nlproc #nlp #ml
@ai2.bsky.social Any plans for plagiarism detection in semantic scholar? This would be incredibly useful, especially with the growing influx of (semi-)automatically generated papers.
Big fan of @ai2.bsky.social's semantic scholar feeds. Usually great for paper recommendations. Yesterday it recommended... a paper that blatantly plagiarized from a former student's thesis that I co-supervised. So, I guess the algorithm really knows my interests 😅.
Our recent paper on the impact of register (genre) on LLM performance. Key points: news texts perform poorly in evaluation, while opinionated texts are among the best. We hope this work can be used to understand the impact of register on LLMs and improve training data mixes! arxiv.org/abs/2504.01542
Plot shows the relationship between compute used to predict a ranking of datasets and how accurately that ranking reflects performance at the target (1B) scale of models pretrained from scratch on those datasets.
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared.
DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
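The small-scale-proxy idea above can be sketched with toy numbers. Note: the dataset names, scores, and the pairwise-accuracy metric below are illustrative assumptions for this sketch, not DataDecide's actual data or code.

```python
# Hypothetical benchmark scores for models trained on four candidate
# pretraining datasets: once at a small proxy scale, once at the 1B target scale.
small_scale = {"web": 0.42, "code": 0.38, "books": 0.45, "mix": 0.47}
target_scale = {"web": 0.55, "code": 0.56, "books": 0.58, "mix": 0.61}

def pairwise_decision_accuracy(proxy, target):
    """Fraction of dataset pairs that the proxy scores order the same
    way as the target-scale scores do."""
    names = sorted(proxy)
    agree = total = 0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            total += 1
            # Same sign of difference means the proxy made the right call.
            if (proxy[a] - proxy[b]) * (target[a] - target[b]) > 0:
                agree += 1
    return agree / total

print(f"{pairwise_decision_accuracy(small_scale, target_scale):.3f}")  # → 0.833
```

Here the cheap proxy gets 5 of 6 pairwise dataset comparisons right; the question DataDecide studies is how this accuracy trades off against the compute spent on the proxy runs.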
A bit of a mess around the conflict of COLM with the ARR (and to a lesser degree ICML) reviews release. We feel this is creating a lot of pressure and uncertainty. So, we are pushing our deadlines:
Abstracts due March 22 AoE (+48hr)
Full papers due March 28 AoE (+24hr)
Plz RT 🙏
Can a Large Language Model (LLM) with zero Pokémon-specific training achieve expert-level performance in competitive Pokémon battles?
Introducing PokéChamp, our minimax LLM agent that reaches top 30%-10% human-level Elo on Pokémon Showdown!
New paper on arXiv and code on github!
(1/8) Excited to share some new work: TESS 2!
TESS 2 is an instruction-tuned diffusion LM that can perform close to AR counterparts for general QA tasks, trained by adapting from an existing pretrained AR model.
📜 Paper: arxiv.org/abs/2502.13917
🤖 Demo: huggingface.co/spaces/hamis...
More below ⬇️
After 6+ months in the making and over a year of GPU compute, we're excited to release the "Ultra-Scale Playbook": hf.co/spaces/nanot...
A book to learn all about 5D parallelism, ZeRO, CUDA kernels, how/why overlap compute & coms with theory, motivation, interactive plots and 4000+ experiments!
More than 8500 submissions to ACL 2025 (ARR February 2025 cycle)! That is an increase of 3000 submissions compared to ACL 2024. It will be a fun reviewing period. 😅💯
@aclmeeting.bsky.social #ACL2025 #ACL2025nlp #NLP
Fixed: We need your support *for a* web survey.
Sorry, it seems bluesky has no edit feature yet.
I have the feeling I have not reached the NLP crowd on bluesky yet. Where are the large groups here? Who do I have to ping❓
Please consider participating or sharing our survey! (If you have any experience with supervised learning in natural language processing, you are eligible to participate in our survey.)
The survey has a partial focus on, but is not limited to, active learning. See the original post for details.
➡️ Extended Deadline: January 26th, 2025.
🔥 𝐅𝐢𝐧𝐚𝐥 𝐂𝐚𝐥𝐥 𝐚𝐧𝐝 𝐃𝐞𝐚𝐝𝐥𝐢𝐧𝐞 𝐄𝐱𝐭𝐞𝐧𝐬𝐢𝐨𝐧: Survey on Data Annotation and Active Learning
We need your support for a web survey in which we investigate how recent advancements in NLP, particularly LLMs, have influenced the need for labeled data in supervised machine learning.
#NLP #NLProc #ML #AI
Hello and happy New Year #NLProc :) Julia Romberg, a postdoc in my group in Cologne, together with other collaborators, is conducting a survey on the use of Active Learning in NLP. Find the link in the thread below!
❤️ We’re seeking responses from across the globe! If you know 1–3 people who might qualify for this survey—particularly those in different regions—please share it with them. We’d really appreciate it!
#NLP #NLProc #Annotation