(5/6) Sampling does not solve this problem either. For test completion, pass@k tends to plateau at 90%, and for test suite generation, coverage remains low even with extensive sampling!
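The pass@k numbers in this thread are presumably computed with the standard unbiased estimator popularized by the HumanEval paper (Chen et al., 2021); a minimal sketch, assuming n total samples per problem of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any size-k draw
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples of which c=1 passes, pass@1 is 0.5. Averaging this per-problem estimate over a benchmark gives the reported pass@k; the plateau described above means the estimate stops improving as k grows.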
19.12.2024 20:59
(4/6) We analyze errors from top models, finding that even current state-of-the-art models struggle with hallucination and reasoning about execution.
(3/6) Models also struggle with test completion, with top models only achieving 63.5% pass@5 for our first test completion setting (coverage improvement is also low at 26.9%).
(2/6) Current state-of-the-art models struggle with test suite generation. Even the best model, GPT-4o, only gets 35.2% coverage on TestGenEval.
(1/6) TestGenEval is sourced from large-scale Python repositories and targets real-world use cases: test authoring simulates a developer writing a test suite from scratch, while test completion mimics a developer aiming to improve the coverage of an existing test suite.
Thrilled to announce our new work TestGenEval, a benchmark that measures unit test generation and test completion capabilities. This work was done in collaboration with the FAIR CodeGen team.
Preprint: arxiv.org/abs/2410.00752
Leaderboard: testgeneval.github.io/leaderboard....
Hi, Bluesky!
I'm Catarina, a dual PhD student in Software Engineering with the CMU Portugal program (@carnegiemellon.bsky.social and U. Lisbon).
Imagine a world with reliable software and user-friendly verification tools. Let's build it together!
#PhDlife #SE #PL #HCI #CMU-Portugal
26.11.2024 17:07
And now that we're all here, some work! Are Large Language Models Memorizing Bug Benchmarks?
There's growing concern that LLMs for SE are prone to data leakage, but no one has quantified it... until now. 1/
26.11.2024 16:06