(5/6) Sampling does not solve this problem either. For test completion, pass@k tends to plateau at 90%, and for test suite generation, coverage remains low even with extensive sampling!
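The pass@k numbers in this thread are presumably computed with the standard unbiased estimator popularized by the HumanEval paper (Chen et al., 2021); a minimal sketch, assuming n total samples per problem of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any size-k draw
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples of which c=1 passes, pass@1 is 0.5. Averaging this per-problem estimate over a benchmark gives the reported pass@k; the plateau described above means the estimate stops improving as k grows.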
19.12.2024 20:59
(4/6) We analyze errors from top models, finding that even current state-of-the-art models struggle with hallucination and reasoning about execution.
(3/6) Models also struggle with test completion, with top models only achieving 63.5% pass@5 for our first test completion setting (coverage improvement is also low at 26.9%).
(2/6) Current state-of-the-art models struggle with test suite generation. Even the best model, GPT-4o, only gets 35.2% coverage on TestGenEval.
(1/6) TestGenEval is sourced from large-scale Python repositories and targets real-world use cases: test authoring simulates a developer writing a test suite from scratch, while test completion mimics a developer aiming to improve the coverage of an existing test suite.
Thrilled to announce our new work TestGenEval, a benchmark that measures unit test generation and test completion capabilities. This work was done in collaboration with the FAIR CodeGen team.
Preprint: arxiv.org/abs/2410.00752
Leaderboard: testgeneval.github.io/leaderboard....
Hi, Bluesky!
I'm Catarina, a dual PhD student in Software Engineering with the CMU Portugal program (@carnegiemellon.bsky.social and U. Lisbon).
Imagine a world with reliable software and user-friendly verification tools. Let's build it together!
#PhDlife #SE #PL #HCI #CMU-Portugal
26.11.2024 17:07
And now that we're all here, some work! Are Large Language Models Memorizing Bug Benchmarks?
There's growing concern that LLMs for SE are prone to data leakage, but no one has quantified it... until now. 1/
26.11.2024 16:06