
Kush Jain

@kjain14

SE PhD Student at Carnegie Mellon University interested in NLP for software engineering, program analysis and software testing. Former intern at Facebook AI Research.

44 Followers · 49 Following · 7 Posts · Joined 22.11.2024

Latest posts by Kush Jain @kjain14

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark
Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring o...

(6/6) Check out our preprint for more details: arxiv.org/abs/2410.00752 (w/Gabriel Synnaeve and Baptiste Rozière)

Homepage: testgeneval.github.io
Sample Explorer: testgeneval.github.io/demo.html
Dataset: huggingface.co/datasets/kja...
Code: github.com/facebookrese...

19.12.2024 20:59 👍 0 🔁 0 💬 0 📌 1

(5/6) Sampling does not solve this problem either. For test completion, pass@k tends to plateau at 90%, and for test suite generation, coverage remains low even with extensive sampling!

19.12.2024 20:59 👍 1 🔁 0 💬 1 📌 0

(4/6) We analyze errors from top models, finding that even current state-of-the-art models struggle with hallucination and reasoning about execution.

19.12.2024 20:59 👍 1 🔁 0 💬 1 📌 0

(3/6) Models also struggle with test completion, with top models only achieving 63.5% pass@5 for our first test completion setting (coverage improvement is also low at 26.9%).

19.12.2024 20:59 👍 1 🔁 0 💬 1 📌 0
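
For readers unfamiliar with the pass@k numbers quoted in this thread, here is a minimal sketch of the commonly used unbiased pass@k estimator: given n sampled generations of which c pass, it estimates the chance that at least one of k randomly chosen samples passes. Whether TestGenEval computes pass@k exactly this way is an assumption; the function name `pass_at_k` and the example counts are hypothetical and not taken from the preprint.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Hypothetical counts for illustration only (not results from the paper).
print(pass_at_k(n=20, c=3, k=5))
```

A per-benchmark score would then be the mean of this value over all problems, which is one reason the metric can plateau: problems with zero passing samples contribute 0 no matter how large k gets.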

(2/6) Current state-of-the-art models struggle with test suite generation. Even the best model, GPT-4o, only gets 35.2% coverage on TestGenEval.

19.12.2024 20:59 👍 1 🔁 0 💬 1 📌 0

(1/6) TestGenEval is sourced from large-scale Python repositories and targets real-world use cases: test authoring simulates a developer writing a test suite from scratch, while test completion mimics a developer aiming to improve the coverage of an existing test suite.

19.12.2024 20:59 👍 1 🔁 0 💬 1 📌 0

Thrilled to announce our new work TestGenEval, a benchmark that measures unit test generation and test completion capabilities. This work was done in collaboration with the FAIR CodeGen team.

Preprint: arxiv.org/abs/2410.00752
Leaderboard: testgeneval.github.io/leaderboard....

19.12.2024 20:59 👍 17 🔁 7 💬 1 📌 1

Hi, Bluesky! 👋
I'm Catarina, a dual PhD student in 🖥️ Software Engineering with the CMU Portugal program ( @carnegiemellon.bsky.social and U. Lisbon).

Imagine a world with reliable software and user-friendly verification tools. Let's build it together! 🚀

#PhDlife #SE #PL #HCI #CMU-Portugal

26.11.2024 17:07 👍 16 🔁 4 💬 0 📌 0

And now that we're all here, some work! 🚨 Are Large Language Models Memorizing Bug Benchmarks? 🚨
There's growing concern that LLMs for SE are prone to data leakage, but no one has quantified it... until now. 🕵️‍♂️ 1/

26.11.2024 16:06 👍 65 🔁 11 💬 2 📌 1