A repository of property-based tests for thorough benchmarking of LLM code generation.
The /tests directory contains directories labeled from 0 to 163, each of which contains a strategy.py file. This file contains the Hypothesis strategy for the corresponding problem from the HumanEval dataset. __init__.py files have been placed in each directory to allow for importing of the tests as modules. The strategies are available as the strategy attribute of these strategy modules. Usage of the strategies is as follows.
from hypothesis import given, strategies

@given(strategies.tuples(*st))
def test_property(args):
    # call functions as f(*args)
    # for example, assert f(*args) == ground_truth(*args)
    ...
Here, st is the imported strategy. One way to do this is using the importlib module.
import importlib

# humaneval_id is the problem number (0 to 163), matching a directory under /tests
st_module = importlib.import_module(f"tests.{humaneval_id}.strategy")
st = st_module.strategy
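The directory layout and the importlib lookup described above can be tried without the repository. The following is a minimal sketch, not the project's own loader: the package name demo_tests, the helper load_strategy, and the placeholder contents of strategy.py are all hypothetical, and the package is built in a temporary directory purely for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_strategy(package: str, humaneval_id: int):
    """Import <package>.<humaneval_id>.strategy and return its `strategy` attribute."""
    module = importlib.import_module(f"{package}.{humaneval_id}.strategy")
    return module.strategy


# Build a throwaway package that mimics the layout: demo_tests/0/strategy.py.
# "demo_tests" is a hypothetical name chosen to avoid clashing with real packages.
with tempfile.TemporaryDirectory() as root:
    problem_dir = Path(root) / "demo_tests" / "0"
    problem_dir.mkdir(parents=True)
    (Path(root) / "demo_tests" / "__init__.py").write_text("")
    (problem_dir / "__init__.py").write_text("")
    # A real strategy.py would define Hypothesis strategies; this is a placeholder.
    (problem_dir / "strategy.py").write_text("strategy = ('placeholder',)\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        loaded = load_strategy("demo_tests", 0)
    finally:
        sys.path.remove(root)

print(loaded)
```

Directory names such as 0 are not valid Python identifiers, so a plain import statement cannot reach them; passing the dotted name as a string to importlib.import_module is what makes the numeric layout usable.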
We show that Property-Based Testing (PBT) can improve the thoroughness of programming benchmarks by leveraging the canonical solutions within these benchmarks. For the HumanEval dataset, adequate property-based tests cannot be generated automatically using rule-based tools, so we carefully construct these tests manually. We show that our approach using PBT allows us to synthesize test cases as thorough as those generated using type-aware mutations in Liu et al.'s EvalPlus [1]. However, our approach can be easily adapted to other contexts.
We share our full set of property-based tests as a complementary resource to existing manual and synthesized test suites.
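To make the idea of testing against a canonical solution concrete, here is a small sketch of an equivalence test. It is an illustration under stated assumptions rather than the benchmark's evaluation script: ground_truth, buggy_candidate, the one-argument strategy tuple, and the helper is_equivalent are hypothetical stand-ins.

```python
from hypothesis import given, settings, strategies


def ground_truth(numbers):
    """Hypothetical canonical solution: largest element of a non-empty list."""
    return max(numbers)


def buggy_candidate(numbers):
    """Hypothetical generated code: wrong whenever every element is negative."""
    best = 0
    for x in numbers:
        if x > best:
            best = x
    return best


# One strategy per function argument, mirroring the `strategy` attribute layout.
strategy = (strategies.lists(strategies.integers(), min_size=1),)


def is_equivalent(candidate) -> bool:
    """Return True if no generated input separates candidate from ground_truth."""

    @settings(max_examples=1000, database=None, derandomize=True)
    @given(strategies.tuples(*strategy))
    def test_property(args):
        assert candidate(*args) == ground_truth(*args)

    try:
        test_property()
        return True
    except Exception:
        # Hypothesis re-raises the failure for the minimal counterexample.
        return False


print(is_equivalent(buggy_candidate))
print(is_equivalent(lambda numbers: sorted(numbers)[-1]))
```

A handful of hand-written cases with positive numbers would accept the buggy candidate, whereas generated inputs readily include an all-negative list that exposes it.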
A non-trivial strategy.
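# HumanEval 129: minPath
from hypothesis.strategies import composite, integers, lists, permutations

# MAX_SEQUENCE_LEN and MAX_INT are shared bounds (assumed to be the limits kept in limits/limits.py)

@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    grid = draw(lists(lists(integers(), min_size=n, max_size=n), min_size=n, max_size=n))
    # every integer from 1 to n*n must appear exactly once
    perm = draw(permutations(range(1, n**2 + 1)))
    # fill grid with perm
    for i in range(n):
        for j in range(n):
            grid[i][j] = perm[i*n + j]
    return grid

grid = create_grid()
k = integers(min_value=1, max_value=MAX_INT)
strategy = grid, k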
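A composite strategy like this one encodes the problem's input contract, so it is worth checking that the generated values satisfy it. The sketch below is a simplified variant, not the repository's code: it slices the permutation into rows directly, and the value chosen for MAX_SEQUENCE_LEN is an assumption made only so the example runs on its own.

```python
from hypothesis import given, settings
from hypothesis.strategies import composite, integers, permutations

MAX_SEQUENCE_LEN = 5  # assumed value for illustration; the real limit may differ


@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    """Draw an n x n grid holding each integer in 1..n*n exactly once."""
    n = draw(n_st)
    perm = draw(permutations(range(1, n * n + 1)))
    # Slice the flat permutation into n rows of length n.
    return [perm[i * n:(i + 1) * n] for i in range(n)]


sizes_seen = set()


@settings(max_examples=100, database=None)
@given(create_grid())
def check_grid_invariants(grid):
    n = len(grid)
    assert 2 <= n <= MAX_SEQUENCE_LEN
    assert all(len(row) == n for row in grid)
    flat = [value for row in grid for value in row]
    assert sorted(flat) == list(range(1, n * n + 1))
    sizes_seen.add(n)


check_grid_invariants()
print(sorted(sizes_seen))
```

Drawing the permutation first and reshaping it avoids generating a throwaway grid of arbitrary integers, at the cost of departing slightly from the structure of the original strategy.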
Examples of additional constraints on the input space. Here, we have restricted the alphabet and introduced bounds on the lengths of strings and lists.
# HumanEval 134: check_if_last_char_is_a_letter
txt = text(alphabet='abcde0123 ')
strategy = txt

# HumanEval 143: words_in_sentence
sentence = (
    text(alphabet="a ", min_size=1, max_size=100)
    .map(lambda s: re.sub(r"\s+", " ", s))  # collapse runs of whitespace
    .filter(lambda s: not (s.startswith(" ") or s.endswith(" ")))  # no leading or trailing space
)
strategy = sentence

# HumanEval 158: find_max
words = lists(text(alphabet='abc', max_size=MAX_SEQUENCE_LEN), min_size=1, max_size=MAX_SEQUENCE_LEN)
strategy = words
Tip
The ghAIstwriter project uses these strategies to create a dataset for finetuning LLMs to generate such strategies automatically.
For the MBPP dataset, we demonstrate that these tests can be generated largely automatically using GPT-3.5 by providing few-shot prompts based on some of our manually constructed tests. This demonstrates that our approach can be easily scaled to other datasets.
Warning
This is a work in progress, but some preliminary results are available here.
Work on PropertyEval has led to several contributions to EvalPlus as suggestions for improving contracts in many problems. These have been acknowledged by the authors of EvalPlus and incorporated in their subsequent releases.
- Incomplete contracts in problems involving balanced parentheses (1, 6)
- Contract misses infinities and nan (2, 99)
- Contract misses empty list in problem 35
- Incomplete contract in HumanEval #32 (find_zero)
- Incomplete contract in HumanEval #160 (do_algebra)
The /humaneval_groundtruth directory contains canonical solutions to HumanEval problems, adapted from the ground truth solutions provided with EvalPlus v0.1.0. The results from the equivalence tests on code samples for 84 (model, size, temperature) combinations provided with EvalPlus v0.1.0 are available in evaldata.csv. The script for executing this benchmark is a modified fork [2] of the EvalPlus script.
The limits/limits.py file contains several standardized limits for the strategies. The limits/fuzzer.py script runs fuzz tests on all HumanEval ground-truth solutions with the strategies in order to validate these limits.
Footnotes
1. Jiawei Liu et al. "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation". In: arXiv preprint arXiv:2305.01210 (2023)
2. We forked EvalPlus and modified the evaluation script to evaluate code samples with PropertyEval's property-based tests as well, in addition to the Base and Base + Extra test cases. We further modified the existing pipeline for estimating pass@k for PropertyEval's property-based tests also. The fork is available as EvalPlusPro. Some points to note are as follows.
   - The property-based tests are executed with 1000 examples, with @settings(max_examples=1000).
   - Instead of the time limits enforced by EvalPlus, we use the default deadline of 200ms that comes with Hypothesis.
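For readers unfamiliar with pass@k, the widely used unbiased estimator computes, for n samples of which c pass, the probability that at least one of k samples drawn without replacement passes: 1 - C(n - c, k) / C(n, k). The sketch below implements that formula with the standard library only. It is an illustration, not the code used in EvalPlusPro, and the function names and sample counts are hypothetical.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


# Hypothetical numbers of passing samples for three problems, 10 samples each.
print(benchmark_pass_at_k([0, 5, 10], n=10, k=1))
```

Stricter tests can only lower c for a given set of samples, which is why the same code samples may score lower under property-based tests than under the base test cases.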