This folder includes the code for the first two parts of the benchmark construction procedure as described in the paper, specifically 1. Repo selection and data scraping, and 2. Attribute-based filtering.
We include a comprehensive tutorial that describes the end-to-end procedure for collecting evaluation task instances from PyPI repositories.
SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.
To run collection on your own repositories, run the `run_get_tasks_pipeline.sh` script. Given a repository or list of repositories (formatted as `owner/name`), for each repository this command will generate:
- A `<repo>-prs.jsonl` file containing the metadata for every pull request from the repository.
- A `<repo>-task-instances.jsonl.all` file containing all valid task instances (those with an associated issue and a gold patch).
    - This file's values can be used for fine-tuning purposes.
- A `<repo>-task-instances.jsonl` file containing valid task instances that also have associated tests.
    - This file's values are candidate task instances. Once validated, they can be used for evaluation purposes.
    - The `.jsonl.all` file includes these task instances as well.
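Once a run finishes, the outputs can be sanity-checked with a few lines of Python. This is only a minimal inspection sketch: the file names below are placeholders for whatever your run produced, and the record fields are whatever the collection scripts emit.

```python
import json

# Placeholder file names; substitute the outputs generated for your repository.
paths = [
    "scikit-learn-prs.jsonl",
    "scikit-learn-task-instances.jsonl.all",
    "scikit-learn-task-instances.jsonl",
]

for path in paths:
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} records")
    if records:
        # Show which metadata fields each record carries.
        print(sorted(records[0].keys()))
```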
In this section, we briefly describe each of the files in this directory and their usage details.
🧐 GitHub Repository Selection
`get_top_pypi.py`
- Purpose: Retrieves the PyPI URL, GitHub URL, # of ⭐, and # of issues + PRs for the top 5000 most downloaded PyPI packages.
- Usage: `python get_top_pypi.py`
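For context, the per-repository statistics this step needs (stars, open issues and PRs) are available from the GitHub REST API. The snippet below is only an illustration of that lookup, not the script itself; the repository name and `GITHUB_TOKEN` environment variable are placeholders.

```python
import os
import requests

def repo_stats(owner_name: str, token: str | None = None) -> dict:
    """Fetch star and open issue/PR counts for one GitHub repository."""
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get(f"https://api.github.com/repos/{owner_name}", headers=headers)
    resp.raise_for_status()
    data = resp.json()
    return {
        "stars": data["stargazers_count"],
        # GitHub's open_issues_count covers both issues and pull requests.
        "issues_and_prs": data["open_issues_count"],
    }

print(repo_stats("scikit-learn/scikit-learn", token=os.environ.get("GITHUB_TOKEN")))
```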
⛏️ GitHub Data Collection
`print_pulls.py`
- Purpose: Given the `<owner/name>` of a GitHub repo, this script writes the raw information for all of the repo's PRs to a single `.jsonl` file.
- Usage: `python print_pulls.py <repo name> <path to PRs .jsonl file> --token <GitHub Token>`
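Conceptually, the script pages through every pull request for the repository and serializes each one as a line of JSON. The sketch below shows that idea with `requests` and the GitHub REST API; it is not the script's actual implementation, and the repository name and token are placeholders.

```python
import json
import requests

def dump_pulls(owner_name: str, out_path: str, token: str) -> None:
    """Page through all PRs for a repository and write one JSON object per line."""
    headers = {"Authorization": f"token {token}"}
    with open(out_path, "w") as out:
        page = 1
        while True:
            resp = requests.get(
                f"https://api.github.com/repos/{owner_name}/pulls",
                params={"state": "all", "per_page": 100, "page": page},
                headers=headers,
            )
            resp.raise_for_status()
            pulls = resp.json()
            if not pulls:
                break
            for pr in pulls:
                out.write(json.dumps(pr) + "\n")
            page += 1

dump_pulls("scikit-learn/scikit-learn", "scikit-learn-prs.jsonl", "<GitHub Token>")
```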
`build_dataset.py`
- Purpose: Given the path to a PRs `.jsonl` file generated by `print_pulls.py`, this script attempts to convert each PR to a task instance. It creates a `.jsonl.all` file for any PRs with an issue and a `.jsonl` file for any PRs with both an issue and modifications to that repository's tests.
- Usage: `python build_dataset.py <path to PRs .jsonl file> <path to output .jsonl file> --token <GitHub Token>`
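At a high level, the conversion is a filter over the PR metadata: a merged PR that resolves an issue becomes a task instance, and it additionally qualifies for the `.jsonl` file if it edits test files. The sketch below only illustrates that filtering idea; the field names (`merged_at`, `body`, `files`) and the issue-detection heuristic are assumptions, not the script's actual logic.

```python
import json
import re

# Heuristic for 'fixes #123'-style issue references in the PR body.
ISSUE_REF = re.compile(r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#\d+", re.IGNORECASE)

def resolves_issue(pr: dict) -> bool:
    return bool(ISSUE_REF.search(pr.get("body") or ""))

def touches_tests(paths: list[str]) -> bool:
    return any("test" in p.lower() for p in paths)

def filter_prs(prs_path: str) -> tuple[list[dict], list[dict]]:
    """Split PRs into (all task instances, task instances that also modify tests)."""
    all_instances, with_tests = [], []
    with open(prs_path) as f:
        for line in f:
            pr = json.loads(line)
            if not pr.get("merged_at") or not resolves_issue(pr):
                continue
            all_instances.append(pr)
            changed = [f_["filename"] for f_ in pr.get("files", [])]  # assumed field layout
            if touches_tests(changed):
                with_tests.append(pr)
    return all_instances, with_tests
```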
`get_tasks_pipeline.py`
- Purpose: Automates invocation of the repo → task instance construction pipeline (`print_pulls.py` + `build_dataset.py`) for multiple repositories.
- Usage: `./run_get_tasks_pipeline.sh` (check the file for arguments)
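Under the hood, the pipeline amounts to running the two scripts above once per repository. A rough, stand-alone equivalent of that loop (with placeholder repositories, token, and output directory) looks like this:

```python
import subprocess

repos = ["scikit-learn/scikit-learn", "sympy/sympy"]  # example repositories
token = "<GitHub Token>"
out_dir = "data"  # placeholder output directory

for repo in repos:
    name = repo.split("/")[-1]
    prs_path = f"{out_dir}/{name}-prs.jsonl"
    tasks_path = f"{out_dir}/{name}-task-instances.jsonl"
    # Step 1: dump raw PR metadata; Step 2: convert PRs to candidate task instances.
    subprocess.run(["python", "print_pulls.py", repo, prs_path, "--token", token], check=True)
    subprocess.run(["python", "build_dataset.py", prs_path, tasks_path, "--token", token], check=True)
```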
🎵 Fine Tuning Dataset Construction
`build_dataset_ft.py`
- Purpose: Given the path to a collection of `.jsonl.all` files generated by `build_dataset.py`, this is a simple script to combine all such files into a single `.jsonl` file that can be used to construct an instruction-tuning dataset based on [problem statement + original code, code Δ] pairs.
- Usage: `./run_build_dataset_ft` (check the file for arguments)
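The merge itself is essentially JSONL concatenation. The sketch below (with placeholder paths) shows the gist under that assumption; the real script may add de-duplication or extra fields.

```python
import glob
import json

# Placeholder locations; point the glob at wherever your .jsonl.all files live.
merged_path = "fine_tuning_dataset.jsonl"

with open(merged_path, "w") as out:
    for path in glob.glob("data/*-task-instances.jsonl.all"):
        with open(path) as f:
            for line in f:
                record = json.loads(line)  # also validates that each line is well-formed JSON
                out.write(json.dumps(record) + "\n")
```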
🪞 Mirroring Repositories
`make_repo.sh`
- Purpose: A script for creating a mirror repository of an existing repository on GitHub. Examples are available under the swe-bench organization.
- Usage: `python call_make_repo.py` (check the file for arguments)
🧹 Clean Up
`delete_gh_workflows.py`
- Purpose: Recurring workflows from mirror repositories can clog up the inbox of the email account associated with your GitHub token. Given a repo URL, this script automates removing the `.github/workflows` folder from all branches of a repository.
- Usage: `python delete_gh_workflows.py <repo URL>`
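The same effect can be achieved manually with plain git: check out each branch, delete the folder, commit, and push. The sketch below only illustrates that loop, not the script's implementation; it pushes directly to every branch of the clone, so run something like this only against a mirror you own.

```python
import subprocess

def run(*args, cwd=None, capture=False):
    return subprocess.run(args, cwd=cwd, check=True, capture_output=capture, text=True)

def strip_workflows(repo_url: str, clone_dir: str = "mirror") -> None:
    """Delete .github/workflows on every branch of a fresh clone (illustrative only)."""
    run("git", "clone", repo_url, clone_dir)
    branches = run("git", "branch", "-r", "--format=%(refname:short)",
                   cwd=clone_dir, capture=True).stdout.split()
    for remote_branch in branches:
        if "HEAD" in remote_branch:
            continue
        branch = remote_branch.split("/", 1)[1]
        run("git", "checkout", branch, cwd=clone_dir)
        run("git", "rm", "-r", "--ignore-unmatch", ".github/workflows", cwd=clone_dir)
        # --allow-empty keeps the loop simple on branches without a workflows folder.
        run("git", "commit", "--allow-empty", "-m", "Remove GitHub workflows", cwd=clone_dir)
        run("git", "push", "origin", branch, cwd=clone_dir)
```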
`remove_envs.py`
- Purpose: SWE-bench's evaluation and validation harnesses rely on creating multiple conda virtual environments to speed up benchmark evaluation. Use this script to parallelize the removal of conda environments that share the same name prefix.
- Usage: `python remove_envs.py <prefix> --conda_path <path to conda installation>`
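One way to do this by hand is to list the environments, keep those whose names start with the prefix, and remove them concurrently. The sketch below illustrates that approach with plain `conda` commands and a thread pool; the prefix and conda executable are placeholders, and this is not the script's actual implementation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def matching_envs(prefix: str, conda: str) -> list[str]:
    """List conda environments whose names start with the given prefix."""
    out = subprocess.run([conda, "env", "list"],
                         check=True, capture_output=True, text=True).stdout
    names = [line.split()[0] for line in out.splitlines()
             if line.strip() and not line.startswith("#")]
    return [name for name in names if name.startswith(prefix)]

def remove_env(name: str, conda: str) -> None:
    # Remove all packages in the named environment, which deletes the environment.
    subprocess.run([conda, "remove", "-n", name, "--all", "-y"], check=True)

if __name__ == "__main__":
    conda = "conda"  # or the full path to your conda executable
    envs = matching_envs("swe-bench-", conda)  # example prefix
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda name: remove_env(name, conda), envs))
```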