CRUXEVAL-X is a multilingual code reasoning benchmark that covers 19 programming languages and builds on CRUXEVAL. Each language has at least 600 subjects, for a total of 19,000 content-consistent tests.
This repository provides the dataset and the method used to construct the benchmark.
```bash
cd ./docker
docker build -t cruxeval_x .
```
```bash
cd ./docker
bash run_docker.bash
docker exec -it cruxeval_x_env /bin/bash
```
Before running the benchmark construction, download the deepseek-coder-33b-instruct model to ./model and replace "your api key", "your base url", and "your model name" with your own values.
To run the full pipeline:
```bash
cd ./cruxeval-x
bash ./script/benchmark_construction.sh
```
To run only one step, find the script for that step in ./script and run it.
All data is in ./data; directories whose names start with "example" contain the examples used for few-shot inference. The final dataset is in ./data/cruxeval_preprocessed, which you can also download from 🤗 Hugging Face.
The data is in JSON Lines format: each line is a JSON object with the following fields:
```json
{
    "id": "The id of each problem, consistent with the CRUXEVAL benchmark. The same id across different languages denotes the same problem.",
    "code": "The code whose execution the model needs to understand.",
    "input_reasoning": "The check function with the input replaced by '????'.",
    "output_reasoning": "The check function with the output replaced by '????'."
}
```
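For reference, here is a minimal sketch of how one language split can be loaded. The exact file layout and names under ./data/cruxeval_preprocessed are assumptions; only the fields follow the format above.

```python
import json

# Hypothetical path: one JSON-Lines file per language under
# ./data/cruxeval_preprocessed (the actual layout may differ).
path = "./data/cruxeval_preprocessed/python.jsonl"

with open(path, encoding="utf-8") as f:
    problems = [json.loads(line) for line in f if line.strip()]

for p in problems[:3]:
    print(p["id"])               # shared across languages for the same problem
    print(p["code"])             # code whose execution the model must reason about
    print(p["input_reasoning"])  # check function with the input masked as '????'
    print(p["output_reasoning"]) # check function with the output masked as '????'
```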
The inference scripts are in ./script. For open-source models, first download the model to ./model, then run:
```bash
cd ./cruxeval-x
bash ./script/inference_vllm.bash
```
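The script uses vLLM for batched generation. Conceptually it amounts to something like the following sketch; the model path, sampling settings, and prompt are placeholders, not the script's actual configuration:

```python
from vllm import LLM, SamplingParams

# Placeholder model path: any model downloaded to ./model works the same way.
llm = LLM(model="./model/phi-1")
params = SamplingParams(temperature=0.2, max_tokens=512)

# Prompts would be built from the benchmark's code plus the masked check function.
prompts = ["..."]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```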
For closed-source models, provide the model name, API key, and base URL, then run:
```bash
cd ./cruxeval-x
bash ./script/inference_openai.bash
```
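The script wraps an OpenAI-compatible endpoint. A minimal sketch of the idea, using the same placeholders the script expects (the prompt content is hypothetical):

```python
from openai import OpenAI

# Replace the placeholders with your own credentials, as in the script.
client = OpenAI(api_key="your api key", base_url="your base url")

response = client.chat.completions.create(
    model="your model name",
    messages=[{"role": "user", "content": "..."}],  # benchmark prompt goes here
)
print(response.choices[0].message.content)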
We have evaluated a number of models on our benchmark; the results can be found on our leaderboard.
We also provide the detailed inference results for each model. An example (phi-1) is in ./cruxeval-x/infer_results/phi-1. The results for all LLMs can be downloaded from here.
To submit your model's results to our benchmark, submit them to our Codabench platform. We provide the inference script in ./cruxeval-x/script/inference_codabench.bash; run it to get the inference results:
```bash
cd ./cruxeval-x
bash ./script/inference_codabench.bash
```
We also provide an example Codabench submission in ./cruxeval-x/infer_results/phi-1_codabench.
The submission format is described in detail below:
```
├── phi-1_codabench.zip        # zip all the files in this folder
│   ├── cpp_input.json         # file names must follow the format {lang}_{task}.json
│   ├── cpp_output.json
│   ├── cs_input.json
│   ├── cs_output.json
│   ...
```
Each {lang}_{task}.json contains entries of the form:
```json
{
    "id": "index of the code in CRUXEVAL",
    "code": "input code with '????'",
    "answer": "your model's answer"
}
```
To skip a problem, give only its "id" without "code" and "answer".
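As an illustration, here is a minimal sketch that packages per-language result files into the expected zip. The results dict, output paths, and the choice of a JSON array per file are hypothetical; match the provided phi-1 example for the authoritative layout.

```python
import json
import zipfile
from pathlib import Path

# Hypothetical results keyed by (lang, task).
results = {
    ("cpp", "input"): [
        {"id": "0", "code": "/* input code with '????' */", "answer": "42"},
        {"id": "1"},  # skipped problem: "id" only, no "code"/"answer"
    ],
}

out_dir = Path("my_model_codabench")
out_dir.mkdir(exist_ok=True)

for (lang, task), records in results.items():
    # File names must follow {lang}_{task}.json.
    (out_dir / f"{lang}_{task}.json").write_text(
        json.dumps(records, indent=2), encoding="utf-8"
    )

with zipfile.ZipFile("my_model_codabench.zip", "w") as zf:
    for path in out_dir.iterdir():
        zf.write(path, arcname=path.name)  # zip the files, not the folder
```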
The final file you submit to Codabench should be the same as the example in ./cruxeval-x/infer_results/phi-1_codabench/phi-1_codabench.zip.