
Inference Script for LongBench

LongBench is a bilingual, multitask benchmark for the comprehensive assessment of the long-context understanding capabilities of large language models. This project tested the performance of the related models on the LongBench dataset.

In the following, we will introduce the prediction method for the LongBench dataset. Users can also refer to our Colab notebook.

Preparation

Environment setup

Set up the environment according to requirements.txt, which has been copied to scripts/longbench:

pip install -r scripts/longbench/requirements.txt

Dataset Preparation

The inference script will automatically download the dataset from 🤗 Datasets.
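
For reference, a LongBench subset can also be loaded manually with the 🤗 Datasets library. The following is a minimal sketch, assuming the THUDM/LongBench dataset on the Hugging Face Hub and using multifieldqa_zh as an example subset:

# Minimal sketch: manually load one LongBench subset for inspection.
# "multifieldqa_zh" is an example; any LongBench task name works here.
from datasets import load_dataset

# Recent versions of `datasets` may require trust_remote_code=True,
# since LongBench ships with its own loading script.
data = load_dataset("THUDM/LongBench", "multifieldqa_zh", split="test")
print(len(data))       # number of test samples
print(data[0].keys())  # fields such as input, context, answers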

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_llama2_or_alpaca2
output_dir=path/to/output_dir
data_class=zh
with_inst="true" # or "false" or "auto"
max_length=3584

cd scripts/longbench
python pred_llama2.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --with_inst ${with_inst} \
    --max_length ${max_length}

Arguments

  • --model_path ${model_path}: Path to the model to be evaluated (the full Chinese-LLaMA-2 or Chinese-Alpaca-2 model, not LoRA weights).
  • --predict_on ${data_class}: The tasks to predict on. Possible values are en, zh, and code, or a combination such as en,zh,code.
  • --output_dir ${output_dir}: Output directory for the predictions and logs.
  • --max_length ${max_length}: Maximum length of the instructions. Note that the lengths of the system prompt and the task-related prompt are not included.
  • --with_inst ${with_inst}: Whether to use the system prompt and template of Chinese-Alpaca-2 when constructing the instructions:
    • true: use the system prompt and template on all tasks
    • false: use the system prompt and template on none of the tasks
    • auto: use the system prompt and template on some tasks (the default strategy of the official LongBench code)
    We suggest setting --with_inst to auto when testing Alpaca models, and to false when testing LLaMA models.
  • --gpus ${gpus}: The GPUs to use, such as 0,1.
  • --alpha ${alpha}: The scaling factor of the NTK method, usually set to sequence_length / model_context_length * 2 - 1, or simply to auto (see the worked example after this list).
  • --e: Predict on the LongBench-E dataset. See the official documentation for details on LongBench-E.
  • --use_flash_attention_2: Use Flash-Attention 2 to speed up inference.
  • --use_ntk: Use Dynamic-NTK to extend the context window. Does not work with the 64K version of the long-context model.
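
To make the --alpha formula concrete, here is a small worked example in Python. The context window of 4096 matches the base Llama-2 models; the target sequence length of 16384 is an illustrative assumption:

# Worked example for the --alpha formula:
#   alpha = sequence_length / model_context_length * 2 - 1
model_context_length = 4096   # training context window of the base model
sequence_length = 16384       # desired inference length (illustrative)

alpha = sequence_length / model_context_length * 2 - 1
print(alpha)  # 7.0 -> pass --alpha 7.0, or simply use --alpha auto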

When the script has finished running, the prediction files are stored under ${output_dir}/pred/ or ${output_dir}/pred_e/ (depending on whether you are testing on LongBench-E). Run the following command to compute the metrics:

python eval.py --output_dir ${output_dir}

If testing on LongBench-E, provide -e when computing metrics:

python eval.py --output_dir ${output_dir} -e

The results are stored in ${output_dir}/result.json or ${output_dir}/pred_e/result.json. For example, the results of Chinese-Alpaca-2-7B on the LongBench Chinese tasks (--predict_on zh) are:

{
    "lsht": 20.5,
    "multifieldqa_zh": 32.74,
    "passage_retrieval_zh": 4.5,
    "vcsum": 11.52,
    "dureader": 16.59
}