
Inference Script for LongBench

LongBench is a bilingual, multitask benchmark for the comprehensive assessment of the long-context understanding capabilities of large language models. This project tested the performance of the related models on the LongBench dataset.

In the following, we will introduce the prediction method for the LongBench dataset. Users can also refer to our Colab notebook.

Preparation

Environment setup

Set up the environment according to requirements.txt, which has been copied to scripts/longbench:

pip install -r scripts/longbench/requirements.txt

Dataset Preparation

The inference script will automatically download the dataset from 🤗 Datasets.
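
For reference, a LongBench subset can also be loaded manually with the 🤗 Datasets library. The following is a minimal sketch, assuming the THUDM/LongBench dataset on the Hugging Face Hub and using multifieldqa_zh as an example subset:

# Minimal sketch: manually load one LongBench subset for inspection.
# "multifieldqa_zh" is an example; any LongBench task name works here.
from datasets import load_dataset

# Recent versions of `datasets` may require trust_remote_code=True,
# since LongBench ships with its own loading script.
data = load_dataset("THUDM/LongBench", "multifieldqa_zh", split="test")
print(len(data))       # number of test samples
print(data[0].keys())  # fields such as input, context, answers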

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_llama2_or_alpaca2
output_dir=path/to/output_dir
data_class=zh
with_inst="true" # or "false" or "auto"
max_length=3584

cd scripts/longbench
python pred_llama2.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --with_inst ${with_inst} \
    --max_length ${max_length}

Arguments

  • --model_path ${model_path}: Path to the model to be evaluated (the full Chinese-LLaMA-2 or Chinese-Alpaca-2 model, not LoRA weights).
  • --predict_on ${data_class}: The tasks to predict on. Possible values are en, zh, and code, or a combination such as en,zh,code.
  • --output_dir ${output_dir}: Output directory for the predictions and logs.
  • --max_length ${max_length}: Maximum length of the instructions. Note that the lengths of the system prompt and the task-related prompt are not included.
  • --with_inst ${with_inst}: Whether to use the system prompt and template of Chinese-Alpaca-2 when constructing the instructions:
    • true: use the system prompt and template on all tasks
    • false: use the system prompt and template on none of the tasks
    • auto: use the system prompt and template on some tasks (the default strategy of the official LongBench code)
    We suggest setting --with_inst to auto when testing Alpaca models, and to false when testing LLaMA models.
  • --gpus ${gpus}: The GPUs to use, such as 0,1.
  • --alpha ${alpha}: The scaling factor of the NTK method, usually set to sequence_length / model_context_length * 2 - 1, or simply to auto (see the worked example after this list).
  • --e: Predict on the LongBench-E dataset. See the official documentation for details on LongBench-E.
  • --use_flash_attention_2: Use Flash-Attention 2 to speed up inference.
  • --use_ntk: Use Dynamic-NTK to extend the context window. Does not work with the 64K version of the long-context model.
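
To make the --alpha formula concrete, here is a small worked example in Python. The context window of 4096 matches the base Llama-2 models; the target sequence length of 16384 is an illustrative assumption:

# Worked example for the --alpha formula:
#   alpha = sequence_length / model_context_length * 2 - 1
model_context_length = 4096   # training context window of the base model
sequence_length = 16384       # desired inference length (illustrative)

alpha = sequence_length / model_context_length * 2 - 1
print(alpha)  # 7.0 -> pass --alpha 7.0, or simply use --alpha auto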

When the script has finished running, the prediction files are stored under ${output_dir}/pred/ or ${output_dir}/pred_e/ (depending on whether you are testing on LongBench-E). Run the following command to compute the metrics:

python eval.py --output_dir ${output_dir}

If testing on LongBench-E, provide -e when computing metrics:

python eval.py --output_dir ${output_dir} -e

The results are stored in ${output_dir}/result.json or ${output_dir}/pred_e/result.json. For example, the results of Chinese-Alpaca-2-7B on the LongBench Chinese tasks (--predict_on zh) are:

{
    "lsht": 20.5,
    "multifieldqa_zh": 32.74,
    "passage_retrieval_zh": 4.5,
    "vcsum": 11.52,
    "dureader": 16.59
}