CoT-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

🔥 Open CoT Leaderboard | 🔥 Results Exploration (Notebook)

Table of Contents

Goal
Pipeline
Installation
Usage
Misc
Built with
License

Goal

Set up a pipeline (and provide the missing parts) to evaluate the effectiveness of chain-of-thought (CoT) reasoning in language models.

Pipeline

cot-eval is intended to be used in conjunction with Eleuther's lm-evaluation-harness (or similar packages, such as catwalk) to assess a model's ability to generate high-quality (i.e., effective) chain-of-thought reasoning traces.

The pipeline is as follows:

1. Specify an eval configuration, including
   - model: the model to evaluate (e.g. mistralai/Mistral-7B-Instruct-v0.2)
   - task: the task to evaluate on (e.g. logiqa, lsat)
   - chain: the prompt chain used to generate the reasoning traces
   - decoding: the decoding strategy and parameters to use for reasoning (beam search, temperature, etc.)
2. Perturb the task (to guard against potential training data contamination).
3. Run cot-eval to generate the reasoning traces with the model, according to the configuration, for the perturbed task, and push the reasoning traces to the HF hub.
4. Run lm-evaluation-harness to evaluate the model on the original task. This gives us scores-1.
5. Run lm-evaluation-harness to evaluate the model on the perturbed task. This gives us scores-2.
6. Run lm-evaluation-harness to evaluate the model on the perturbed task with added reasoning traces. This gives us scores-3.
7. Conclude:
   - The difference between scores-1 and scores-2 is an indicator of training data contamination.
   - The difference between scores-2 and scores-3 is an indicator of CoT effectiveness, i.e. the model's reasoning skill.
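The concluding comparison can be sketched as a small shell helper. This is an illustration only: the function name cot_indicators, the output labels, and the sign convention (scores-1 minus scores-2, and scores-3 minus scores-2) are assumptions and not part of cot-eval, and the example scores are made up.

```shell
# Hypothetical helper (not part of cot-eval): derive the two indicators
# from three accuracy scores.
#   $1 = scores-1 (original task)
#   $2 = scores-2 (perturbed task)
#   $3 = scores-3 (perturbed task with reasoning traces)
cot_indicators() {
  awk -v s1="$1" -v s2="$2" -v s3="$3" 'BEGIN {
    # Assumed sign convention: a positive value suggests contamination.
    printf "contamination_indicator=%.3f\n", s1 - s2
    # Assumed sign convention: a positive value suggests the traces help.
    printf "cot_effectiveness=%.3f\n", s3 - s2
  }'
}

# Example call with made-up scores
cot_indicators 0.42 0.38 0.45
```

A single score difference is only an indicator; how large a gap has to be before it counts as meaningful depends on the task and the sample size.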

Installation

git clone https://github.com/logikon-ai/cot-eval.git
cd cot-eval
pip install -e ".[cuda]"

Usage

Note

Use a personal HUGGINGFACEHUB_API_TOKEN. You have to be a member of the Open CoT Leaderboard for this to work.

See run.sh for an implementation of the pipeline.

cot-eval --help

With Docker 🐳

Step 1. Clone cot-eval repo.

git clone https://github.com/logikon-ai/cot-eval.git
cd cot-eval

Step 2. Pull docker image

docker pull logikon/cot-eval:latest

Step 2a. Alternatively, build the Docker image locally (this lets you adapt build args, e.g. VLLM_VERSION)

docker build --no-cache -t cot-eval --build-arg="VLLM_VERSION=0.3.0" . # change vllm version if necessary

Step 3. Set parameters and arguments

vim config.env  # adapt config.env; in particular, set NEXT_MODEL_PATH="..." and HUGGINGFACEHUB_API_TOKEN="..."
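
Before starting the container, it can help to verify that the two variables named above are actually set. The following is a minimal sketch, not part of cot-eval: the function name check_config is hypothetical, and it assumes config.env uses plain VAR=value or VAR="value" lines.

```shell
# Hypothetical pre-flight check (not part of cot-eval): returns non-zero if a
# required variable is missing or has an empty value in the given env file.
check_config() {
  file="$1"
  status=0
  for var in NEXT_MODEL_PATH HUGGINGFACEHUB_API_TOKEN; do
    # Accept VAR=value or VAR="value" as long as the value is non-empty
    if ! grep -Eq "^${var}=\"?[^\"[:space:]]+" "$file"; then
      echo "missing or empty: ${var}" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage: check_config config.env && echo "config.env looks complete"
```

The check only looks at whether a value is present; it does not validate the token or the model path.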

Step 4. Run docker container

cat config.env  # check
docker run -it --rm --gpus all --ipc=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --env-file config.env logikon/cot-eval:latest

With Enroot

# export TMPDIR=...
cd $TMPDIR

git clone https://github.com/logikon-ai/cot-eval.git
# edit config
vim cot-eval/config.env

export ENROOT_DATA_PATH=$TMPDIR/enroot-data
mkdir $ENROOT_DATA_PATH
export ENROOT_CONFIG_PATH=$TMPDIR/enroot-config
mkdir $ENROOT_CONFIG_PATH
touch $ENROOT_CONFIG_PATH/enroot.config
mkdir $ENROOT_CONFIG_PATH/environ.d
cp cot-eval/config.env $ENROOT_CONFIG_PATH/environ.d

enroot import docker://logikon/cot-eval
enroot create --name cot-eval logikon+cot-eval.sqsh
rm logikon+cot-eval.sqsh

enroot start --rw cot-eval

Alternatively:

ENROOT_SQUASH_OPTIONS='-comp lz4 -noD' enroot import docker://logikon/cot-eval
enroot start --rw logikon+cot-eval.sqsh

With Slurm / Apptainer

We use the following Slurm batch script on booster:

#!/bin/bash -x
#SBATCH --account=<PROJECT_ID>
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=12:00:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:1

jutil env activate -p <PROJECT_ID>

# create tmp folder to bind with container
mkdir -p $SCRATCH/$SLURM_JOB_ID

apptainer run \
  --nv \
  --env HF_HOME=/mnt/cache/huggingface \
  --env-file $PROJECT/config.env \
  --no-mount home,cwd \
  --bind $SCRATCH/$SLURM_JOB_ID:/mnt \
  --containall \
  $PROJECT/cot-eval.sif bash -c "mkdir -p /mnt/cache/huggingface;cd /workspace/cot-eval;bash run.sh"

Misc

Build and push Docker image

git clone https://github.com/logikon-ai/cot-eval.git
cd cot-eval
docker build --no-cache -t cot-eval . 
docker login --username logikon
docker tag cot-eval logikon/cot-eval:latest
docker push logikon/cot-eval:latest

🙏 Built with

License

cot-eval is distributed under the terms of the MIT license.
