A wrapper around lm-eval-harness and Unitxt for evaluating models served by a local inference endpoint.
- Python 3.9 or newer
- An OpenAI API-compatible inference server, such as vLLM (see the sketch after this list)
- A directory containing the necessary datasets for the benchmark (see example)
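If you do not already have an endpoint running, one option is to serve a model with vLLM. The command below is a minimal sketch: it assumes vLLM is installed, and the model name and port are illustrative (they match the example further down).

# Start an OpenAI API-compatible server on http://127.0.0.1:8000/v1 (assumes vLLM is installed)
vllm serve meta-llama/Llama-3.1-8B --port 8000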
# Create a Virtual Environment
python -m venv venv
source venv/bin/activate
# Install the package
pip install git+https://github.com/sjmonson/llm-eval-test.git
# View run options
llm-eval-test run --help
usage: llm-eval-test run [-h] [--catalog_path PATH] [--tasks_path PATH] [-v | -q]
                         -H ENDPOINT -m MODEL -t TASKS -d PATH [-b INT] [-o OUTPUT]

Run tasks

options:
  -h, --help            show this help message and exit
  --catalog_path PATH   unitxt catalog directory
  --tasks_path PATH     lm-eval tasks directory
  -v, --verbose         set loglevel to DEBUG
  -q, --quiet           set loglevel to ERROR
  -b INT, --batch_size INT
                        per-request batch size
  -o OUTPUT, --output OUTPUT
                        results output file

required:
  -H ENDPOINT, --endpoint ENDPOINT
                        OpenAI API-compatible endpoint
  -m MODEL, --model MODEL
                        name of the model under test
  -t TASKS, --tasks TASKS
                        comma separated list of tasks
  -d PATH, --datasets PATH
                        path to dataset storage
# Create dataset storage
DATASETS_DIR=$(pwd)/datasets
mkdir $DATASETS_DIR
# Download the MMLU-Pro dataset
DATASET=TIGER-Lab/MMLU-Pro
# Use your preferred method of downloading the dataset to $DATASETS_DIR/$DATASET
# For example, to use the huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download $DATASET --repo-type dataset --local-dir $DATASETS_DIR/$DATASET
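# Alternatively (a sketch, assuming git and git-lfs are installed), clone the dataset repo directly:
git clone https://huggingface.co/datasets/$DATASET $DATASETS_DIR/$DATASET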
# Run the benchmark
ENDPOINT=http://127.0.0.1:8000/v1/completions # An OpenAI API-compatible completions endpoint
MODEL_NAME=meta-llama/Llama-3.1-8B # Name of the model hosted on the inference server
llm-eval-test run --endpoint $ENDPOINT --model $MODEL_NAME --datasets $DATASETS_DIR --tasks mmlu_pro
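# Optional flags from `llm-eval-test run --help` above: write results to a file and set a
# per-request batch size. The batch size value is illustrative, and viewing the results with
# `python -m json.tool` assumes the output file is JSON, which the help text does not specify.
llm-eval-test run --endpoint $ENDPOINT --model $MODEL_NAME --datasets $DATASETS_DIR \
    --tasks mmlu_pro --batch_size 8 --output results.json
python -m json.tool results.json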