Warning: This work is currently in progress.
This repository contains training code for the XEUS model for Automatic Speech Recognition (ASR). This is a fork of https://github.com/pashanitw/xeus-finetune
Requirements:
- python3.11, python3.11-dev
- build-essential, cmake
- uv
- git-lfs
Note: Python 3.12 cannot be used because one of ESPnet's dependencies relies on an outdated package.
# create and activate a virtual environment
uv venv --python 3.11
source .venv/bin/activate
# install espnet
git clone --branch ssl --depth 1 https://github.com/wanchichen/espnet espnet-code
cd espnet-code
git fetch --unshallow
uv pip install -e .
# return to the repository root
cd ..
# download XEUS checkpoint
git clone https://huggingface.co/espnet/XEUS
# install required packages
uv pip install -r requirements.txt
# for development, install additional packages
uv pip install -r requirements-dev.txt
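Since the setup targets Python 3.11 and Python 3.12 is known not to work, it can help to confirm which interpreter the activated environment uses. The snippet below is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of this repository, and it assumes that 3.11 is the only supported version, as the requirements list suggests.

```python
import sys


def is_supported_python(version=None):
    """Return True if the (major, minor) version is 3.11.

    Assumption: only Python 3.11 is supported, as listed in the requirements.
    """
    if version is None:
        # Default to the interpreter that runs this script.
        version = sys.version_info[:2]
    return tuple(version) == (3, 11)


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {major}.{minor} is {status}")
```

Run it inside the activated `.venv` to check the interpreter before installing ESPnet.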
- Authenticate with Hugging Face (HF)
huggingface-cli login
- Copy a config file, then edit its dataset sections and hyperparameters
cp configs/hi_hf.yaml configs/uk_hf.yaml
- Start fine-tuning
accelerate launch finetune.py --config configs/uk_hf.yaml
# if you want to use only one GPU
accelerate launch --num_processes 1 finetune.py --config configs/uk_hf.yaml
Run the following command to transcribe an audio file:
python inference.py --ckpt_path <checkpoint path> --audio audio.wav
# example
python inference.py --ckpt_path ./step_2000 --audio audio.wav
Run the following command to calculate Word Error Rate:
python eval.py --ckpt_path <checkpoint path> --dataset <dataset> --name <subset> --split <split>
# example
python eval.py --ckpt_path ./step_2000 --dataset mozilla-foundation/common_voice_17_0 --name uk --split test
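For reference, Word Error Rate is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that definition only; it is not the code in eval.py, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.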
Check and format the code:
ruff check
ruff format
TODO:
- Enable Flash-Attention for training
- Set a cache_dir for load_dataset
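The cache_dir item refers to the `cache_dir` argument of `load_dataset` in the Hugging Face `datasets` library, which controls where downloaded data is stored. The sketch below shows one way it could be passed; the helper name `load_with_cache` and the default path `./hf_cache` are hypothetical and not part of this repository.

```python
def load_with_cache(dataset: str, name: str, split: str, cache_dir: str = "./hf_cache"):
    """Load a dataset split, storing downloads under cache_dir.

    Assumption: the `datasets` package is installed and, for gated
    datasets, authentication has already been completed.
    """
    # Imported lazily so that the module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(dataset, name, split=split, cache_dir=cache_dir)
```

The arguments mirror the `--dataset`, `--name`, and `--split` options of eval.py, so a call could look like `load_with_cache("mozilla-foundation/common_voice_17_0", "uk", "test")`.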