how can I convert fine-tuned ckpt to huggingface whisper model? #830

circuluspibo · 2023-01-08T09:22:40Z

Hello. I trained a Whisper base model for inference on Korean adult speech (.ckpt).
I would like to convert that model to a Hugging Face Whisper model for convenience, but I couldn't find a way to do that.
Does anyone know a solution?

kenkangg · 2023-01-11T04:32:05Z

I had a similar problem but in the reverse direction (hf -> whisper), but you should be able to reverse this process and apply it to your model.

I had to rename some of the layers of the HuggingFace model to Whisper's naming scheme.
Once I did that I was able to call .load_state_dict from the whisper model and load the renamed huggingface state_dict.
- You might need to mess around with the name mapping. During the last step (.load_state_dict), you'll likely see an error that lists layers from both models that do not have corresponding names. Just find the patterns in the names and add new rules to the mapping function until it successfully loads.

import re
import torch
import whisper

def hf_to_whisper_states(text):
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.fc3.', '.mlp.3.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    return text

# Load HF Model
MODEL_PATH = "pytorch_model.bin"  # placeholder: path to your fine-tuned HF checkpoint
hf_state_dict = torch.load(MODEL_PATH)    # pytorch_model.bin file

# Rename layers
for key in list(hf_state_dict.keys())[:]:
    new_key = hf_to_whisper_states(key)
    hf_state_dict[new_key] = hf_state_dict.pop(key)

# Init Whisper Model and replace model weights
whisper_model = whisper.load_model('large')
whisper_model.load_state_dict(hf_state_dict)
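
The trial-and-error step described above (reading the .load_state_dict error and adding rules) can be shortened by diffing the key names before loading. This is a minimal sketch, assuming you have the renamed keys and the target model's keys as plain iterables of strings. `diff_keys` is a hypothetical helper, not part of whisper or transformers, and the key names in the example are written by hand rather than read from a real checkpoint.

```python
from typing import Iterable, List, Tuple


def diff_keys(renamed: Iterable[str], target: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key names, mirroring what load_state_dict reports.

    missing: names the target model expects but the renamed dict lacks.
    unexpected: names in the renamed dict that the target model does not have.
    """
    renamed_set, target_set = set(renamed), set(target)
    missing = sorted(target_set - renamed_set)
    unexpected = sorted(renamed_set - target_set)
    return missing, unexpected


# Toy example with hand-written names (not read from a real checkpoint).
target_keys = ["decoder.ln.weight", "decoder.token_embedding.weight"]
renamed_keys = ["decoder.ln.weight", "decoder.token_embedding.weight", "proj_out.weight"]
missing, unexpected = diff_keys(renamed_keys, target_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With real models you would pass `hf_state_dict.keys()` and `whisper_model.state_dict().keys()` and keep adding rename rules until both lists are empty.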

8 replies

RonanKMcGovern Jan 17, 2024

Any updates on a cleaner way to convert from hf to whisper format?

Also, once in whisper format, can I just point to that model when calling whisper from the command line? Thanks

phineas-pta Jan 18, 2024

no and yes respectively

DiwakarBasnet Apr 12, 2024

I tried the solution mentioned above and got the following error. Does anyone know how to fix it?

RuntimeError: Error(s) in loading state_dict for Whisper:
        Unexpected key(s) in state_dict: "proj_out.weight".

DiwakarBasnet Apr 12, 2024

I solved it by adding the following line to the solution above:

 text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)

But is there a way to check that my fine-tuned weights were actually loaded, and that the model is not just using the stock Whisper weights?

phineas-pta Apr 12, 2024

u can save the model then load it
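
Another way to check is to compare the state dict of the stock model with the one you ended up with, tensor by tensor. This is a minimal sketch, assuming both are ordinary dicts of `torch.Tensor`. `changed_keys` is a hypothetical helper, and the two toy `nn.Linear` layers only stand in for the stock and fine-tuned Whisper models.

```python
from typing import Dict, List

import torch
from torch import nn


def changed_keys(base: Dict[str, torch.Tensor], tuned: Dict[str, torch.Tensor]) -> List[str]:
    """Names of tensors present in both dicts whose values differ."""
    return [
        name
        for name, tensor in base.items()
        if name in tuned and not torch.equal(tensor.cpu(), tuned[name].cpu())
    ]


# Toy stand-ins: two layers of the same shape with different random weights.
torch.manual_seed(0)
stock = nn.Linear(4, 2)
finetuned = nn.Linear(4, 2)
with torch.no_grad():
    finetuned.bias.copy_(stock.bias)  # leave the bias identical on purpose

diff = changed_keys(stock.state_dict(), finetuned.state_dict())
print(diff)
```

If the list comes back empty when you compare a freshly loaded stock model against your model after `load_state_dict`, the fine-tuned weights were probably not applied.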

AljoSt · 2023-02-08T10:07:38Z

here is a script for converting openai to hf: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/convert_openai_to_hf.py

0 replies

HarikalarKutusu · 2023-09-27T21:23:33Z

This was quite helpful, thank you. Do you have code to do the reverse, i.e. convert an HF Whisper saved model (.bin file) to Whisper?

I fine-tuned a whisper model with HF Whisper and exported it, and I want to use it in Whisper. It fails to load because it cannot get the "dims" key:

  File "D:\Anaconda\Anaconda3\envs\whisper\lib\site-packages\whisper\__init__.py", line 147, in load_model
    dims = ModelDimensions(**checkpoint["dims"])
KeyError: 'dims'

11 replies

phineas-pta Sep 29, 2023

whisper.load_model cannot load a fine-tuned HF model file directly; you must load the original model and then use load_state_dict to overwrite its weights
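
The traceback above shows load_model reading `checkpoint["dims"]`, and the conversion script later in this thread saves a dict with `dims` and `model_state_dict` keys. Based only on those two observations, here is a minimal sketch of a pre-check you could run on a loaded file before handing its path to whisper. `missing_whisper_keys` is a hypothetical helper and the sample dicts are made up.

```python
from typing import Any, List, Mapping

# Top-level keys an openai/whisper checkpoint file is assumed to carry,
# based on the traceback and the save call shown in this thread.
EXPECTED_KEYS = ("dims", "model_state_dict")


def missing_whisper_keys(checkpoint: Mapping[str, Any]) -> List[str]:
    """List the expected top-level keys that the loaded checkpoint lacks."""
    return [key for key in EXPECTED_KEYS if key not in checkpoint]


# A bare HF state dict maps parameter names straight to tensors (values elided here).
hf_style = {"model.encoder.conv1.weight": None, "proj_out.weight": None}
# A Whisper-style file wraps the state dict together with the model dimensions.
whisper_style = {"dims": {"n_mels": 80}, "model_state_dict": {}}

print(missing_whisper_keys(hf_style))
print(missing_whisper_keys(whisper_style))
```

A non-empty result for a file saved by HF means it is a bare state dict, which matches the KeyError: 'dims' above.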

HarikalarKutusu Sep 29, 2023

Oh my. I totally misunderstood the mechanism. Thank you! I'll try that...

HarikalarKutusu Sep 29, 2023

I had to add the following and it is running a long multi-processing inference now:

    .replace("proj_out.weight", "decoder.token_embedding.weight")

I did not use the code here , I don't know if it matters or what it does, but it seems like an irreversible calculation.

jhdeov Apr 3, 2024

I get the following error:

in load_whisper_model
log(DEBUG, hf_state_dict)
NameError: name 'log' is not defined

HarikalarKutusu Apr 4, 2024

Yeah, it is the Python logger I imported at the top of the code to debug-log stuff in a long running process.
Just comment that line out (put a # in front of it).

sdugoten · 2023-09-29T13:37:45Z

Here are the relevant functions:

#
# Whisper Model Loader
#
import torch
import whisper

# Replaces HF/Whisper model's layers/parameter names to be compatible with openai/whisper
def hf_to_whisper_states(text): return (text
    .replace("model.", "")
    .replace("layers", "blocks")
    .replace("fc1", "mlp.0")
    .replace("fc2", "mlp.2")
    .replace("final_layer_norm", "mlp_ln")
    .replace(".self_attn.q_proj", ".attn.query")
    .replace(".self_attn.k_proj", ".attn.key")
    .replace(".self_attn.v_proj", ".attn.value")
    .replace(".self_attn_layer_norm", ".attn_ln")
    .replace(".self_attn.out_proj", ".attn.out")
    .replace(".encoder_attn.q_proj", ".cross_attn.query")
    .replace(".encoder_attn.k_proj", ".cross_attn.key")
    .replace(".encoder_attn.v_proj", ".cross_attn.value")
    .replace(".encoder_attn_layer_norm", ".cross_attn_ln")
    .replace(".encoder_attn.out_proj", ".cross_attn.out")
    .replace("decoder.layer_norm.", "decoder.ln.")
    .replace("encoder.layer_norm.", "encoder.ln_post.")
    .replace("embed_tokens", "token_embedding")
    .replace("encoder.embed_positions.weight", "encoder.positional_embedding")
    .replace("decoder.embed_positions.weight", "decoder.positional_embedding")
    .replace("layer_norm", "ln_post")
)

def load_whisper_model(MODEL_PATH: str, cache_dir: str, USE_GPU: bool) -> whisper.Whisper:
    """Loads fine-tuned HF whisper model and converts to openai/whisper for inference"""
    DeviceMode: str = "cuda" if USE_GPU and torch.cuda.is_available() else "cpu"
    # Load HF Model
    hf_state_dict = torch.load(MODEL_PATH, map_location="cpu")    # pytorch_model.bin file created by HF Whisper
    # log(DEBUG, hf_state_dict)
    # Rename layers
    for key in list(hf_state_dict.keys())[:]:
        new_key = hf_to_whisper_states(key)
        hf_state_dict[new_key] = hf_state_dict.pop(key)
    log(DEBUG, hf_state_dict)  # log/DEBUG come from the author's own logging setup; remove this line if you don't define them
    model: whisper.Whisper = whisper.load_model(name=MODEL_PATH, device=DeviceMode, download_root=cache_dir)
    model.load_state_dict(hf_state_dict)
    return model

My code was using multiprocessing (futures), running on CPU and/or GPU, but to get rid of this error I converted it to a single-core version. I already have these:

SET CUDA_VISIBLE_DEVICES=0
SET TRANSFORMERS_OFFLINE=1
...
    gc.collect()
    torch.cuda.empty_cache()

I also have decision logic elsewhere to differentiate original models from fine-tuned ones. What I'm trying to do is a round-up of the accuracy gains of different splitting algorithms on Common Voice datasets (many languages, many splitting algorithms, CPU and/or GPU, real-time factors, etc.), collecting the results with jiwer into a table.

I appreciate your help in advance.

HarikalarKutusu, if you finally get it to work, please do share the code you use. I want to convert this model for use with faster-whisper, but none of the scripts I found on the net work:

https://huggingface.co/simonl0909/whisper-large-v2-cantonese

Sorry, I am not a programmer, so I have to find a script I can use; I can't write it on my own. Thanks.

1 reply

phineas-pta Sep 29, 2023

Converting to faster-whisper is straightforward; your problem is with the Python package. I answered in OpenNMT/CTranslate2#1503

jaggzh · 2023-11-06T05:36:20Z

I don't get a pytorch_model.bin file. I'm using:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py

...and I get these files out of it. I want to convert to a Whisper-compatible .pt file but I'm not yet sure how.


     1734 Nov  5 17:12 README.md
    34604 Nov  5 17:12 added_tokens.json
      388 Nov  5 17:12 all_results.json
     1291 Nov  5 17:12 config.json
      219 Nov  5 17:12 eval_results.json
     3716 Nov  5 17:12 generation_config.json
   493864 Nov  5 17:12 merges.txt
151061672 Nov  5 17:12 model.safetensors
    52666 Nov  5 17:12 normalizer.json
      339 Nov  5 17:12 preprocessor_config.json
     4096 Nov  5 17:12 runs
  2842624 Nov  5 17:19 small.pt
     2077 Nov  5 17:12 special_tokens_map.json
  2481187 Nov  5 17:12 tokenizer.json
   282682 Nov  5 17:12 tokenizer_config.json
      190 Nov  5 17:12 train_results.json
      763 Nov  5 17:12 trainer_state.json
     4728 Nov  5 17:12 training_args.bin
   835550 Nov  5 17:12 vocab.json

2 replies

jaggzh Nov 6, 2023

@sdugoten
Gonna direct this at you (I hope you don't mind). How do you go about loading the HF model in order to run your hf_to_whisper_states() on it? (See my prior post for the files I'm getting out of the HF seq2seq training).

HarikalarKutusu Jan 11, 2024

@jaggzh, HF introduced Safe Tensors and it became the default, so no pytorch_model.bin is output, but a model.safetensors file. If you want the original one, use:

        model.save_pretrained(
            save_directory="/path/to/your/dir",
            ...
            safe_serialization=False,  # <= THIS ONE
        )

HajarMazaheri · 2024-01-11T13:18:48Z

Hi
I fine-tuned the base Whisper model using the SpeechBrain toolkit. The results are brain.ckpt, counter.ckpt, dataloader-TRAIN.ckpt, optimizer.ckpt, scheduler_whisper.ckpt, and whisper.ckpt. I want to convert that model to a Hugging Face Whisper model so that I can use it with the code at https://huggingface.co/blog/fine-tune-whisper.
What solution do you suggest?

0 replies

ndunks · 2024-07-12T02:10:29Z

Complete script for converting HF model to Whisper:

#!/usr/bin/env python3
import whisper
import re
import torch

def hf_to_whisper_states(text):
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.fc3.', '.mlp.3.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)
    return text

# Load HF Model
hf_state_dict = torch.load("whisper-medium-id.bin", map_location=torch.device('cpu'))    # pytorch_model.bin file

# Rename layers
for key in list(hf_state_dict.keys())[:]:
    new_key = hf_to_whisper_states(key)
    hf_state_dict[new_key] = hf_state_dict.pop(key)

model = whisper.load_model('medium')
dims = model.dims
# Save it
torch.save({
    "dims": model.dims.__dict__,
    "model_state_dict": hf_state_dict
}, "whisper-model.bin")

6 replies

rswilem Aug 20, 2024

@FancyCodeMaster On my end that gives me

RuntimeError: Model whisper_model.bin not found; available models = ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large']

So I'm still kind of confused. 😅

ndunks Aug 22, 2024

whisper_model.bin needs to be created first. The file is generated after you run the "script for converting HF model to Whisper".

rswilem Aug 22, 2024

Yes, I got that. But whisper_model.bin is an invalid value for whisper.load_model, even when I put it in the same directory. I got it working as follows, though:

model = whisper.load_model("tiny")

# Loading our checkpoint state dict here.
finetune_model = WhisperForConditionalGeneration.from_pretrained("./checkpoint-2000")
model.load_state_dict(finetune_model.model.state_dict())
result = model.transcribe("audio.mp3", language="nl")

So what happens here is that I load the tiny model, then align the state dict with my checkpoint.

ndunks Aug 23, 2024

Make sure your file name is correct, whisper_model.bin is not same as whisper-model.bin (dash vs underscore).

rswilem Aug 23, 2024

Wow, I missed that completely, thank you. That seems to work indeed.

ousaf66 · 2024-10-07T01:20:26Z

How do I handle ['proj_out.weight']?

5 replies

HarikalarKutusu Oct 7, 2024

I had that answered above:

I had to add the following and it is running a long multi-processing inference now:
    .replace("proj_out.weight", "decoder.token_embedding.weight")
I did not use the code here , I don't know if it matters or what it does, but it seems like an irreversible calculation.

ousaf66 Oct 7, 2024

import os
import pandas as pd
from datasets import DatasetDict, Dataset, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer
import torch
import re
from typing import List, Dict, Union, Any
from transformers import WhisperProcessor, WhisperForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments
import evaluate

# Define paths
csv_file_path = '/content/drive/MyDrive/dataset/overview-of-recordings.csv'  # Your CSV file path
base_train = "/content/drive/My Drive/dataset/Train1"  # Training audio files
base_validate = "/content/drive/My Drive/dataset/validate1"  # Validation audio files

# Load the CSV file
data = pd.read_csv(csv_file_path)
print("Initial CSV data:")
print(data.head())

# Function to filter dataset based on available audio files
def filter_available_files(df, base_path):
    available_files = set(os.listdir(base_path))
    print(f"Available files in {base_path}: {available_files}")
    return df[df['file_name'].isin(available_files)]

# Filter rows where the audio files exist
train_data = filter_available_files(data, base_train)
validate_data = filter_available_files(data, base_validate)

# Create a DatasetDict
common_voice = DatasetDict()
common_voice['train'] = Dataset.from_pandas(train_data[['file_name', 'phrase']])
common_voice['test'] = Dataset.from_pandas(validate_data[['file_name', 'phrase']])

# Function to construct full audio paths
def construct_audio_path(batch, base_train, base_validate):
    train_files = os.listdir(base_train)
    validate_files = os.listdir(base_validate)

    if batch["file_name"] in train_files:
        batch["audio_path"] = os.path.join(base_train, batch["file_name"])
    elif batch["file_name"] in validate_files:
        batch["audio_path"] = os.path.join(base_validate, batch["file_name"])
    else:
        batch["audio_path"] = None
    return batch

# Apply the function to construct full paths
common_voice["train"] = common_voice["train"].map(lambda batch: construct_audio_path(batch, base_train, base_validate))
common_voice["test"] = common_voice["test"].map(lambda batch: construct_audio_path(batch, base_train, base_validate))

# Cast 'audio_path' as an Audio feature with specified sampling rate
common_voice = common_voice.cast_column("audio_path", Audio(sampling_rate=16000))

# Load the Whisper feature extractor and tokenizer
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="English", task="transcribe")

# Function to prepare the dataset by extracting features and labels
def prepare_dataset(batch):
    if batch["audio_path"] is not None:
        batch["input_features"] = feature_extractor(batch["audio_path"]["array"], sampling_rate=16000).input_features[0]
        batch["labels"] = tokenizer(batch["phrase"]).input_ids
    return batch

# Process the datasets
train_dataset = common_voice["train"].map(
    prepare_dataset,
    remove_columns=common_voice["train"].column_names,
    num_proc=2
)

test_dataset = common_voice["test"].map(
    prepare_dataset,
    remove_columns=common_voice["test"].column_names,
    num_proc=2
)

# Loading the processor and model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Set the generation config for English transcription
model.generation_config.language = "english"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

# Define your data collator class
from dataclasses import dataclass  # needed for the @dataclass decorator below

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __init__(self, processor: Any, decoder_start_token_id: int):
        self.processor = processor
        self.decoder_start_token_id = decoder_start_token_id

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Pad the audio features to a common length
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label token ids to a common length
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding with -100 so it is ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Drop the start token if every sequence already begins with it
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

# Compute WER metric
metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

training_args = Seq2SeqTrainingArguments(
    output_dir="content/drive/My Drive/whispermodel",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1,
    eval_steps=1,
    logging_steps=5,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

# Create an instance of the data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()

answer
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:295: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].
TrainOutput(global_step=5, training_loss=1.2975360870361328, metrics={'train_runtime': 190.5231, 'train_samples_per_second': 0.42, 'train_steps_per_second': 0.026, 'total_flos': 1.4429270016e+16, 'train_loss': 1.2975360870361328, 'epoch': 5.0})

This is the result I am getting. I am not cloning any repository, so I don't know where to make this change, and I'm getting confused.
Some people have also used save_safetensors=False; I don't know what it does.

HarikalarKutusu Oct 7, 2024

There were missing keys in the checkpoint model loaded: ['proj_out.weight'].

I gave the answer above, add that to the conversion mapping

some people have also use this approach. save_safetensors=False, dont know what it does

I also answered that here:
#830 (reply in thread)

What other problem are you facing? What is confusing you?

ousaf66 Oct 7, 2024

Which of the two approaches is better? And where in my code should I put replace("proj_out.weight", "decoder.token_embedding.weight")?

HarikalarKutusu Oct 8, 2024

which approach is better between two.

For new code: Safetensors, it is better for the future. I used the old one because I needed compatibility in my toolchain at that time.

where should i replace

This whole thread is regarding whisper-HF style model layer conversions, to and from. E.g. if you get a Whisper model, fine-tune it in HuggingFace and want to use the resultant model back in Whisper inference, you would need the conversions.

If you need these, define two functions whisper2hf and hf2whisper like we did above to map the layers. Please read the whole thread.
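
A minimal sketch of such a pair of functions, built from the rename rules posted earlier in this thread. It uses plain `str.replace` on one ordered table so the same table drives both directions. It is not the official converter, it has only been checked against the sample names below, and it leaves proj_out.weight alone (the thread handles that key with an extra rule). The names `PAIRS` and `HF_PREFIX` are made up for this example.

```python
from typing import List, Tuple

# (Hugging Face fragment, openai/whisper fragment); more specific names come first.
PAIRS: List[Tuple[str, str]] = [
    (".self_attn_layer_norm.", ".attn_ln."),
    (".encoder_attn_layer_norm.", ".cross_attn_ln."),
    (".final_layer_norm.", ".mlp_ln."),
    (".self_attn.q_proj.", ".attn.query."),
    (".self_attn.k_proj.", ".attn.key."),
    (".self_attn.v_proj.", ".attn.value."),
    (".self_attn.out_proj.", ".attn.out."),
    (".encoder_attn.q_proj.", ".cross_attn.query."),
    (".encoder_attn.k_proj.", ".cross_attn.key."),
    (".encoder_attn.v_proj.", ".cross_attn.value."),
    (".encoder_attn.out_proj.", ".cross_attn.out."),
    (".fc1.", ".mlp.0."),
    (".fc2.", ".mlp.2."),
    (".layers.", ".blocks."),
    ("encoder.layer_norm.", "encoder.ln_post."),
    ("decoder.layer_norm.", "decoder.ln."),
    (".embed_tokens.", ".token_embedding."),
    (".embed_positions.weight", ".positional_embedding"),
]
HF_PREFIX = "model."


def hf2whisper(key: str) -> str:
    """Rename one Hugging Face parameter name to the openai/whisper scheme."""
    if key.startswith(HF_PREFIX):
        key = key[len(HF_PREFIX):]
    for hf_part, whisper_part in PAIRS:
        key = key.replace(hf_part, whisper_part)
    return key


def whisper2hf(key: str) -> str:
    """Rename one openai/whisper parameter name to the Hugging Face scheme."""
    for hf_part, whisper_part in PAIRS:
        key = key.replace(whisper_part, hf_part)
    return HF_PREFIX + key


# Hand-written sample names used to exercise both directions.
sample = [
    "model.encoder.conv1.weight",
    "model.encoder.layers.0.self_attn.q_proj.weight",
    "model.encoder.layers.0.self_attn_layer_norm.bias",
    "model.decoder.layers.3.encoder_attn.k_proj.weight",
    "model.decoder.layers.3.final_layer_norm.weight",
    "model.decoder.embed_positions.weight",
    "model.encoder.layer_norm.bias",
]
for name in sample:
    print(name, "->", hf2whisper(name))
```

To convert a whole checkpoint you would build a new dict with `{hf2whisper(k): v for k, v in state_dict.items()}` (or `whisper2hf` for the other direction) and then check the result against the target model's key names before loading.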