
How to replace PL-BERT with XPhoneBERT? #28

Closed
bharathraj-v opened this issue Nov 14, 2023 · 3 comments
Comments

@bharathraj-v

Hi,

I'm looking to generate Hindi audio, but it's mentioned that PL-BERT doesn't work well with other languages, and that I either need to train a different PL-BERT or replace the module with XPhoneBERT.

I'm having trouble understanding how to go about replacing the module with XPhoneBERT. The XPhoneBERT repository describes using the model through transformers, but I'm unsure how to apply that here, and this issue thread suggests that the pre-trained model is not public. So how do I go about replacing PL-BERT with XPhoneBERT here?

Thanks!

@bharathraj-v bharathraj-v changed the title How do I replace PL-BERT with XPhoneBERT? How to replace PL-BERT with XPhoneBERT? Nov 15, 2023
@yl4579
Owner

yl4579 commented Nov 15, 2023

Unfortunately this is not a straightforward replacement because the phonemizers used by PL-BERT and XPhoneBERT are quite different. You will have to re-train the text aligner (ASR) with the XPhoneBERT phonemizer and also prepare your data in that format; then you can replace PL-BERT with XPhoneBERT.
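To make the mismatch concrete, here is a minimal, hypothetical sketch of the two input conventions (the symbol table below is invented for illustration, not StyleTTS2's real one): a PL-BERT-style front end indexes individual IPA characters through a fixed symbol dictionary, while XPhoneBERT expects whitespace-separated phoneme tokens produced by its own phonemizer (text2phonemesequence).

```python
# Illustration only: this symbol table is invented for the sketch.
pl_bert_symbols = {c: i for i, c in enumerate("_ heloʊə")}

def pl_bert_token_ids(ipa_string):
    # Per-character lookup, as a PL-BERT-style front end does
    return [pl_bert_symbols[c] for c in ipa_string if c in pl_bert_symbols]

def xphonebert_tokens(phoneme_string):
    # XPhoneBERT-style input: one token per whitespace-separated phoneme
    return phoneme_string.split()

ipa = "h ə l oʊ"
print(pl_bert_token_ids(ipa.replace(" ", "")))  # one id per character: 'oʊ' becomes two ids
print(xphonebert_tokens(ipa))                   # ['h', 'ə', 'l', 'oʊ'] -- 'oʊ' stays one token
```

Because token boundaries (and therefore sequence lengths) differ between the two schemes, the alignments the text aligner learned over PL-BERT tokens no longer line up, which is why the aligner must be re-trained.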

@yl4579 yl4579 closed this as completed Nov 15, 2023
@yl4579
Owner

yl4579 commented Nov 15, 2023

The model is publicly available here: https://huggingface.co/vinai/xphonebert-base

@cmp-nct

cmp-nct commented Nov 20, 2023

The readme made it sound like a drop-in replacement ;-)

@yl4579
It would be nice to get a few more steps, given that many people have never trained any audio models. It's all a bit overwhelming.

Here is a new utils.py for XPhoneBERT that acts like the previous utils.py:

import torch.nn as nn
from transformers import AutoConfig, AutoModel


class CustomXPhoneBERT(nn.Module):
    """Wrapper that returns only the last hidden state, so XPhoneBERT can
    stand in where the PL-BERT module's output is expected."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, *args, **kwargs):
        # Call the underlying encoder's forward method
        outputs = self.model(*args, **kwargs)

        # Only return the last_hidden_state
        return outputs.last_hidden_state


def load_xbert(model_name_or_path):
    # Load the configuration for 'xphonebert-base'
    config = AutoConfig.from_pretrained(model_name_or_path)

    # Load the base encoder. Note: AutoModel, not AutoModelForMaskedLM --
    # the MLM head's output has no last_hidden_state, and subclassing an
    # Auto class doesn't work because from_pretrained never returns the
    # subclass, so the overridden forward would never run.
    xbert = AutoModel.from_pretrained(model_name_or_path, config=config)

    # Return the wrapped model
    return CustomXPhoneBERT(xbert)
Inference won't be as compatible, I guess. Here is the current inference code, which relies on the English-only BERT:
    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)
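One detail to watch when wiring XPhoneBERT into this loop: `length_to_mask` in the snippet returns True at padding positions, while Hugging Face models expect an `attention_mask` with 1 at real tokens, which is what `(~text_mask).int()` produces. A pure-Python sketch of that convention (list-based stand-ins for illustration, not the actual torch code):

```python
def length_to_mask(lengths):
    # StyleTTS2-style convention: True marks padding positions
    max_len = max(lengths)
    return [[pos >= n for pos in range(max_len)] for n in lengths]

def to_attention_mask(pad_mask):
    # Hugging Face convention: 1 = real token, 0 = padding,
    # i.e. what `(~text_mask).int()` computes on the torch side
    return [[0 if padded else 1 for padded in row] for row in pad_mask]

pad = length_to_mask([3, 5])
print(to_attention_mask(pad))  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

So the `attention_mask=(~text_mask).int()` argument should carry over to the XPhoneBERT call unchanged, as long as `tokens` are ids from XPhoneBERT's own tokenizer rather than the English-only symbol set.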
