Alltalkbeta #288

Merged: 9 commits merged into erew123:alltalkbeta on Oct 20, 2024

Conversation

IIEleven11

You'll see two scripts. compare_and_merge.py and expand_xtts.py.

I didn't do any integration with AllTalk, so these scripts are capable of running as-is, standalone.

Steps to use:

  1. Run start_finetune and check the "bpe_tokenizer" box to train a new tokenizer during transcription
  2. Begin transcription
  3. When transcription is complete you will have a bpe_tokenizer-vocab.json
  4. Open compare_and_merge.py and fill in the file paths for the base model files and the new vocab.
  5. Run compare_and_merge.py
  6. You now have an expanded_vocab.json.
  7. Open expand_xtts.py and fill in the file paths
  8. Run expand_xtts.py

You now have an expanded base XTTSv2 model, "expanded_model.pth", and its paired "expanded_vocab.json".
The base XTTSv2 model needs to be removed from "/alltalk_tts/models/xtts/xttsv2_2.0.3/model.pth".
The base "vocab.json" needs to be removed from "/alltalk_tts/models/xtts/xttsv2_2.0.3/vocab.json".
Place "expanded_model.pth" and "expanded_vocab.json" in place of the removed base model/vocab at "/alltalk_tts/models/xtts/xttsv2_2.0.3/" and rename them to "model.pth" and "vocab.json".

That's it; you can now begin fine-tuning.
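For anyone who wants to see what the merge step boils down to conceptually, here is a rough sketch. This is not the actual compare_and_merge.py; it assumes the vocab files are HuggingFace-tokenizers style JSON with a "model" -> "vocab" token-to-id map, and it ignores merges and special tokens, which the real script has to handle.

# Rough sketch only (not the actual compare_and_merge.py): append any token that
# exists in the newly trained tokenizer but not in the base vocab, giving it the
# next free id. Assumes both files are HuggingFace-tokenizers style JSON with a
# "model" -> "vocab" token-to-id map; merges and special tokens are ignored here.
import json

BASE_VOCAB = "vocab.json"                # assumed path to the base model's vocab
NEW_VOCAB = "bpe_tokenizer-vocab.json"   # produced during transcription (step 3)
OUT_VOCAB = "expanded_vocab.json"

with open(BASE_VOCAB, encoding="utf-8") as f:
    base = json.load(f)
with open(NEW_VOCAB, encoding="utf-8") as f:
    new = json.load(f)

base_vocab = base["model"]["vocab"]
next_id = max(base_vocab.values()) + 1

for token in new["model"]["vocab"]:
    if token not in base_vocab:          # only genuinely new tokens get appended
        base_vocab[token] = next_id
        next_id += 1

with open(OUT_VOCAB, "w", encoding="utf-8") as f:
    json.dump(base, f, ensure_ascii=False, indent=2)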

You'll find each file commented with more detail about what's going on. Finetune.py had an edit I was using to rotate the port: when using an online instance, if I have to end the script, the port can linger blocked, which causes the script to fail and means I have to go in and change the port. Setting a range from port # to port # fixes that issue. But I removed it, as it's beyond the scope of this specific PR. I can send it in another if that's something you want to implement.

@IIEleven11
Author

Ignore my finetune.py script changes. I reverted them.

So this solution worked with no slurred speech and no accent with the 2.0.2 model. I believe the accent with the 2.0.3 model was inherent to the base model and not specific to this solution.

You'll see a new custom_tokenizer.py. This script needs a txt file that's been run through the extract_dataset_for_tokenizer.py script, which removes the first and third columns from the CSVs. The output will be your new custom dataset's vocab.json. Use this with the compare_and_merge script, then the expand_xtts script, and begin training.
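(For context, the extraction step is essentially just keeping the transcription column. A rough sketch, not the actual extract_dataset_for_tokenizer.py; it assumes the usual pipe-delimited audio|text|speaker_name metadata layout:)

# Rough sketch only: drop the first (audio path) and third (speaker) columns from
# the pipe-delimited metadata CSVs, leaving one transcription per line as a plain
# text corpus for tokenizer training. File paths and column layout are assumptions.
import pandas as pd

CSV_FILES = ["metadata_train.csv", "metadata_eval.csv"]   # assumed locations
OUTPUT_TXT = "tokenizer_corpus.txt"

lines = []
for path in CSV_FILES:
    df = pd.read_csv(path, sep="|")
    lines.extend(df.iloc[:, 1].astype(str).tolist())       # keep only the text column

with open(OUTPUT_TXT, "w", encoding="utf-8") as f:
    f.write("\n".join(lines))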

As for the 2.0.3 model, it remains unknown, and I fear it will always remain that way, as Coqui has exited the party. So it might be wise to revert the default model from 2.0.3 to 2.0.2.

I had to do a lot of learning here with these, so I am cautious and open to the possibility I missed something, especially with the creation of the new tokenizer. So if anyone has anything to point out, please do.

@erew123
Owner

erew123 commented Aug 3, 2024

Hi @IIEleven11

Sorry it's taken a while to respond; some days I'm busy elsewhere, and some days I wake up to 10+ messages to deal with before I even get to look at anything.

If I'm interpreting what you've said correctly, it works fine on the 2.0.2 model, but 2.0.3 goes a bit funny. The only difference I know of with the 2.0.3 model is that they introduced two new languages, which I think were Hungarian and Korean https://docs.coqui.ai/en/latest/models/xtts.html#updates-with-v2

But actually, they added three languages. Hindi was added too, but not documented anywhere apart from here https://huggingface.co/coqui/XTTS-v2#languages (that I ever found).

As there is no difference in the training setup between the models (that I know of), do you think there might be something different in the config.json or vocab.json that makes 2.0.3 funny to train?

Apologies for the questions; I'm just digging into the knowledge you've gained and wondering if I can think of anything that may help solve the puzzle.

That aside, thanks for all your work on this! I will test it soon. :)

@IIEleven11
Author

Yeah so check coqui-ai/TTS#3309 (comment).
They do acknowledge there was some fallback, specifically when adding new languages/speakers.

I am curious what would happen if we removed the non-English tokens from the vocab.json; they take up a very large amount of space. I would think it would allow for more English vocabulary and therefore a better English-speaking model. It will incur many requests asking for multilingual support, though.

The configs and vocabs for each version of the model are different: the 2.0.2 vocab has a smaller size and a smaller embedding layer. So they aren't compatible for inference or training without adjusting the architecture of the model.

There are a couple of other fine-tuning webuis that also default to 2.0.2; Daswer's fine-tuning webui, for example.

But yeah, more testing of course. I only used it with a single dataset. I think letting the community go at it would be a good solution for now, as we can only really confirm with more testing. We are somewhat working blind with whatever information Coqui left behind.

@erew123
Owner

erew123 commented Aug 4, 2024

I can tell you why we both used the 2.0.2 model at the time of creating the interfaces. The 2.0.3 model had something bad/wrong released in the model's configuration (or something) that created very, very bad audio. The solution back then was to use 2.0.2, and Coqui did resolve 2.0.3 eventually; however, it was just easier to stick with 2.0.2 at the time rather than re-code.

@IIEleven11
Author

Ahh, I did see your comment back then, yeah. The accent within the voice could very well have been an error somewhere on my part; I don't want to remove that from the equation.

The 2.0.3 model has pros and cons. I think it has a greater ability to meet a wider range of people's needs than 2.0.2 because it does have a slightly bigger vocab. But this means its potential is possibly lesser than 2.0.2's.

The big reason I'm hesitant to provide what I did to remove all but the English tokens in the vocab.json is that I am not confident I completely understood all the changes I made. While it did most certainly work, for some of it I just said "that looks right" and moved on. Training models is really complex, and I just want to make sure I'm not providing code that will give someone a harder time due to my ignorance.

@erew123
Owner

erew123 commented Aug 10, 2024

Hi @IIEleven11, hope you are keeping well. Apologies for not catching up with you; it's been a busy week for me with quite a few requests/issues on lots of things.

Thanks for the updates above. Do you think it's now time for me to merge/test this out?

Thanks

@IIEleven11
Author

Yeah, I would really love it if another developer would look into it with me. I've been trying to essentially reverse engineer Coqui's code and would love another mind to collaborate with.

I have tested it a few more times since then. Adding vocabulary works as expected.

One thing, though: I am trying to add a new special token, which is proving to be a bit more nuanced.

I would guess most users don't try to do this though, so it shouldn't be a problem for now.

@IIEleven11
Author

I also saw you were deep into the conversation at one point in some really old commits. Do you know anything about the loss of the ability to prompt-engineer the model between Tortoise and XTTS?

Things like "[joy] it's nice to meet you!" would generate an emotional, joyous sentence. Tortoise can do it. The XTTSv2 paid API could do it. But now we can't do it.

This is what I've been trying to solve. It would appear they removed this functionality from the open source versions. And because the Tortoise and XTTS models are nearly identical, I believe we could put the pieces together to get it back.

@erew123
Owner

erew123 commented Aug 11, 2024

Hi @IIEleven11 Spent my morning cleaning up after spilling coffee all over my desk, computer, keyboard, wall, floor, etc.... :/ so I lost a few hours of my day where I was hoping to respond properly, look into a few things, etc. How annoying!

Anyway, first off, I found this conversation earlier: coqui-ai/TTS#3704. I wonder if that may be of interest?

As for emotions, I didn't know they HAD implemented them at some point in the past, but it must have been on the roadmap according to this coqui-ai/TTS#3255, and I can see it on the roadmap coqui-ai/TTS#378 as "Implement emotion and style adaptation" in the as-yet-uncompleted "milestones along the way".

To add to all this, eginhard https://github.com/eginhard is currently maintaining TTS and the Coqui scripts. He is not someone who worked for Coqui (as I understand it); he is just passionate about TTS and the Coqui model. He also appears to be doing quite a bit of work on the trainers/finetuning https://github.com/idiap/coqui-ai-TTS/commits/dev/ (yet to be released). I'm not sure how involved he may want to be with another project, but I suspect he knows quite a bit about the trainer and has probably figured out quite a bit about the model. Maybe he might be a good person for us to ask a few questions (should he have time). I suppose we could pose any questions there, if you agree that could be a good path?

@IIEleven11
Author

Awesome! Thanks for the leads. Yeah that's a good idea.

I did just make a breakthrough though that kind of confirms some of my theories.

I trained an xttsv2 model that can whisper using a custom special token "[whisper]". So I think this means that we can technically make any special token including for emotions.

The only difference is that Tortoise can just do many emotions, and those tokens are nowhere to be found within its vocab.json, yet it knows exactly how to handle them.

Anyway, my conclusion with this new tokenizer is that if people want to train new vocabulary, they need a significant amount of data. 4 or 5 hours only works partially: the model will lose the ability to generate certain sounds while gaining the ability to say others. This is negated with more data. It looks like somewhere around 15 to 20 hours, give or take, would be more ideal.

@erew123
Owner

erew123 commented Aug 13, 2024

Wow! Training it to emote, that's pretty cool!

Re your conclusion though, that sounds similar to what I read about training an entirely new language into the model, without fully training all other languages at the same time. I imagine you need a hell of a lot of compute to build out a base model for this.

@erew123
Owner

erew123 commented Aug 27, 2024

Hi @IIEleven11, hope you are well. Apologies again; I'm struggling to get near code/deal with support at the moment. I don't want to air my life on the internet; however, for the past few months I have had an ongoing situation that has me traveling and away from my own home and computer, providing help/care for a family member.

If you feel this should be merged in, I am happy to do so, as long as you feel it's bug-free. I can give it a run-through when possible and check all works.

If there is anything specific you would like me to look at or help you figure out, please give me a list of items and I will try to do so.

I will get to it as soon as I can.

All the best

@IIEleven11
Author

Oh sorry, actually I have an update for it that solves the model losing the ability to speak specific words: we need to freeze the embedding layers of the base model prior to training. After I push that, you could merge it, but it isn't integrated into your webui, so if anyone wants to use the process they would need to run each script on its own. I could maybe work on integrating it with your code; I don't expect it to be too difficult (famous last words). I am just swamped with clients at the moment and am about to release my own personal project. If I can get to it though, I will.
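(In PyTorch terms the freezing comes down to toggling requires_grad. A rough sketch only, not the exact code in the scripts; it uses the gpt.text_embedding / gpt.text_head names that appear in the XTTS checkpoint, and follows the fuller "freeze the base model except the embedding layers" description given further down this thread:)

# Rough sketch only: freeze every parameter of the loaded XTTS model except the
# text embedding and text head, so only those layers move during training.
# The layer names are taken from the XTTS state_dict keys; adjust if yours differ.
def set_trainable_layers(model, trainable_keys=("text_embedding", "text_head")):
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keys)
    frozen = sum(1 for p in model.parameters() if not p.requires_grad)
    print(f"Froze {frozen} parameter tensors; {trainable_keys} left trainable.")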

@erew123
Owner

erew123 commented Sep 22, 2024

Hi @IIEleven11

Hope you are keeping well! :)

I'm back for a few days before heading off again. Sorry I haven't gotten around to this. Turns out when you go away for a while, there is quite a backlog of things to deal with when you return!

Should I be pulling this merge in now and sending it live?

Thanks

@IIEleven11
Author

Sorry, yeah, I've been busy too. So I have done quite a bit of testing and the results are good. The asterisk, though, is that I did it with English and a single speaker; there will most certainly be nuances when fine-tuning with a different language. Also, I still haven't incorporated it into your interface. It's going to require a little bit of shuffling around and choosing which base model.

But if you do want to merge it and let people who are capable use the scripts as standalones for now, it should be fine. Maybe make a quick note in the UI that this whole process can still be a bit difficult to grasp. I tried to make it as automatic as possible, but the quality of their results is still going to depend on their dataset and how they curated it. I would maybe point them to this video first, so they get a grasp of what they're actually doing: https://youtu.be/zduSFxRajkE?si=K2NF8V1wrR_RTfWH

@erew123
Owner

erew123 commented Oct 3, 2024

@IIEleven11 Still not had an opportunity to pull this in, test, etc. I'm still bouncing about like a ping-pong ball with my unwell family situation. What I have at least managed to do (without my main computer) is write a hell of a load of documentation on the wiki, to try to keep my requests for information/support down. https://github.com/erew123/alltalk_tts/wiki

I'm intending to pull down your updates on the finetuning, and I'll also do a larger section of the wiki on XTTS finetuning (probably mostly pulled from old written content and what's in finetuning, as well as linking to that video you gave above). If there is anything else you think I should include, LMK.

Honestly, sorry, and sorry for not pulling this in yet. It's just a case of getting time to properly test it, and as soon as I'm away for X days, I come back to 20+ emails from people on here (hence deciding it's time to write the wiki). I will get there, promise!!

@Mixomo

Mixomo commented Oct 6, 2024

Hello @IIEleven11 I'm moving my question from #362 to here.

Before proceeding with the question, I have read this thread and saw that you put some instructions at the beginning, and I don't know if they still apply.
On the other hand, I know that progress is being made on the tokenizer part, so no need for a quick reply, I'll just leave my message here so I can keep track of progress and future PRs and merges related to this topic.


My question is not about how it works per se, but whether AllTalk actually uses the trained BPE tokenizer at inference, or embeds it somehow in the vocab.json or in the weights?

From what I was seeing, at fine-tuning time AllTalk always uses the vocab.json of the base model (original or custom), and if I then manually point inference at the path of the custom BPE vocab, it gives me a mismatch error.
I don't even know if the reasoning I am trying to apply is correct.

Thank you very much in advance.

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Xtts:
        size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([1431, 1024]).
        size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([1431, 1024]).
        size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6681]) from checkpoint, the shape in current model is torch.Size([1431]).

image

The trained tokenizer:
bpe_tokenizer-vocab.json

The used tokenizer:
vocab.json

P.S:
And it is not because of the file names: using bpe_tokenizer-vocab.json, or renaming it to vocab.json, gives the same error.

@IIEleven11
Author

If you clone the branch I used to send the PR then those scripts should work for you.

As for your error, if you attempted to use the default process for training the new tokenizer then the error you got is consistent with what this PR is attempting to fix.

This happens because the base model was not being expanded according to the new vocabulary. This results in the size mismatch you got.

The process is:

  1. Make a new vocab.json.
  2. Merge it with the base model's vocab.json.
  3. Freeze the base model except the embedding layers.
  4. Expand the model's embedding layers using the vocab you merged.

Then you can begin fine tuning with your newly expanded model and its vocab.
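(To make step 4 concrete: the expansion is essentially growing the text embedding and text head to the merged vocab size while keeping the original rows untouched. A rough sketch only, not the exact expand_xtts.py; the key names and shapes follow the size-mismatch traceback above, and everything else is an assumption:)

# Rough sketch only: grow gpt.text_embedding and gpt.text_head from the old vocab
# size to the merged vocab size, copying the original weights and initialising
# only the newly added rows.
import torch

def expand_text_layers(state_dict, new_vocab_size):
    emb = state_dict["gpt.text_embedding.weight"]      # [old_vocab, 1024]
    head_w = state_dict["gpt.text_head.weight"]        # [old_vocab, 1024]
    head_b = state_dict["gpt.text_head.bias"]          # [old_vocab]
    old_vocab, dim = emb.shape

    def grow(t, shape):
        new_t = torch.zeros(shape, dtype=t.dtype)
        if new_t.dim() == 2:
            torch.nn.init.normal_(new_t, std=0.02)     # small random init for new token rows
        new_t[:old_vocab] = t                          # keep the original weights intact
        return new_t

    state_dict["gpt.text_embedding.weight"] = grow(emb, (new_vocab_size, dim))
    state_dict["gpt.text_head.weight"] = grow(head_w, (new_vocab_size, dim))
    state_dict["gpt.text_head.bias"] = grow(head_b, (new_vocab_size,))
    return state_dict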

@Mixomo

Mixomo commented Oct 7, 2024

@IIEleven11

UPDATE:
I followed all your instructions, and although neither script nor the training gave any errors, the inference ends up being noise.

bug.all.talk-1.mp4

trainer_0_log.txt

The only thing I can mention is that I modified the scripts so that they can handle utf-8 files (since the language is Spanish and has accents).

https://gist.github.com/Mixomo/e6a82c6a373ed8a8925cc5eb12176d79

The base model was a custom one dedicated to Spanish, and while I'm not sure what exact version of XTTS V2 it was, I think it's the same, otherwise it wouldn't have let me train, right?

The version of Coqui AI TTS that I have is the new one that came out a few days ago, maybe that is the reason? Should I go back to the previous version?

What I will do now is to train with the original XTTS base model, to see if I get different results.

Thanks

expanded_vocab.json

@Mixomo

Mixomo commented Oct 7, 2024

UPDATE # 2:

Training from the original base model worked; however, I notice that the speech does not have the same flexibility as when training with the original tokenizer and the Spanish base model, as it skips words and/or syllables.

@IIEleven11
Author

Yeah, I wouldn't attempt to train a new tokenizer from a model that isn't the base model. It's not impossible, but there would be other nuances you would need to address.

As for the new model you made, I am glad it worked for you, although there were some errors. What I guess is happening is that the model is getting new vocabulary but not enough data to train/learn that vocabulary, which results in what you're hearing. The answer to this problem is just to provide it with a significant amount of training data.

For reference, I trained a model with a special token [whisper], where I gave it 40 hours of pure whispering. I had attempted it a few times prior with less data and got subpar results: it either had no idea what that token meant or would only work sometimes. So my theory is that you should be giving it somewhere between 30-40 hours or more to train on. I understand this is not a small number for the average person, but when we consider it relative to the amount of training data the base model had, and all other models in general, it is actually a very small number.

@erew123
Owner

erew123 commented Oct 14, 2024

@IIEleven11 I don't think I'm going to get a chance to test this for a while, so I'm happy just to pull it in. Obviously @Mixomo has tested it now and it clearly worked for them, so I'm sure it will be fine for most use cases. I had to put this statement up about my current situation, and I've been firefighting to try to deal with support issues on GitHub when I can.

I want to try to write some finetuning wiki content for people, probably a mix of the existing instructions, the video you linked, and I guess I should add any other detail. I can and have been writing the wiki https://github.com/erew123/alltalk_tts/wiki as I can do that with just a laptop. @IIEleven11 if you have any thoughts for anything to add, let me know, but I'm going to give things 48 hours to calm down here on GitHub, then I'm going to merge this in, assuming all is well and quiet again!

Thanks so much again!! :)

@erew123
Owner

erew123 commented Oct 14, 2024

@IIEleven11 Oh, not sure if this makes any sense to you or what you think about it: #368. I've not been able to look at this at all. I'm not suggesting you do anything, but if you have any thoughts on it, I'd be happy to hear them. Thanks

@erew123
Owner

erew123 commented Oct 14, 2024

@IIEleven11 Oh, and maybe this is something that idiap, who manage the Coqui scripts and the base Coqui code, need to look at, rather than anything in the finetuning here...

@IIEleven11
Author

Hope all is well man, no rush, life is life.

As far as teaching people how to train models: it's always more complex than it appears. The tokenizer video is great; I have another one on overfitting: https://www.youtube.com/watch?v=Gf5DO6br0ts. I've been training/finetuning models for a while now, and if I had to pick the single biggest factor in a quality model, it would be the dataset, by an extremely large amount. All of their time should be spent making sure it's pristine. As in: it's segmented well, the audio is clear/noiseless, it includes audio that spans the entire phonemic spectrum, it has a Gaussian distribution of audio length/text, a proper sample rate, etc.

I actually have a repo where I attempt to automate the dataset curation process. At its core it's a bit complex, but the idea is to abstract all of that away: https://github.com/IIEleven11/Automatic-Audio-Dataset-Maker.git. By default it spits out an XTTSv2 dataset format as well as a Hugging Face Hub dataset, so it should work for users right now, out of the box.

As for the Prodigy optimizer: I briefly looked it over, and while it appears to be a drop-in LR option and it works with PyTorch, I highly doubt actually implementing it with all of the AllTalk models will be a simple task. It is just a way to automatically adjust the learning rate, and there's no guarantee it will be better than manually adjusting the learning rate or using a scheduler. But if it is a simple drop-in addition/improvement, then sure, why not?

erew123 merged commit fbffe0a into erew123:alltalkbeta on Oct 20, 2024
@erew123
Owner

erew123 commented Oct 20, 2024

Hi @IIEleven11 I've fiiiiiiiiinally pulled in the PR :) I had a few busy days and a suspicion I may have updated something between your PR and the code base at some point, so I just wanted to check that before pulling it in (it appears I hadn't made any changes).

I had to make two small changes and also added a line to keep Gradio quiet about the fact that there is a new Gradio version to update to, etc. bb314fa

Over the next few days I'm hoping to get time to digest your suggestions for documentation and hopefully get something written, though I'm going to be doing a bit of catch-up first with other support requests, package version changes, etc. (probably going to test out PyTorch 2.4 and a couple of other things) and then get the documentation written.

Obviously merging the PR closes the PR, but I'll catch you back here, or feel free to catch me back here.

I just want to say thanks again for working on this! Thanks for being patient with me taking my time to merge the code in, etc.!

@IIEleven11
Author

Nice!
Oh, those changes look like what I do when I train models in the cloud. "test_sentences[]" is for TensorBoard; it makes the evaluation steps generate a sample audio file during training, which can be listened to mid-training in the TensorBoard server. The other change is code I add to start the script with a shared Gradio link, except it looks like I changed it back to what I thought it was.
This would be the code to create a shared Gradio link that can be accessed from a cloud instance:
demo.queue().launch(
    show_api=False,
    inbrowser=True,
    share=True,
    debug=False,
    server_port=7052,
    server_name="0.0.0.0",
)

It appears I left it like this:

demo.queue().launch(
    show_api=False,
    inbrowser=True,
    share=False,
    debug=False,
    server_port=7052,
    server_name="127.0.0.1",
)

which is the default and only works locally (127.0.0.1 and share=False).

Totally my oversight though; I meant to remove those. My bad.

Glad you're back though, let me know if you have any questions.

@erew123
Owner

erew123 commented Nov 10, 2024

@IIEleven11

I've made a couple of updates to finetuning. Nothing that overrides anything you have done. I still need to do a bit of work to repair the terminal console output (that I damaged).


Data validation section is massively improved:

You can edit/manage the metadata files all in the one page/interface now.

image


Wav files

All wav files now get dumped out, and torchaudio actually looks at the wav file length to do that (it's really quick):

image

along with an audio report on the wav files:

WAV Files Processing Report
===========================

This folder contains WAV files categorized by their duration for use with Coqui TTS.
Suitable files for voice cloning should be between 6 and 30 seconds in length.

Directory Structure:
------------------
- 'suitable': Contains files between 6-30 seconds - ideal for voice cloning
- 'too_short': Files under 6 seconds - may not contain enough voice characteristics
- 'too_long': Files over 30 seconds - may cause processing issues

Voice Sample Usage:
-----------------
1. The 'suitable' directory contains the ideal voice samples:
   - These files are ready to use for voice cloning
   - Copy your preferred samples to '/alltalk_tts/voices/' for use in the main interface
   - Clean, clear samples with minimal background noise work best
   - Consider using multiple samples to test which gives the best results

2. Files in 'too_long':
   - Can be used but may cause issues or inconsistent results
   - Recommended: Use audio editing software (like Audacity) to:
     * Split these into 6-30 second segments
     * Remove any silence, background noise, or unwanted sounds
     * Save segments as individual WAV files
     * Consider overlap in sentences for more natural breaks

3. Files in 'too_short':
   - Not recommended for voice cloning as they lack sufficient voice characteristics
   - If most/all files are here, consider:
     * Recording longer samples of continuous speech
     * Combining multiple short segments (if they flow naturally)
     * Using audio editing software to create longer cohesive samples
     * Aim for clear, natural speech between 6-30 seconds

Best Practices:
-------------
- Choose samples with clear, consistent speech
- Avoid background noise, music, or other speakers
- Natural speaking pace and tone usually work best
- Multiple samples of varying lengths (within 6-30s) can provide better results
- Test different samples to find which produces the best voice cloning results

Summary:
--------

Too Short files (44):
- interview_00000001.wav: 2.06 seconds
- interview_00000003.wav: 1.57 seconds
etc....

Too Long files (0):

Suitable files (13):
- interview_00000002.wav: 7.07 seconds
- interview_00000011.wav: 8.6 seconds
etc....

Notes:
------
- Files in 'suitable' are ready for use with Coqui TTS
- Files in 'too_short' are under 6 seconds and may need to be checked or excluded
- Files in 'too_long' are over 30 seconds and may need to be split or excluded

Please review the files in 'too_short' and 'too_long' directories.

WIKI Pages

I've taken a first-round shot at doing a simple guide and a very detailed guide, based on some of the things you pointed me towards in the past.

https://github.com/erew123/alltalk_tts/wiki/XTTS-Model-Finetuning-Guide-(Simple-Version)
https://github.com/erew123/alltalk_tts/wiki/XTTS-Model-Finetuning-Guide-(Advanced-Version)

I've used a mix of myself and AI to write it. I have given it a few reads, but there is a lot to get through; some of it is heavy going and above my pay grade.

You're welcome to give it a glance (if you want) and tell me anything to add/change/remove (not sure if someone else can edit the wiki or not). You are also welcome to completely ignore it :)

@erew123
Owner

erew123 commented Nov 14, 2024

@IIEleven11 For what it's worth, I have done a huge rework of training. I'll be uploading it soon, but I have quite a bit of work to do to post it up, mostly documentation updates. However, first off, all your bits remain the same. Most of the rework is for dataset generation, documentation, visuals and layout.

image

image

Help sections are detailed and easy to pick out:

image

image

As for the training guide, well, there's now a lot of it up there!!

image

and it's nice and detailed:

image

image

I've gone over everything in the documentation and interface..... It should be pretty decent!

Will get it posted in the next 24-48 hours.

@IIEleven11
Author

I have to commend you, you are extremely organized and thorough. It is an admirable quality. It looks really good man.

If I may give some constructive points of criticism, or concepts to maybe look into...

  • The ideal segmentation strategy is a Gaussian distribution of audio/text. In your case the range is 6-30 seconds, so most segments should be around the 18-second point and fan out from there to each end. There is a small issue though, because that range may be a bit off. We naturally speak in very short words or sentences in some cases, and we want the model to have examples of this type of speech. The general idea here is that we're trying to cover the entire spectrum of possible speech patterns, prosody, tones, etc.
  • Also, XTTSv2 has a token and length cap far under 30 seconds. It's confusing because within their code and documents it does appear that 30 seconds is allowed. I think what happened was they used functions from other models' source code, and those models allow for greater depth. Because you are supporting several different models, I would maybe add an asterisk noting that different models prefer different ranges and that 6-30 seconds is not absolute; it may be more or less than ideal. A general range I would use is 1.2 seconds to ~18 seconds.

Here is my specific solution for a Gaussian distribution of audio segment lengths (the code is at the end of this comment). It uses .srt transcription as part of the segmentation process, then forces the distribution. If you decide to dive deeper and give the end user the ability to do more advanced curation, I find it very effective.

A final point of criticism... The project is getting larger and more complex, and you're adding layers of abstraction that make it more difficult for advanced users to work with. For example, I wanted to train the XTTSv2 model recently. It was working locally but not in a cloud instance. The traceback/error output seemed to possibly be truncated, or maybe output somewhere else. After some print debugging I had to just move on to something else. My theory is that this is a Linux-specific issue; it was having trouble reading the model for some reason. Adding print debugging to larger and larger codebases that are not your own can sometimes not be so obvious.
Anyway, if I missed some debugging tools you provided, or if they just aren't there, then it might be worth making them more apparent or including some developer-level options to help debug.

Again though, well done. It looks great and is coming together nicely.

# Imports implied by the snippet below (pysrt, pydub, tqdm, numpy and pandas are third-party packages)
import os
import logging

import numpy as np
import pandas as pd
import pysrt
from pydub import AudioSegment
from tqdm import tqdm

logger = logging.getLogger(__name__)


def segment_audio_and_create_metadata(SRT_DIR_PATH, AUDIO_DIR_PATH, WAVS_DIR_PREDENOISE, PARENT_CSV, SPEAKER_NAME):
    """
    Audio segmentation using Gaussian distribution for segment durations.
    """
    logger.info("Starting audio segmentation and metadata creation...")
    os.makedirs(WAVS_DIR_PREDENOISE, exist_ok=True)
    metadata_entries = []


    def parse_srt(srt_file_path):
        """Parse the .srt file and return a list of subtitles with start and end times in seconds."""
        subtitles = pysrt.open(srt_file_path)
        subs = []
        for sub in subtitles:
            start_time = sub.start.ordinal / 1000.0
            end_time = sub.end.ordinal / 1000.0
            text = sub.text.replace('\n', ' ').strip()
            subs.append({'start': start_time, 'end': end_time, 'text': text})
        return subs


    def generate_gaussian_durations(total_duration, min_length=2, max_length=18):
        """Generate segment durations following a truncated Gaussian distribution."""
        mean = (min_length + max_length) / 2
        std_dev = (max_length - min_length) / 6
        durations = []
        accumulated = 0
        while accumulated < total_duration:
            duration = np.random.normal(mean, std_dev)
            duration = max(min(duration, max_length), min_length)
            
            remaining = total_duration - accumulated
            
            if remaining < min_length:
                if durations:
                    if durations[-1] + remaining <= max_length:
                        durations[-1] += remaining
                break
            if accumulated + duration > total_duration:
                remaining = total_duration - accumulated
                if min_length <= remaining <= max_length:
                    durations.append(remaining)
                elif remaining > max_length:
                    while remaining > 0:
                        if remaining > max_length:
                            durations.append(max_length)
                            remaining -= max_length
                        else:
                            if remaining >= min_length:
                                durations.append(remaining)
                            elif durations:
                                durations[-1] += remaining
                            break
                break
            durations.append(duration)
            accumulated += duration
        return durations


    def adjust_segments(subs, durations):
        """Adjust the segments to match the desired durations."""
        adjusted_segments = []
        i = 0
        num_subs = len(subs)
        start_time = subs[0]['start']
        
        while i < num_subs:
            if not durations:
                break
                
            segment_duration = durations.pop(0)
            target_end_time = start_time + segment_duration
            
            current_segment = {
                'start': start_time,
                'text': '',
                'end': start_time
            }
            
            while i < num_subs:
                current_segment['text'] += ' ' + subs[i]['text']
                current_segment['end'] = min(subs[i]['end'], start_time + 18)  # force max duration
                
                if subs[i]['end'] >= target_end_time or current_segment['end'] - current_segment['start'] >= 18:
                    break
                i += 1
                
            segment_duration = current_segment['end'] - current_segment['start']
            if 2 <= segment_duration <= 18:
                current_segment['text'] = current_segment['text'].strip()
                adjusted_segments.append(current_segment)

            i += 1
            if i < num_subs:
                start_time = subs[i]['start']

        return adjusted_segments

    srt_files = [f for f in os.listdir(SRT_DIR_PATH) if f.endswith('.srt')]
    for srt_file in tqdm(srt_files, desc="Processing audio files"):
        srt_file_path = os.path.join(SRT_DIR_PATH, srt_file)
        base_name = os.path.splitext(srt_file)[0]
        wav_file = base_name + '.wav'
        wav_file_path = os.path.join(AUDIO_DIR_PATH, wav_file)
        
        if not os.path.exists(wav_file_path):
            logger.warning(f'Audio file {wav_file} does not exist. Skipping.')
            continue

        subs = parse_srt(srt_file_path)
        if not subs:
            logger.warning(f'No subtitles found in {srt_file}. Skipping.')
            continue

        audio = AudioSegment.from_wav(wav_file_path)
        total_duration = len(audio) / 1000.0  # Convert to seconds
        durations = generate_gaussian_durations(total_duration)
        adjusted_segments = adjust_segments(subs, durations)

        # Process and export audio segments
        for idx, segment in enumerate(adjusted_segments):
            start_ms = segment['start'] * 1000
            end_ms = segment['end'] * 1000
            audio_segment = audio[start_ms:end_ms]
            output_filename = f"{base_name}_{idx+1}.wav"
            output_path = os.path.join(WAVS_DIR_PREDENOISE, output_filename)
            audio_segment.export(output_path, format="wav")
            
            metadata_entries.append({
                'audio': output_path,
                'text': segment['text'],
                'speaker_name': SPEAKER_NAME
            })
        
        logger.info(f'Processed {wav_file} into {len(adjusted_segments)} segments.')

    os.makedirs(PARENT_CSV, exist_ok=True)
    metadata_df = pd.DataFrame(metadata_entries)
    metadata_df.to_csv(os.path.join(PARENT_CSV, "metadata.csv"), sep='|', index=False)
    logger.info(f'Metadata saved to {os.path.join(PARENT_CSV, "metadata.csv")}')

@erew123
Owner

erew123 commented Nov 17, 2024

@IIEleven11 Hope you are well, and thanks for the feedback :) This was a lot more than a visual update with added documentation; it was actually a 40+ hour rewrite for the code alone, and I still have the wiki documentation to tidy up :/. The update may well already cover the things you mention above (examples below). Also, the code inside is a hell of a lot better documented now. If you think of anything else that's missing, let me know. And of course, thanks for your help in the past (whether you use it or not).

All the best

Linux

It should all be fine on Linux. I will have to double-check it, but I can't think of an issue. There are quite a lot of checks and balances in place now, though. I can't say nothing could go wrong, but it should be pretty solid. There are certainly extra checks in place for finding models and checking all the files are there/telling you if not.

PFC

I did update the PFC to give clear "this is what's wrong" messages, and there are quick help sections below it too.
image

Transcription Update

To be honest, the whole dataset transcription was re-written, pretty much from the ground up. I've changed over from Faster-Whisper, which dropped 1.5GB off the installation requirements. Re min/max audio, min was there in the back-end code, but I have added it as a user-selectable option and researched/set the audio length defaults for dataset/audio creation (thanks for the pointer). This is how transcription/dataset creation works now:

  • process all audio files through Silero VAD to accurately detect speech segments (see the rough sketch after this list)
  • merge or split these segments based on user-selected minimum and maximum durations (with 0.3s-0.5s buffers to protect speech edges). The buffer expands if needed.
  • each segment gets an initial Whisper transcription with word-level timestamps
  • segments are then validated and any audio duplicates get a second Whisper transcription pass to clear out incorrect transcriptions & duplicate files.
  • All segments undergo length validation (against the user-selected min/max) and transcription validation, and are split/merged if necessary
  • Bad segments are filtered out, preserving only high-quality audio-text pairs
  • Finally, the training/evaluation split into the metadata files gets done.
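For illustration only, the VAD step above is conceptually Silero VAD's speech-timestamp detection. A rough sketch of just that idea (not the actual finetune.py code; the file name and parameter values are placeholders):

# Rough sketch of the VAD stage only: load Silero VAD from torch.hub and get
# speech timestamps for one wav file. Values/paths are illustrative.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("elon_original.wav", sampling_rate=16000)   # example file name from the log above
speech_timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    min_speech_duration_ms=250,   # illustrative value; the real min/max lengths are user-configurable
)
for ts in speech_timestamps:
    print(ts["start"] / 16000, "->", ts["end"] / 16000, "seconds")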

Added Min/Max audio as user configurable (as you suggested).

image

Dataset generation debugging

I did quite a bit of work on debugging and logic. Info, warnings & errors are always logged; the rest is selectable as you wish.

image

[FINETUNE] [INFO] Initializing output directory: E:\newtest\alltalk_tts\finetune\testproject
[FINETUNE] [MODEL] Using device: cuda
[FINETUNE] [MODEL] Loading Whisper model: large-v3
[FINETUNE] [MODEL] Using mixed precision
[FINETUNE] [MODEL] Initializing Silero VAD
[FINETUNE] [GPU] GPU Memory Status:
[FINETUNE] [GPU] Total: 12282.00 MB
[FINETUNE] [GPU] Used:  10541.60 MB
[FINETUNE] [GPU] Free:  1740.40 MB
[FINETUNE] [INFO] Updated language to: en
[FINETUNE] [AUDIO] Found 1 audio files to process
[FINETUNE] [INFO] Processing: elon_original
[FINETUNE] [AUDIO] Original audio duration: 138.55s
[FINETUNE] [AUDIO] Processing with VAD
[FINETUNE] [SEG] Merged 0 segments into 21 segments with mid-range preference
[FINETUNE] [SEG] VAD processing: 21 original segments, 21 after merging
[FINETUNE] [SEG] Merged 0 segments into 21 segments with mid-range preference
[FINETUNE] [SEG] Merged 0 short segments
[FINETUNE] WARNING: [SEG] Segment too short (1.08s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.08s to 2.69s
[FINETUNE] WARNING: [SEG] Segment too short (1.17s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.17s to 3.61s
[FINETUNE] WARNING: [SEG] Segment too short (1.21s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.21s to 4.00s
[FINETUNE] WARNING: [SEG] Segment too short (1.26s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.26s to 4.00s
[FINETUNE] WARNING: [SEG] Segment too short (1.41s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.41s to 4.00s
[FINETUNE] WARNING: [SEG] Segment too short (1.28s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.28s to 4.00s
[FINETUNE] WARNING: [SEG] Segment too short (1.22s), attempting to extend
[FINETUNE] [SEG] Extended segment from 1.22s to 4.00s
[FINETUNE] [SEG] Processed chunk 0 (2.69s)
[FINETUNE] [SEG] Processed chunk 1 (3.61s)
[FINETUNE] WARNING: [SEG] Long segment: 18.18s
[FINETUNE] [SEG] Processed chunk 2 (18.18s)
[FINETUNE] [SEG] Processed chunk 3 (7.72s)
[FINETUNE] [SEG] Processed chunk 4 (2.87s)
[FINETUNE] [SEG] Processed chunk 5 (4.00s)
[FINETUNE] [SEG] Processed chunk 6 (4.00s)
[FINETUNE] [SEG] Processed chunk 7 (6.52s)
[FINETUNE] [SEG] Processed chunk 8 (9.65s)
[FINETUNE] [SEG] Processed chunk 9 (3.36s)
[FINETUNE] [SEG] Processed chunk 10 (4.00s)
[FINETUNE] [SEG] Processed chunk 11 (9.44s)
[FINETUNE] [SEG] Processed chunk 12 (9.38s)
[FINETUNE] [SEG] Processed chunk 13 (6.34s)
[FINETUNE] WARNING: [SEG] Long segment: 18.05s
[FINETUNE] [SEG] Processed chunk 14 (18.05s)
[FINETUNE] [SEG] Processed chunk 15 (4.00s)
[FINETUNE] [SEG] Processed chunk 16 (4.00s)
[FINETUNE] [SEG] Processed chunk 17 (5.65s)
[FINETUNE] [SEG] Processed chunk 18 (6.26s)
[FINETUNE] WARNING: [SEG] Long segment: 12.30s
[FINETUNE] [SEG] Processed chunk 19 (12.30s)
[FINETUNE] WARNING: [SEG] Long segment: 16.43s
[FINETUNE] [SEG] Processed chunk 20 (16.43s)
[FINETUNE] [AUDIO] Audio Processing Statistics:
[FINETUNE] [AUDIO] Total segments: 21
[FINETUNE] [AUDIO] Average duration: 7.55s
[FINETUNE] [AUDIO] Segments under minimum: 0
[FINETUNE] [AUDIO] Segments over maximum: 4
[FINETUNE] [AUDIO] Audio Processing Statistics:
[FINETUNE] [AUDIO] Total segments: 21
[FINETUNE] [AUDIO] Average duration: 7.55s
[FINETUNE] [AUDIO] Segments under minimum: 0
[FINETUNE] [AUDIO] Segments over maximum: 4
[FINETUNE] [DATA] Processing metadata and handling duplicates
[FINETUNE] [DUP] Found 9 files with multiple transcriptions
[FINETUNE] [DUP] wavs/elon_original_00000005.wav: 3 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000003.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000008.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000010.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000021.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000020.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000019.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000017.wav: 2 occurrences
[FINETUNE] [DUP] wavs/elon_original_00000018.wav: 2 occurrences
[FINETUNE] [DUP] Re-transcribing duplicate files to get best transcription
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000005.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000003.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000008.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000010.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000021.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000020.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000019.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000017.wav
[FINETUNE] [DUP] Re-transcribing wavs/elon_original_00000018.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000005.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000003.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000008.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000010.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000021.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000020.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000019.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000017.wav
[FINETUNE] [DUP] Updated transcription for wavs/elon_original_00000018.wav
[FINETUNE] [DUP] Cleaned up 9 duplicate entries
[FINETUNE] [DATA] Creating train/eval split
[FINETUNE] [DATA] Writing 20 training and 3 eval samples
[FINETUNE] [DATA] Successfully wrote metadata files
[FINETUNE] [MODEL] Training BPE Tokenizer
[FINETUNE] [MODEL] Initializing BPE tokenizer training
[FINETUNE] [MODEL] Training tokenizer on 408 words
[00:00:00] Pre-processing sequences       ██████████████████████████████████████████████████████████ 0 / 0
[00:00:00] Tokenize words                 ██████████████████████████████████████████████████████████ 201 / 201
[00:00:00] Count pairs                    ██████████████████████████████████████████████████████████ 201      /      201
[00:00:00] Compute merges                 ██████████████████████████████████████████████████████████ 203      /      203
[FINETUNE] [MODEL] Saved BPE tokenizer to E:\newtest\alltalk_tts\finetune\testproject\bpe_tokenizer-vocab.json
[FINETUNE] [INFO] Finalizing processing
[FINETUNE] [SEG] Files that were split due to length:
[FINETUNE] [SEG]   elon_original_00000005.wav: 10.04 seconds
[FINETUNE] [SEG]   elon_original_00000015.wav: 10.32 seconds
[FINETUNE] [GPU] GPU Memory After Cleanup:
[FINETUNE] [GPU] GPU Memory Status:
[FINETUNE] [GPU] Total: 12282.00 MB
[FINETUNE] [GPU] Used:  1153.41 MB
[FINETUNE] [GPU] Free:  11128.59 MB
[FINETUNE] [DATA] Dataset Generated. Either run Dataset Validation or move to Step 2

Training debugging

Although I had put some extra logic and debugging in, I expanded it a little tonight and put some dropdowns there so you could pick/choose what you want.

image

[FINETUNE] [MODEL] ✓ All required files found for model: xttsv2_2.0.3
[FINETUNE] [MODEL] ***********************
[FINETUNE] [MODEL] Training Configuration:
[FINETUNE] [MODEL] ***********************
[FINETUNE] [MODEL] - Language: en
[FINETUNE] [MODEL] - Epochs: 4
[FINETUNE] [MODEL] - Batch Size: 4
[FINETUNE] [MODEL] - Gradient Accumulation: 1
[FINETUNE] [MODEL] - Learning Rate: 5e-06
[FINETUNE] [MODEL] - Learning Rate Scheduler: CosineAnnealingWarmRestarts
[FINETUNE] [MODEL] - Optimizer: AdamW
[FINETUNE] [MODEL] - Number of Workers: 8
[FINETUNE] [MODEL] - Warm Up: False
[FINETUNE] [MODEL] - Max Audio Length: 242550
[FINETUNE] [GPU] ****************
[FINETUNE] [GPU] GPU Information:
[FINETUNE] [GPU] ****************
[FINETUNE] [GPU] - Device: NVIDIA GeForce RTX 4070
[FINETUNE] [GPU] - CUDA Version: 12.1
[FINETUNE] [GPU] - Total VRAM: 11.99GB
[FINETUNE] [GPU] - Free VRAM: 11.99GB
[FINETUNE] WARNING: [GPU] ******************************
[FINETUNE] WARNING: [GPU] IMPORTANT MEMORY CONSIDERATION
[FINETUNE] WARNING: [GPU] ******************************
[FINETUNE] WARNING: [GPU] Your available VRAM is below the recommended 12GB threshold.
[FINETUNE] WARNING: [GPU]
[FINETUNE] WARNING: [GPU] System-Specific Considerations:
[FINETUNE] WARNING: [GPU] - Windows: Will utilize system RAM as extended VRAM
[FINETUNE] WARNING: [GPU]   * Ensure sufficient system RAM is available
[FINETUNE] WARNING: [GPU]   * Recommended minimum: 24GB system RAM
[FINETUNE] WARNING: [GPU] - Linux: Limited to physical VRAM only
[FINETUNE] WARNING: [GPU]   * Training may fail with insufficient VRAM
[FINETUNE] WARNING: [GPU]   * Consider reducing batch size or using gradient accumulation
[FINETUNE] WARNING: [GPU]
[FINETUNE] WARNING: [GPU] For detailed memory management strategies and optimization tips:
[FINETUNE] WARNING: [GPU] 1. Refer to the 'Memory Management' section in the Training Guide
[FINETUNE] WARNING: [GPU] 2. Review the Pre-flight Check tab for system requirements
[FINETUNE] [DATA] *******************
[FINETUNE] [DATA] Dataset Statistics:
[FINETUNE] [DATA] *******************
[FINETUNE] [DATA] - Training samples: 20
[FINETUNE] [DATA] - Evaluation samples: 3
[FINETUNE] [DATA] - Using custom BPE tokenizer
[FINETUNE] WARNING: [DATA] Very small training dataset
[FINETUNE] WARNING: [DATA] Very small evaluation dataset
[FINETUNE] [INFO] **********************
[FINETUNE] [INFO] Project Configuration:
[FINETUNE] [INFO] **********************
[FINETUNE] [INFO] - Project Path: E:\newtest\alltalk_tts\finetune\testproject\training
[FINETUNE] [INFO] - Model Path: E:\newtest\alltalk_tts\models\xtts\xttsv2_2.0.3
[FINETUNE] [INFO] - Training Data: E:\newtest\alltalk_tts\finetune\testproject\metadata_train.csv
[FINETUNE] [INFO] - Evaluation Data: E:\newtest\alltalk_tts\finetune\testproject\metadata_eval.csv
[FINETUNE] [INFO] - Language: en
[FINETUNE] [INFO] - Found language file, using language: en
[FINETUNE] [INFO] - Batch Size: 4
[FINETUNE] [INFO] - Grad Steps: 1
[FINETUNE] [INFO] - Training Epochs: 4
[FINETUNE] [INFO] - Learning Scheduler CosineAnnealingWarmRestarts Parameters
[FINETUNE] [INFO] - {'T_0': 1, 'T_mult': 1, 'eta_min': 1e-06, 'last_epoch': -1}
[FINETUNE] DVAE weights restored from: E:\newtest\alltalk_tts\models\xtts\xttsv2_2.0.3\dvae.pth
[FINETUNE] [MODEL] Loading training samples...
[FINETUNE] Found 20 files in E:\newtest\alltalk_tts\finetune\testproject
[FINETUNE] [MODEL] Loaded 20 training and 3 eval samples

 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=E:\newtest\alltalk_tts\finetune\testproject\training\XTTS_FT-November-17-2024_04+00AM-f938f83

 > Model has 517360175 parameters


[FINETUNE] [MODEL] ********************************
[FINETUNE] [MODEL] Starting training the XTTS model
[FINETUNE] [MODEL] ********************************


 > EPOCH: 0/4
 --> E:\newtest\alltalk_tts\finetune\testproject\training\XTTS_FT-November-17-2024_04+00AM-f938f83
[FINETUNE] Sampling by language: dict_keys(['en'])

 > TRAINING (2024-11-17 04:00:23)

@Mixomo

Mixomo commented Dec 30, 2024

Hello, with the current version of AllTalk, is it still necessary to use the conversion and merge scripts to train a custom BPE tokenizer?
This has never been clarified, and the files aren't provided in the documentation/wiki; however, the option to train a custom BPE tokenizer is still in the webui.

It would also be nice to be able to train the BPE tokenizer without running the transcription stage, in cases where the user already has the dataset assembled in the path that AllTalk expects.

Thanks!
