Alltalkbeta #288
Conversation
Fixed vocab.json merge, new tokenizer for custom dataset, dataset cleaner.
Ignore my finetune.py script changes, I reverted them. So this solution worked with no slurred speech and no accent with the 2.0.2 model. I believe the accent with the 2.0.3 model was inherent to the base model and not specific to this solution. You'll see a new custom_tokenizer.py. This script needs a txt file that's produced by running your data through the extract_dataset_for_tokenizer.py script, which removes the first and third columns from the CSVs. The output will be your new custom dataset's vocab.json. Use this with the compare and merge script, then the expand_xtts script, and begin training. As far as the 2.0.3 model goes, it remains unknown and I fear will always remain that way, as Coqui has exited the party. So it might be wise to revert the default model from 2.0.3 back to 2.0.2. I had to do a lot of learning here, so I am cautious and open to the possibility I missed something, especially with the creation of the new tokenizer. So if anyone has anything to point out, please do.
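For anyone curious, the column-stripping step boils down to something like this. This is a simplified sketch, not the actual extract_dataset_for_tokenizer.py, and it assumes the usual pipe-separated "audio_file|text|speaker_name" Coqui metadata format; file names are just examples:

```python
# Simplified sketch of the column-stripping step (not the actual
# extract_dataset_for_tokenizer.py). Assumes pipe-separated Coqui-style
# metadata rows "audio_file|text|speaker_name"; dropping the first and third
# columns leaves just the transcript text the tokenizer is trained on.
import csv

def extract_text_for_tokenizer(metadata_paths, out_path="tokenizer_corpus.txt"):
    with open(out_path, "w", encoding="utf-8") as out:
        for path in metadata_paths:
            with open(path, encoding="utf-8", newline="") as f:
                for row in csv.reader(f, delimiter="|"):
                    if len(row) >= 2:                      # skip malformed rows
                        out.write(row[1].strip() + "\n")   # keep only the text column

extract_text_for_tokenizer(["metadata_train.csv", "metadata_eval.csv"])
```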
Hi @IIEleven11 Sorry it's taken a while to respond; some days I'm busy elsewhere and some days I wake up and there's 10+ messages to deal with before I even get to look at anything. If I'm interpreting what you've said correctly, it will work fine on the 2.0.2 model, but 2.0.3 goes a bit funny. The only differences I know of with the 2.0.3 model were that they introduced 2x new languages, which I think were Hungarian and Korean https://docs.coqui.ai/en/latest/models/xtts.html#updates-with-v2 But actually, they added 3x languages. Hindi was added too, but not documented anywhere apart from here https://huggingface.co/coqui/XTTS-v2#languages (that I ever found). As there is no difference in the training setup that identifies differences between the models (that I know of), do you think that means there is something different in the config.json or vocab.json that makes 2.0.3 funny to train? Apologies for the questions, I'm just digging into the knowledge you've learned and wondering if I can think of anything that may help solve the puzzle. That aside, thanks for all your work on this! I will test it soon. :)
Yeah, so check coqui-ai/TTS#3309 (comment). I am curious what would happen if we removed the other-than-English tokens from the vocab.json; they take up a very large amount of space. I would think it would allow for more English vocabulary and therefore a better English-speaking model. It will incur many requests asking for multilingual support though. The configs and vocabs for each version of the model are different; the 2.0.2 vocab has a smaller size and smaller embedding layer, so they aren't compatible for inference or training without adjusting the architecture of the model. There are a couple of other fine-tuning webuis that also default to 2.0.2, Daswer's fine-tuning webui for example. But yeah, more testing of course. I only used it with a single dataset. I think allowing the community to go at it would be a good solution for now, as we can only really confirm with more testing. We are somewhat working blind with whatever information Coqui left behind.
I can tell you why we both used the 2.0.2 model at the time of creating the interfaces. The 2.0.3 model had something bad/wrong released in the model's configuration (or something) that created very, very bad audio. The solution back then was to use 2.0.2, and Coqui did resolve 2.0.3 eventually, however it was just easier to stick with 2.0.2 at the time rather than re-code.
Ahh, I did see your comment back then, yeah. The accent within the voice could very well have been an error somewhere on my part; I don't want to remove that from the equation. The 2.0.3 model has pros and cons. I think it has a greater ability to meet a wider range of people's needs than 2.0.2 because it does have a slightly bigger vocab, but this means its potential is possibly lesser than 2.0.2. The big reason I'm hesitant to provide what I did to remove all but English tokens in the vocab.json is because I am not confident that I completely understood all the changes I made. While it did most certainly work, some of it I just looked at, said "that looks right", and moved on. Training models is really complex and I just want to make sure I'm not providing code that will give someone a harder time due to my ignorance.
Hi @IIEleven11 Hope you are keeping well. Apologies for not catching up with you, it's been a busy week for me with quite a few requests/issues with lots of things. Thanks for the updates above, do you think it's now time for me to merge/test this out? Thanks
Yeah, I would really love it if another developer would look into it with me. I've been trying to essentially reverse engineer Coqui's code and would love another mind to collaborate with. I have tested it a few more times since then. Adding vocabulary works as expected. One thing though: I am trying to add a new special token, which is proving to be a bit more nuanced. I would guess most users don't try to do this though, so it shouldn't be a problem for now.
I also saw you were deep into the conversation at one point in some really old commits. Do you know anything about the loss of the ability to prompt engineer the model between Tortoise and XTTS? Things like "[joy] it's nice to meet you!" would generate an emotional, joyous sentence. Tortoise can do it. The XTTSv2 paid API could do it. But now we can't do it. This is what I've been trying to solve. It would appear they removed this functionality from the open source versions. And because the Tortoise and XTTS models are nearly identical, I believe we could put the pieces together to get it back.
Hi @IIEleven11 Spent my morning cleaning up after spilling coffee all over my desk, computer, keyboard, wall, floor etc.... :/ so I lost a few hours of my day where I was hoping to respond properly, look into a few things etc. How annoying! Anyway, first off, I found this conversation earlier coqui-ai/TTS#3704 I wonder if that may be of interest?? As for emotions, I didn't know they HAD implemented them at some point in the past, but it must have been on the roadmap according to this coqui-ai/TTS#3255 and I can see it on the roadmap coqui-ai/TTS#378. To add to all this, eginhard https://github.com/eginhard is currently maintaining TTS and the Coqui scripts. He is not someone who worked for Coqui (as I understand it); he is just passionate about TTS and the Coqui model. He also appears to be doing quite a bit of work on the trainers/finetuning https://github.com/idiap/coqui-ai-TTS/commits/dev/ (yet to be released). I'm not sure how involved he may want to be with another project, but I suspect he knows quite a bit about the trainer and probably knows/has figured out quite a bit about the model. Maybe he might be a good person for us to ask a few questions (should he have time). I suppose we could pose any questions there, if you agree that could be a good path?
Awesome! Thanks for the leads. Yeah, that's a good idea. I did just make a breakthrough though that kind of confirms some of my theories. I trained an XTTSv2 model that can whisper using a custom special token "[whisper]". So I think this means we can technically make any special token, including for emotions. The only difference being that Tortoise can just do many emotions, and those tokens are nowhere to be found within its vocab.json, yet it knows exactly how to handle them. Anyway, my conclusion with this new tokenizer is that if people want to train new vocabulary they need a significant amount of data. 4 or 5 hours only works partially: the model will lose the ability to generate certain sounds while gaining the ability to say others. This is negated with more data. It looks like somewhere around 15 to 20 hours, give or take, would be more ideal.
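If anyone wants to try the same thing, adding the token itself is the easy part; something along these lines works, with the file names and token being just examples (the embedding layers still need expanding afterwards, as described further down in this thread):

```python
# Hedged sketch: add a custom control token such as "[whisper]" to the vocab
# file before merging/expanding. File names and the token are illustrative,
# not part of this PR. The XTTS vocab.json is a Hugging Face `tokenizers` file.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("vocab.json")
added = tok.add_special_tokens(["[whisper]"])   # returns how many tokens were new
print(f"added {added} token(s), vocab size is now {tok.get_vocab_size()}")
tok.save("vocab_with_whisper.json")

# Then prefix the transcripts that should use the style, e.g. in metadata.csv:
# wavs/0001.wav|[whisper] it's nice to meet you.|speaker_1
```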
Wow! Training it to emote, that's pretty cool! Re your conclusion though, that sounds similar to what I read about training an entirely new language into the model, without fully training all other languages at the same time. I imagine you need a hell of a lot of compute to build out a base model for this.
Hi @IIEleven11 Hope you are well. Apologies again, I'm struggling to get near code/deal with support at the moment. I don't want to air my life on the internet; however, for the past few months I have had an ongoing situation that has me traveling/away from my own home and computer, providing help/care for a family member. If you feel this should be merged in, I am happy to do so, as long as you feel it's bug-free. I can give it a run through when possible and check all works. If there is anything specific you would like me to try to look at or help you figure out, please give me a list of items and I will try to do so. I will get to it as soon as I can. All the best
Oh sorry, actually I have an update for it that solves the model losing the ability to speak specific words. We need to freeze the embeddings layers of the base model prior to training. After I push that, you could merge it, but it isn't integrated into your webui, so if anyone wants to use the process they would need to run each script alone. I could maybe work on integrating it with your code; I don't expect it to be too difficult (famous last words). I am just swamped with clients at the moment and am about to release my own personal project. If I can get to it though, I will.
Hi @IIEleven11 Hope you are keeping well! :) I'm back for a few days before heading off again. Sorry I haven't gotten around to this. Turns out when you go away for a while, there is quite a backlog of things to deal with when you return! Should I be pulling this merge in now and sending it live? Thanks
Sorry, yeah, I've been busy too. So I have done quite a bit of testing and the results are good. The asterisk though is that I did it with English and a single speaker. There will most certainly be nuances when fine-tuning with a different language. Also, I still haven't incorporated it into your interface. It's going to require a little bit of shuffling around and choosing which base model. But if you do want to merge it and let people who are capable of using the scripts run them as standalones for now, it should be fine. Maybe make a quick note in the UI that this whole process can still be a bit difficult to grasp. I tried to make it as automatic as possible, but the quality of their results is still going to depend upon their dataset and how they curated it. I would maybe point them to this video first, so they get a grasp of what they're actually doing: https://youtu.be/zduSFxRajkE?si=K2NF8V1wrR_RTfWH
@IIEleven11 Still not had an opportunity to pull this in, test etc. I'm still bouncing about like a ping-pong ball with my unwell family situation. What I have at least managed to do (without my main computer) is write a hell of a load of documentation on the Wiki, to try to keep my requests for information/support down. https://github.com/erew123/alltalk_tts/wiki I'm intending to pull down your updates on the finetuning, and I'll also do a larger section of the wiki on XTTS finetuning (probably mostly pulled from old written content and what's in finetuning, as well as linking to that video you gave above). If there is anything else you think I should include, let me know. Honestly, sorry and sorry for not pulling this in yet. It's just a case of getting time to properly test it, and as soon as I'm away for X days, I come back to 20+ emails from people on here (hence deciding it's time to write the wiki). I will get there, promise!!
Hello @IIEleven11 I'm moving my question from #362 to here. Before proceeding with the question: I have read this thread and saw that you put some instructions at the beginning, and I don't know if they still apply. My question is not about how it works per se, but whether AllTalk actually uses the trained BPE tokenizer at inference, or embeds it somehow in the vocab.json or in the weights. From what I was seeing, at fine-tuning time AllTalk always uses the vocab.json of the base model (original or custom), and if I then manually point inference to the path of the custom BPE vocab, it gives me a mismatch error. Thank you very much in advance.
The trained tokenizer and the tokenizer actually used at inference: (attachments from the original comment, not reproduced here).
If you clone the branch I used to send the PR, then those scripts should work for you. As for your error: if you attempted to use the default process for training the new tokenizer, then the error you got is consistent with what this PR is attempting to fix. It happens because the base model was not being expanded according to the new vocabulary, which results in the size mismatch you got. The process is: merge your vocab with the base model's vocab.json, then freeze the base model except the embedding layers, then expand the model's embedding layers using the vocab you merged. Then you can begin fine-tuning with your newly expanded model and its vocab.
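To make that concrete, here's a rough sketch of the expand step. This is not the exact expand_xtts.py; the checkpoint key names ("gpt.text_embedding.weight", "gpt.text_head.weight"/".bias") are my reading of the public XTTS v2 checkpoint layout, so double-check them against your own model.pth:

```python
# Hedged sketch of the expand step, not the actual expand_xtts.py.
# Assumptions: the checkpoint stores its state dict under "model", and the text
# embedding/head live at "gpt.text_embedding.weight", "gpt.text_head.weight"
# and "gpt.text_head.bias" (verify against your own file).
import json
import torch

def expand_text_embeddings(model_path, merged_vocab_path, out_path):
    ckpt = torch.load(model_path, map_location="cpu")
    state = ckpt["model"] if "model" in ckpt else ckpt

    # The merged vocab.json is a Hugging Face tokenizers file: model -> vocab.
    with open(merged_vocab_path, encoding="utf-8") as f:
        new_size = len(json.load(f)["model"]["vocab"])

    for key in ("gpt.text_embedding.weight", "gpt.text_head.weight"):
        old = state[key]
        if new_size > old.shape[0]:
            # Keep the original rows untouched, append rows for the new tokens.
            extra = old.mean(dim=0, keepdim=True).repeat(new_size - old.shape[0], 1)
            state[key] = torch.cat([old, extra], dim=0)

    bias = state.get("gpt.text_head.bias")
    if bias is not None and new_size > bias.shape[0]:
        state["gpt.text_head.bias"] = torch.cat(
            [bias, torch.zeros(new_size - bias.shape[0])], dim=0)

    torch.save(ckpt, out_path)

expand_text_embeddings("model.pth", "merged_vocab.json", "expanded_model.pth")

# During fine-tuning, the freezing described above would then look something like:
# for name, p in model.named_parameters():
#     p.requires_grad = "text_embedding" in name or "text_head" in name
```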
UPDATE (see the attached video, bug.all.talk-1.mp4): The only thing I can mention is that I modified the scripts so that they can handle UTF-8 files (since the language is Spanish and has accents). https://gist.github.com/Mixomo/e6a82c6a373ed8a8925cc5eb12176d79 The base model was a custom one dedicated to Spanish, and while I'm not sure what exact version of XTTS V2 it was, I think it's the same, otherwise it wouldn't have let me train, right? The version of Coqui AI TTS that I have is the new one that came out a few days ago; maybe that is the reason? Should I go back to the previous version? What I will do now is train with the original XTTS base model, to see if I get different results. Thanks
UPDATE # 2: Training from the original base model worked; however, I notice that speech does not have the same flexibility as when training with the original tokenizer and the Spanish base model, as it skips words and/or syllables.
Yeah, I wouldn't attempt to train a new tokenizer from a model that isn't the base model. It's not impossible, but there would be other nuances you would need to address. As for the new model you made, I am glad it worked for you, although there were some errors. What I guess is happening is that the model is getting new vocabulary but not enough data to train/learn that vocabulary, which results in what you're hearing. The answer to this problem is just to provide it with a significant amount of training data. For reference, I trained a model with a special token [whisper], where I gave it 40 hours of pure whispering. I had attempted it a few times prior with less data and got subpar results: it either had no idea what that token meant or would only work sometimes. So my theory is that you should be giving it somewhere between 30-40 hours or more to train on. I understand this is not a small number for the average person, but when we consider it in relation to the amount of training data the base model had, and all other models in general, it actually is a very small number.
@IIEleven11 I don't think I'm going to get a chance to test this for a while, so I'm happy just to pull it in. Obviously @Mixomo has tested it now and it clearly worked, so I'm sure it will be fine for most use cases. I had to put this statement up about my current situation and I've been firefighting to try to deal with support issues on GitHub when I can. I want to try to write some finetuning wiki stuff for people, probably a mix of the existing instructions, the video you linked, and I guess I should add any other detail. I can and have been writing the wiki https://github.com/erew123/alltalk_tts/wiki as I can do that with just a laptop. @IIEleven11 if you have any thoughts for anything to add, let me know, but I'm going to give things 48 hours to calm down here on GitHub, then I'm going to merge this in, assuming all is well and quiet again! Thanks so much again!! :)
@IIEleven11 Oh, not sure if this makes any sense to you or what you think about it: #368 I've not been able to look at this at all. I'm not suggesting you do anything, but if you have any thoughts on it, I'd be happy to hear them. Thanks
@IIEleven11 Oh, and maybe this is something that idiap, who manages the Coqui scripts and base Coqui scripts, needs to look at, rather than anything in the finetuning here...
Hope all is well man, no rush, life is life. As far as teaching people how to train models: it's always more complex than it appears. The tokenizer video is great. I have another one on overfitting: https://www.youtube.com/watch?v=Gf5DO6br0ts. I've been training/finetuning models for a while now, and if I had to pick the single biggest factor in a quality model, it would be the dataset, by an extremely large amount. All of their time should be spent making sure it's pristine. As in: it's segmented well, clear/noiseless audio, includes audio that spans the entire phonemic spectrum, has a Gaussian distribution of audio length/text, proper sample rate, etc. I actually have a repo where I attempt to automate the dataset curation process. At its core it's a bit complex, but the theory is abstracting all of that away: https://github.com/IIEleven11/Automatic-Audio-Dataset-Maker.git. By default it spits out an XTTSv2 dataset format as well as a Hugging Face Hub dataset, so it should work for users right now out of the box. As for the Prodigy optimizer: I briefly looked it over and, while it appears to be a drop-in LR option that works with PyTorch, I highly doubt actually implementing it with all of the AllTalk models will be a simple task. It is just a way to automatically adjust the learning rate, and there's no guarantee it will be more ideal than manually adjusting the learning rate/using a scheduler. But if it is a simple drop-in addition/improvement, then sure, why not?
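If it helps anyone reading this, a quick way to sanity-check sample rates and the clip-length distribution is something like the sketch below. It's just a rough audit; the "wavs/" folder layout is an assumption for the example, not anything AllTalk-specific:

```python
# Rough dataset audit sketch: report sample rates and the clip-length
# distribution so outliers show up before training.
import statistics
from pathlib import Path

import torchaudio

def audit_wavs(wav_dir="wavs"):
    durations, rates = [], set()
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        info = torchaudio.info(str(wav))          # header-only read, so it's quick
        rates.add(info.sample_rate)
        durations.append(info.num_frames / info.sample_rate)
    print(f"{len(durations)} clips, sample rates found: {sorted(rates)}")
    print(f"length (s): min {min(durations):.2f}  mean {statistics.mean(durations):.2f}  "
          f"stdev {statistics.pstdev(durations):.2f}  max {max(durations):.2f}")

audit_wavs()
```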
Hi @IIEleven11 I've fiiiiiiiiinally pulled in the PR :) I had a few busy days and a suspicion I may have updated something between your PR and the code base at some point, so I just wanted to check that before pulling it in (I hadn't made any changes, it appears). I had to make 2x small changes and also added a line to make Gradio quiet about the fact there is a new Gradio version and to update etc. bb314fa Over the next few days, I'm hoping to get time to digest your suggestions for documentation and hopefully get something written, though I'm going to be doing a bit of catch-up first with other support requests, package version changes etc. (probably going to test out Pytorch 2.4 and a couple of other things) and then get the documentation written. Obviously merging in the PR closes the PR, but I'll catch you back here (or feel free to catch me back here). I just want to say thanks again for working on this! Thanks for being patient with me taking my time to merge the code in etc!
Nice!
Totally my oversight though, I meant to remove those. My bad. Glad you're back though, let me know if you have any questions
I've made a couple of updates to finetuning. Nothing that overrides anything you have done. I still need to do a bit of work to repair the terminal console output (that I damaged).

Data validation section is massively improved: you can edit/manage the metadata files all in the one page/interface now.

Wav files: all wav files now get dumped out, and torchaudio actually looks at the wav file length to do that (it's really quick), along with an audio report on the wav files.
WIKI pages: I've taken a first-round shot at doing a simple guide and a very detailed guide, based off some of the things you pointed me towards in the past. https://github.com/erew123/alltalk_tts/wiki/XTTS-Model-Finetuning-Guide-(Simple-Version) I've used a mix of myself and AI to write it. I have given it a few reads, but there is a lot to get through; some of it is heavy going and above my pay grade. You're welcome to give it a glance (if you want) and tell me anything to add/change/remove (not sure if someone else can edit the wiki or not). You are also welcome to completely ignore it :)
@IIEleven11 For what it's worth, I have done a huge re-work of training. I'll be uploading it soon, but I have quite a bit of work to do to post it up, mostly documentation updates. However, first off, all your bits remain the same. Most of the rework is for dataset generation, documentation, visuals and layout. Help sections are detailed and easy to pick out. As for the training guide, well, there's now a lot of it up there!! And it's nice and detailed. I've gone over everything in the documentation and interface... It should be pretty decent! Will get it posted in the next 24-48 hours.
I have to commend you, you are extremely organized and thorough. It is an admirable quality. It looks really good man. If I may give some constructive points of criticism or concepts to maybe look into...
Here is my specific solution for a Gaussian distribution of audio-length segments. It uses .srt transcription as part of the segmentation process, then goes into forcing the distribution (a simplified sketch of the idea is below). If you decide to dive deeper and give the end user the ability to do more advanced-level curation, I find it very effective. A final point of criticism... the project is getting larger and more complex, and you're adding layers of abstraction that make it more difficult for advanced users to work with. For example, I wanted to train on the XTTSv2 model recently. It was working locally but not in a cloud instance. The traceback/error output seemed to be truncated, or maybe output somewhere else. After some print debugging I had to just move on to something else. My theory is that this is a Linux-specific issue; it was having trouble reading the model for some reason. Adding print debugging on larger and larger codebases that are not your own can sometimes not be so obvious. Again though, well done. It looks great and is coming together nicely.
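Roughly, the idea looks like this. This is a simplified sketch of the concept, not the exact code from my repo, and the durations/paths are illustrative:

```python
# Simplified sketch: read .srt cues and merge consecutive cues until each
# segment reaches a target length drawn from a clipped normal distribution,
# which is what "forcing the distribution" means here.
import random
import re

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def parse_srt(path):
    """Return (start_s, end_s, text) for every cue in an .srt file."""
    cues, block = [], []
    with open(path, encoding="utf-8") as f:
        for line in list(f) + [""]:
            if line.strip():
                block.append(line.strip())
            else:
                if len(block) >= 3 and (m := SRT_TIME.search(block[1])):
                    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                    cues.append((h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
                                 h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
                                 " ".join(block[2:])))
                block = []
    return cues

def merge_to_gaussian(cues, mean_s=8.0, std_s=2.5, min_s=3.0, max_s=14.0):
    """Greedily merge cues until each segment hits a randomly drawn target length."""
    def draw():
        return min(max_s, max(min_s, random.gauss(mean_s, std_s)))

    segments, texts, seg_start, target = [], [], None, draw()
    for start, end, text in cues:
        seg_start = start if seg_start is None else seg_start
        texts.append(text)
        if end - seg_start >= target:
            segments.append((seg_start, end, " ".join(texts)))
            texts, seg_start, target = [], None, draw()
    if texts:
        segments.append((seg_start, cues[-1][1], " ".join(texts)))
    return segments

# Each (start, end, text) segment can then be cut from the source audio and
# written out as a wav plus a metadata row.
```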
@IIEleven11 Hope you are well and thanks for the feedback :) This was a lot more than a visual update with added documentation; it was actually a 40+ hour re-write, code alone. Still have the WIKI documentation to tidy up :/. The update may well already cover off things you mention above (examples below). Also, the code inside is a hell of a lot more documented now. If you think of anything else that's missing, let me know. And of course, thanks for your help in the past (be it you use it or not). All the best

Linux: It should all be fine on Linux. I will have to double check it, but I can't think of an issue. There are quite a lot of checks and balances in place now though. I can't say nothing could go wrong, but it should be pretty solid. There are certainly extra checks in place for finding models and checking all the files are there/telling you if not.

PFC: I did update the PFC to give clear "this is what's wrong" messages, and also below it are quick help sections too.

Transcription update: To be honest, the whole dataset transcription was re-written, pretty much from the ground up. I've changed over from Faster-Whisper, which dropped 1.5GB off the installation requirements. Re min/max audio, min was there in the back-end code, but I have added it as user-selectable and researched/set the audio length defaults for dataset/audio creation (thanks for the pointer). This is how transcription/dataset creation works now:
Added min/max audio length as user-configurable (as you suggested).

Dataset generation debugging: I did quite a bit of work on debugging and logic. Info, warnings & errors are always logged out; the rest is selectable as you wish.
Training debugging: Although I had put some extra logic and debugging in, I expanded it a little tonight and put some dropdowns there so you can pick/choose what you want.
Hello, with the current version of AllTalk, is it still necessary to use the conversion and merge scripts to train a custom BPE tokenizer? It would also be nice to be able to train the BPE tokenizer without going through the transcription stage, in cases where the user already has the dataset assembled in the path that AllTalk expects. Thanks!
You'll see two scripts: compare_and_merge.py and expand_xtts.py.
I didn't do any integration with AllTalk, so these scripts are capable of running as-is, standalone.
Steps to use:
1. Run compare_and_merge.py to merge your custom tokenizer's vocab.json with the base model's vocab.json.
2. Run expand_xtts.py to expand the base model's embedding layers to match the merged vocab.
3. You now have an expanded base XTTSv2 model, "expanded_model.pth", and its pair, "expanded_vocab.json".
4. Remove the base XTTSv2 model from the file path "/alltalk_tts/models/xtts/xttsv2_2.0.3/model.pth".
5. Remove the base "vocab.json" from the file path "/alltalk_tts/models/xtts/xttsv2_2.0.3/vocab.json".
6. Place "expanded_model.pth" and "expanded_vocab.json" in place of the removed base model/vocab at "/alltalk_tts/models/xtts/xttsv2_2.0.3/" and rename them to "model.pth" and "vocab.json".
7. That's it; you can now begin fine-tuning (an optional sanity check is sketched below).
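If you want to be sure the expansion took before you start training, a quick check along these lines will surface the size mismatch early. Again, the checkpoint key name is my assumption from the XTTS v2 layout, so adjust as needed:

```python
# Optional sanity check (a sketch, not part of the PR): confirm the expanded
# checkpoint's text-embedding rows match the merged vocab size, which is the
# size mismatch the earlier errors in this thread complained about.
# The key name "gpt.text_embedding.weight" is assumed from the XTTS v2 layout.
import json
import torch

ckpt = torch.load("models/xtts/xttsv2_2.0.3/model.pth", map_location="cpu")
state = ckpt["model"] if "model" in ckpt else ckpt

with open("models/xtts/xttsv2_2.0.3/vocab.json", encoding="utf-8") as f:
    vocab_size = len(json.load(f)["model"]["vocab"])

emb_rows = state["gpt.text_embedding.weight"].shape[0]
status = "OK" if emb_rows == vocab_size else "MISMATCH - re-run the expand step"
print(f"vocab entries: {vocab_size}, embedding rows: {emb_rows} -> {status}")
```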
You'll find each file commented with more detail about what's going on. Finetune.py had an edit I was using to rotate the port because, when using an online instance, the port can stay blocked after I end the script. That causes the script to fail and I have to go in and change the port, so just setting a range from port # - port # fixes that issue. But I removed it as it's beyond the scope of this specific PR. I can send it in another if that's something you want to implement.