Trials of ideas. #38
Replies: 7 comments 9 replies
-
@rbruels Come to this Dev chat here! (sorry for all the confusing messages!)
-
My thoughts are:
-
@rbruels Sorry for the spamming. I didn't think I was going to get as far as I managed to get today! I think I said Happy New Year somewhere, but if I didn't, Happy New Year!

All your changes, along with my own, have been merged together and are now in the live "main". I had to make 2x small amendments to your code, and I added some filtering to the "demo" page, as well as made the box a bit friendlier (fixed size, lists all your voices, etc). The streaming is great! Awesome job and a big thanks from me! I'd love to get a whole page merged into one as I mentioned, with your streaming and the bits I've been doing. (I guess I'll be back to figuring out why SillyTavern won't integrate its JS code in a bit though, and I really don't have a clue with that.)

Also, thanks for what you did with the Admin page, that has made it much easier! It looks a bit cleaner and I managed to go through and clear up loads of spelling mistakes now it was easier to get into! That was high on my to-do list, but not in the urgent pile!

So if you want to have a go at any more bits, you can work off main now! :) I really appreciate the input/help! I wanted to ask if you want your name in the "thanks" area down the bottom of the built-in documentation? I didn't want to just assume. If you have any thoughts on my rambling stuff above, great! My demo page thing is in the templates folder. Either way, thanks so much!
-
Hey @erew123, happy new year! No worries, I appreciate the stream of consciousness. 😆 I think the chunking implementation is a cool add! It's perfect if you're not looking to play back in realtime, or if you want to preserve long passages. I'll check out all your changes and sync up to main.

More general topic: I have no idea what your bigger intentions are for this project. I stumbled across your Reddit post announcing the plugin because I was hunting down some info on fine-tuning XTTS, and I found gold! From where I'm sitting, what you've created is the most accessible, understandable way I've seen for noobs to learn and actually experiment/build with all the processes, tools, and weird little quirks of deep-learning-style TTS, plus voice cloning and even fine-tuning as a bonus. There are some projects out there that are improving specific components of TTS pipelines, and some projects that are demonstrating how to connect certain parts of those pipelines, but so far I've seen no project that has the right combination of pieces that you need to really build cool products on top.

The projects and information out there are pretty scattered and technically dense -- you kind of have to be a deep learning expert already to make meaningful progress. As a result, I think developers (or even just normal tech-savvy humans) get discouraged from experimenting with all this cool technology. But I think you have the potential to make this project the one-stop shop for doing modern TTS. It can do insanely cool stuff out of the box, it can act as your API for TTS-enabled products, and it also lets you learn and experiment with the deeper technical capabilities at a reasonable pace, with good documentation and examples. tbh this guy really nailed it: https://reddit.com/r/Oobabooga/comments/18tzwt4/can_you_get_coqui_tts_to_just_read_text_you_give/ People see the value of this tech, but it's hard to grok if you're not a total nerd in this space already. It'd be awesome to have a project that "just works", but then lets people customize and build to their liking. This might not be your dream for this at all, lol -- but it's what I see (and I think why the community seems hyped about the project). If that's how you see this, there's plenty of work we could take on together (this is totally just top of my mind).

tl;dr You've built something pretty magic for the community. If it's your goal to make this a really accessible way to learn/build/integrate TTS, then let's make it even better! And if not... let me know what your goals are so I can adapt! haha. I just want to support the project. btw, my skills are primarily in frontend, API backends, and infra, so I'm happy to take on more of that web/API side of things, sounds like that'll be a good balance. I am one of the aforementioned noobs to this hot new TTS/cloning space and you've already helped me learn a ton. My goal originally (and I'll still build this, to stress-test
-
@rbruels I'm blown away by your reply! Thanks! And it sounds like I've made some moves in the right direction at least.

Yes, that's pretty much my goal. Well, it kind of started as a "I think I could make this run a bit better on my system" and, well, it just spiralled out of control haha! To be honest, the response and uptake from people has been pretty amazing, and whilst I was just building a better TTS for Text-gen initially, along the way I figured it could be a lot more.

Let me tell you what I am and what I'm not. What I'm not is a coder; in fact, until about 30 days ago, I had never coded a single line of Python in my life. OK, maybe I am a coder now, I guess! I've had a very, very long career in IT though: a good few qualifications under my belt, run a lot of small and global projects, worked with a lot of well-known names, written a few industry qualifications, helped companies design their applications/flow/infrastructures, spent a lot of time looking at other people's code, etc. I'm a bit hard to nail down, so let's just say I'm an IT generalist, that's probably easier. I also understand that bridge you are talking about: too many things are overly complicated for your average person.

Let me tell you my next few-item hit list: I've nailed the narrator function to the wall today, so I'm 99% sure that one is done and dusted now (thank god... and crap, I've just tempted fate with that!).

Let me go down your list of bits. Just on that subject, one thing I have aimed for is to ensure that I keep the requirements in line with text-gen-webui's. Partly for simplicity, as most people won't have to worry about building more Python environments etc., but also because I don't want to be affecting other things running in text-gen-webui by overwriting their requirements. Likewise, I guess, I don't want to make it an unwieldy beast that chews all your system memory up. So those have been partly guiding principles for me... thus far.

Apologies, I'm rambling here and it's late. I guess the summary is: I'm happy to have your involvement, ideas, criticisms. I think it's important to keep the base functionality there. I don't want to turn it into an unwieldy behemoth that people struggle to run on their systems, so ultimately I've thought about making it more like a plug-in option thing, e.g. you want a different TTS engine, there's a script/plugin for that (or whatever). I've not quite got my head around it all yet, as it's been a busy month and my main goal was to make a solid base.

Not sure how you want to co-ordinate on anything. I figured I would throw this Dev area up here; at least it makes it kind of easy. Open to other ideas, but if you want to stick on here, I'm happy if you want to open individual topics rather than have one long chain. I'll leave it there, as I'm writing chapter and verse here!
-
OK... maybe I got sidetracked a little this morning...
-
@rbruels Hope you're keeping well. Just wanted to say, I think I've about got this nailed... so you may not want to spend any time on it. I've got a few bits to finish yet, but it's almost there. It plays back very smoothly... you would think you were streaming. It generates a few, caches them, then starts playing while the others are generated in the background. It queues them up, no gaps. You will be able to export the whole lot out, or if you want to re-gen any sections, you can, then export after.
-
@rbruels EDIT - You may need to read all the below, but the long and short is all your changes are incorporated into the live version! :) Thanks!
Loved your PR re streaming! Really awesome... I changed a little code, though I have not merged it yet; I wanted you to be happy with it first. I saw this at the command line:
C:\AI\text-generation-webui\installer_files\env\Lib\site-packages\transformers\generation\utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration ) warnings.warn(
My best guess is it's this line in tts_server.py (probably):

stream_chunk_size=20 if streaming else None,

So, being worried that one day XTTS generation will conk out for everyone, no matter how they are generating, I've changed that bit of code to give both options on the "if streaming".
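Stripped down, the idea is something like the below. This is only a sketch -- the function and parameter names follow the Coqui XTTSv2 calls and are placeholders, so they may not line up exactly with what's actually in tts_server.py on the tests branch.

```python
# Sketch only: names follow the Coqui XTTSv2 API and may not match tts_server.py exactly.
def generate_audio(model, text, language, gpt_cond_latent, speaker_embedding, streaming):
    if streaming:
        # Streaming path: inference_stream() yields audio chunks as they're produced,
        # so stream_chunk_size only ever appears here.
        return model.inference_stream(
            text,
            language,
            gpt_cond_latent,
            speaker_embedding,
            stream_chunk_size=20,
        )
    # Non-streaming path: plain inference(), with no stream_chunk_size passed at all.
    return model.inference(
        text,
        language,
        gpt_cond_latent,
        speaker_embedding,
    )
```

Hopefully that also means the non-streaming path never touches whatever setting the transformers warning is grumbling about, though that's still just my best guess.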
I have merged all your code, with the above AND my changes, here: https://github.com/erew123/alltalk_tts/tree/tests
My plan is, if you think it's all good, I will PR your code over to Dev and then merge mine in etc...
Here is something I was working on, but I've not got it fully working yet.
It's not streaming audio, but what it does do is allow you to dump in a huge chunk of text, then choose how big a chunk to send to the TTS engine via the API.
You can choose if it plays back the audio on the server side (which has a pacing issue) OR you can play the audio back in your browser, and because it queues them up in the browser, it doesn't have any gaps, so it sounds like you are listening to a complete flowing text. In theory, you could copy an entire book in there and have it play back (don't try that, but you could try a good page or two).
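Just to be clear what I mean by "how big a chunk": it's really just splitting the text up on sentence boundaries before sending each piece off to the API, along these lines (an illustrative sketch with made-up names, not the actual code in my page):

```python
import re

def split_into_chunks(text, max_chars=400):
    """Split a long passage into sentence-aligned chunks of roughly max_chars,
    so each piece can be sent to the TTS API as its own request."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk once adding the next sentence would overshoot the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```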
It mostly works; however, on playback in the browser there is a stutter issue on the first playback (or on some message playbacks) as it sends off further requests to generate TTS ahead of time.
I had debated creating logic that would generate 3x messages ahead of the audio we are currently listening to, download those, and store them temporarily for playback. When the queue drops to 2, it would send off another generation request so we are 3x ahead again, with the playback thread and the generation thread separated in the browser, so that you don't get the pause/stutter.
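In rough Python terms (the real thing would have to be JS in the browser, and the URL/endpoint and request shape here are only placeholders), the logic I was picturing is roughly:

```python
import queue
import threading

import requests

TTS_API_URL = "http://127.0.0.1:7851/api/tts-generate"  # placeholder URL/endpoint
AHEAD = 3  # stay this many generated chunks ahead of playback

# A bounded queue gives the "drop to 2, top back up to 3" behaviour for free:
# the generator blocks whenever it is already AHEAD chunks in front of playback.
audio_buffer = queue.Queue(maxsize=AHEAD)

def generate_ahead(text_chunks):
    """Producer thread: request TTS for each chunk, keeping up to AHEAD buffered."""
    for chunk in text_chunks:
        response = requests.post(TTS_API_URL, data={"text": chunk})  # placeholder request shape
        audio_buffer.put(response.content)  # blocks while we are already far enough ahead
    audio_buffer.put(None)  # sentinel: nothing left to generate

def play_queued(play_fn):
    """Consumer thread: play chunks back-to-back so there are no gaps or stutters."""
    while True:
        wav_bytes = audio_buffer.get()
        if wav_bytes is None:
            break
        play_fn(wav_bytes)  # play_fn is whatever actually outputs the audio

# Usage sketch:
# threading.Thread(target=generate_ahead, args=(chunks,), daemon=True).start()
# play_queued(my_playback_function)
```

Keeping generation and playback on separate threads like that is the whole point: playback only ever pulls from the buffer, so it never waits on a generation request mid-sentence.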
Specifically for this, in tts_server.py I created an API endpoint, /api/audiocache, that will just forcefully dump the generated file at you (though the code I am sending isn't using that currently; the idea was to use this caching for the 3x-message-ahead method, if I could get it working). I also had to modify the CORS settings to make this work, so you would need this copy of tts_server.py.
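Stripped right down, the endpoint plus the CORS change look something like this (assuming a FastAPI-style app object; the output folder and parameter name here are placeholders rather than the exact code in the tests branch):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import FileResponse

app = FastAPI()

# CORS has to be opened up so the browser page (served from elsewhere) is allowed
# to fetch audio straight from the TTS server. Origins could be tightened later.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/api/audiocache")
async def audiocache(filename: str):
    # Just hands back an already-generated wav so the browser can grab files
    # ahead of time and queue them up for gapless playback.
    return FileResponse(f"outputs/{filename}", media_type="audio/wav")
```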
I got so far into writing and testing that other code... and I'm giving it a break for now.
If you're interested in taking a look, pull everything from here: https://github.com/erew123/alltalk_tts/tree/tests (quite a few files are updated).
I'd love to kind of merge together the thing I was doing and this. Let me explain my thinking with what I sent you above; maybe you have some thoughts on this (I'd welcome any insight/ideas if you have time, and criticism if needed, ha). BTW, as you have probably guessed, I'm not a web coder!!
So the API will now play sound wherever the script is running, so I thought it was a good idea to have that option available through a web page, hence the page I included.
Some people have been asking to be able to play large chunks of audio and potentially generate all the wav files so they can later compile them into one big wav file: think audiobooks or maybe uni lecture notes, maybe even reading back what they wrote for proofreading purposes. So this is why I was working on a non-streaming version that could play back to the browser, but still generate all the wav files, for them to do with as they please later on!
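And once all the individual wav files exist, gluing them into one big file is straightforward with Python's standard wave module. A rough sketch, with made-up filenames:

```python
import wave

def combine_wavs(parts, out_path="combined.wav"):
    """Concatenate wav files (same sample rate/width/channels) into one file."""
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(parts):
            with wave.open(path, "rb") as src:
                if i == 0:
                    # Copy sample rate / channels / sample width from the first file.
                    out.setparams(src.getparams())
                out.writeframes(src.readframes(src.getnframes()))
    return out_path

# e.g. combine_wavs(["part_0001.wav", "part_0002.wav", "part_0003.wav"], "audiobook.wav")
```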
Having a streaming audio option as well would be amazing! I wonder how much text you can pump into it at once?
I don't know how far you are willing to go with this... What you've done is already fantastic! If you're willing and able to do a bit more... great! If not, that's cool too! But if you do have time, have a think about smashing the demo page and the page I sent you together. It's up to you! :)
Sorry if this is a rambling message, hopefully some of it makes sense!