Trials of ideas. #38

erew123 · 2024-01-01T15:08:32Z

erew123
Jan 1, 2024
Maintainer

@rbruels EDIT - You may need to read all the below, but the long and short is all your changes are incorporated into the live version! :) Thanks!

Loved your PR re streaming! Really awesome... I changed a little code, though I have not merged it yet, I wanted you to feel happy about this. I saw this at the command line....

C:\AI\text-generation-webui\installer_files\env\Lib\site-packages\transformers\generation\utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration ) warnings.warn(

My best guess is its this stream_chunk_size=20 if streaming else None, in the tty_server.py line (probably)

So being worried that one day XTTS generation will conk out for everyone, no matter how they are generating. I've changed that bit of code as follows, to give both options on the "if streaming".

# Common arguments for both functions
common_args = {
    "text": text,
    "language": language,
    "gpt_cond_latent": gpt_cond_latent,
    "speaker_embedding": speaker_embedding,
    "temperature": float(params["local_temperature"]),
    "length_penalty": float(model.config.length_penalty),
    "repetition_penalty": float(params["local_repetition_penalty"]),
    "top_k": int(model.config.top_k),
    "top_p": float(model.config.top_p),
    "enable_text_splitting": True
}

# Determine the correct inference function and add streaming specific argument if needed
inference_func = model.inference_stream if streaming else model.inference
if streaming:
    common_args["stream_chunk_size"] = 20

# Call the appropriate function
output = inference_func(**common_args)

# Process the output based on streaming or non-streaming
if streaming:
    # Streaming-specific operations
    file_chunks = []
    wav_buf = io.BytesIO()
    with wave.open(wav_buf, "wb") as vfout:
        vfout.setnchannels(1)
        vfout.setsampwidth(2)
        vfout.setframerate(24000)
        vfout.writeframes(b"")
    wav_buf.seek(0)
    yield wav_buf.read()

    for i, chunk in enumerate(output):
        file_chunks.append(chunk)
        if isinstance(chunk, list):
            chunk = torch.cat(chunk, dim=0)
        chunk = chunk.clone().detach().cpu().numpy()
        chunk = chunk[None, : int(chunk.shape[0])]
        chunk = np.clip(chunk, -1, 1)
        chunk = (chunk * 32767).astype(np.int16)
        yield chunk.tobytes()
else:
    # Non-streaming-specific operation
    torchaudio.save(output_file, torch.tensor(output["wav"]).unsqueeze(0), 24000)

I have merged all your code, with the above AND my changes https://github.com/erew123/alltalk_tts/tree/tests

My plan is if you think its all good, I will PR you code over to Dev and then merge mine in etc...

Here is something I was working on, but I've not got fully working yet.

Its not streaming audio, but what it does do, is allow you to dump in a huge chunk of text, then you can choose how bigger chunk to send to the TTS engine via the API.

You can choose if it plays back the audio on the server side (which has a pacing issue) OR you can play the audio back in your browser, and because it queues them up in the browser, it doesn't have any gaps, so it sounds like you are listening to a complete flowing text. In theory, you could copy an entire book in there and have it play back (dont try that, but you could try a good page or two).

It mostly works, however on playback in the browser, there is a stutter issue on the first or some message playbacks as it sends off further requests to generate TTS ahead of time.

I had debated creating logic that would generate 3x messages ahead of the audio we are currently listening to, download those and store them temporarily for playback. When the queue drops to 2, send off another generation request so we are 3x ahead again, and separate the playback thread and the generate thread in the browser, so that you don't have the pause/stutter.

Specifically for this, in tts_server.py I created an api endpoint that will just forcefully dump the generated file at you /api/audiocache (though the code I am sending isnt using that currently. The idea was to use this caching for the 3x message ahead method, if I could get it working). I also had to modify the corrs settings to make this work, so you would need this copy of tts_server.py

I got so far into writing testing that other code... and I'm giving that a break for now.

If you're interested in taking a look, pull everything from here https://github.com/erew123/alltalk_tts/tree/tests (quite a few files are updated)

Id love to kind of merge together the thing I was doing and this. Let me explain my thinking with what I sent you above, maybe you have some thoughts on this (Id welcome any insight/ideas, if you have time and criticism if needed ha) BTW as you have probably guessed, I'm not a web coder!!.

So the API will now play sound wherever the script is running, so I thought it was a good idea to have that option through a web page hence my page I included.

Some people have been asking to be able to play large chunks of audio and potentially generate all the wav files so they can later compile them into one big wav file, think audio books or maybe uni lecture notes, maybe even reading back what they wrote for proof reading purposes. So this is why I was working on a non-streaming version that could play back to the browser, but still generate all the wav files, for them to do with as they please later on!

Having a streaming audio option as well would be amazing! I wonder how much text you can pump into it at once?

I don't know how far you are willing to go with this.... What you've done is already fantastic! If you're willing and able to do a bit more... Great! If not, that's cool too! But if you do have time and have thoughts about smashing the demo page and the page I sent you together. Its up to you! :)

Sorry if this is a rambling message, hopefully some of it makes sense!

That way if they do change something one day, only the streaming portion would fail, but standard wav file gen would keep going. At least it would only affect a % of people in that instance. What do you think?

erew123 · 2024-01-01T15:11:13Z

erew123
Jan 1, 2024
Maintainer Author

@rbruels Come to this Dev chat here! (sorry for all the confusing messages!)

0 replies

erew123 · 2024-01-01T16:02:34Z

erew123
Jan 1, 2024
Maintainer Author

My thoughts are:

One page where you can dump a load of text in.
Two choices, you can stream generate audio (which is only local and happens in the browser) OR you can chunk away at audio (like my code) which will generate all the wav files and people can combine those those later on if they like.
If chunking you can play remotely (where the script is running) or in the local browser.
It will need a pause option and a stop generating option (chunking already does this, not sure how that works with streaming).
The code to do the stripping/clean-up of text from my "page" needs to be added, because this cleans up lots of strange noise production. This will get most of the way to a clean TTS Have added this to the admin.html

0 replies

erew123 · 2024-01-01T20:43:41Z

erew123
Jan 1, 2024
Maintainer Author

@rbruels Sorry for the spamming. I didn't think I was going to get as far as I managed to get today! I think I said Happy New Year somewhere, but if I didn't, Happy New Year!

All your changes, along with my own have been merged together and are now in the live "main". I had to make 2x small amendments to your code, and I added some filtering to the "demo" page, as well as made the box a bit friendlier (fixed size, lists all your voices etc). The streaming is great! Awesome Job and a big thanks from me! Id love to get a whole page merged into one as I mentioned, with your streaming and the bits I've been doing. (I guess Ill be back to figuring out why SillyTavern wont integrate its JS code in a bit though, and I really don't have a clue with that)

Also, thanks for what you did with the Admin page, that has made it much easier! It looks a bit cleaner and I managed to go through and clear up loads of spelling mistakes now it was easier to get into! That was high on my to do list, but not in the urgent pile!

So if you want to work on any more or have a go at any more bits, you can work off main now! :) I really appreciate the input/help!

I wanted to ask if you want your name in the "thanks" etc area down the bottom of the built in documentation? I didn't just want to assume.

If you have any thoughts on my rambling stuff above, great! My demo page thing is in the templates folder.

Either way, thanks so much!

0 replies

rbruels · 2024-01-01T22:26:05Z

rbruels
Jan 1, 2024

Hey @erew123, happy new year! No worries, I appreciate the stream of consciousness. 😆 I think the chunking implementation is a cool add! It's perfect if you're not looking to playback in realtime, or if you want to preserve long passages. I'll check out all your changes, sync to the dev branch, and provide any feedback later tonight for sure. Your ideas for the UX seem solid from an initial read, but I'll think on it once I play with the new functionality a bit.

More general topic: I have no idea what your bigger intentions are for this project. I stumbled across your Reddit post announcing the plugin because I was hunting down some info on fine-tuning XTTS, and I found gold! From where I'm sitting, what you've created is the most accessible, understandable way I've seen for noobs to learn and actually experiment/build with all the processes, tools, and weird little quirks of deep-learning-style TTS, plus voice cloning and even fine-tuning as a bonus.

There are some projects out there that are improving specific components of TTS pipelines, and some projects that are demonstrating how to connect certain parts of those pipelines, but so far I've seen no project that has the right combination of:

an out of the box API
understandable demo apps
presets/defaults for some of the complex stuff you don't need to understand on day 1
good sample voices and example code
and human-readable documentation

that you need to really build cool products on top.

The projects and information out there are pretty scattered and technically dense -- you kind of have to be a deep learning expert already to make meaningful progress. As a result, I think developers (or even just normal tech-savvy humans) get discouraged from experimenting with all this cool technology. But with alltalk, you can just... jump right in!

I think you have the potential to make this project the one-stop shop for doing modern TTS. It can do insanely cool stuff out of the box, it can act as your API for TTS-enabled products, and also lets you learn and experiment with the deeper technical capabilities at a reasonable pace, with good documentation and examples.

tbh this guy really nailed it: https://reddit.com/r/Oobabooga/comments/18tzwt4/can_you_get_coqui_tts_to_just_read_text_you_give/

People see the value of this tech, but it's hard to grok if you're not a total nerd in this space already. It'd be awesome to have a project that "just works", but then lets people customize and build to their liking.

This might not be your dream for this at all, lol -- but it's what I see (and I think why the community seems hyped about the project). If that's how you see this, some of the work we could take on (this is totally just top of my mind):

Generalize the API to the fundamentals, and move some of the "feature" capabilities into a higher layer. Right now tts_server is a mix of API-type functionality and actual product-ish logic, and the API has different functionality based off where/how you run it. A few dumb examples, some of which I'm sure you're already planning:
- The API has a param autoplay to set whether or not it should play audio on the local device. If I deployed the API to an AWS container, this would have totally undefined behavior. Instead, I'd propose removing that from the API, but creating a separate script/app/local server that calls the API and plays the result locally.
- The wav outputs are stored on disk, and the API expects arguments like a file name. The API should completely abstract and manage its own storage (could be on disk, in memory, etc), but a higher-level script/app/local server could use the API outputs to save to disk.
Improve the load/reload and configuration logic so I can change and test more parameters without restarting the script
Remove as much hardcoded stuff as possible (for example, making models/names chooseable -- when I fine-tuned my model, I had to rename xttsv2_2.0.2 to xttsv2_2.0.2_old and rename my model to xttsv2_2.0.2).
Continue to expand the API and improve the UI on top of some of the current scripts. For example, I think you would blow minds if finetune.py had a lightweight API and demo UI to it. I don't know if you realize how much simpler you've already made fine-tuning for noobs (especially abstracting the ASR/sentence-cutting/dataset creation stuff)! People have been looking for "ElevenLabs-level quality" and you basically delivered it in one script! :)
Keep building out really cool demos/features on top of the API like the chunking playback and the narrator capability etc etc... stuff people want, that works out of the box, but also helps people learn how to build cool stuff.
Decouple everything completely from text-generation-webui. Maybe this is actually done already? I know I've been running standalone for the last week. Integrating with SillyTavern will probably be a good test of this.
Build a Microsoft Word plugin for that person above (kidding. mostly. 😆)

tl;dr You've built something pretty magic for the community. If it's your goal to make this a really accessible way to learn/build/integrate TTS, then let's make it even better! And if not... let me know what your goals are so I can adapt! haha. I just want to support the project.

btw, my skills are primarily in frontend, API backends, and infra, so I'm happy to take on more of that web/API side of things, sounds like that'll be a good balance. I am one of the aforementioned noobs to this hot new TTS/cloning space and you've already helped me learn a ton. My goal originally (and I'll still build this, to stress-test alltalk) was to fine-tune a specific voice model (done ✔️) and expose a streaming API for it so I could use it in a real-time conversational app. I can keep using that use case to start stress-testing and improving the APIs, and we can take more ideas/features from the community too.

</walloftext>

0 replies

erew123 · 2024-01-02T00:57:44Z

erew123
Jan 2, 2024
Maintainer Author

@rbruels I'm blown away with your reply! Thanks! And it sounds like I've made some moves in the right direction at least.

People see the value of this tech, but it's hard to grok if you're not a total nerd in this space already. It'd be awesome to have a project that "just works", but then lets people customize and build to their liking.

Yes that's pretty much my goal. Well, it kind of started as a "I think I could make this run a bit better on my system" and, well, it just spiralled out of control haha! To be honest, the response and uptake from people has been pretty amazing and whilst I was just building a better TTS for Text-gen initially, along the way I figured it could be a lot more.

Let me tell you what I am and what I'm not. What I'm not, is a coder, in fact, until about 30 days ago, I had never coded a single line of python in my life. Ok, maybe I am a coder now, I guess! I've had a very very long career in IT though, a good few qualifications under my belt, run a lot of small and global projects, worked with a lot of well known names, written a few industry qualifications, helped companies design their applications/flow/infrastructures, spent a lot of time looking at other peoples code etc. I'm a bit hard to nail down, so lets just say I'm an IT generalist, that's probably easier.

I also understand that bridge you are talking about:

The projects and information out there are pretty scattered and technically dense -- you kind of have to be a deep learning expert already to make meaningful progress. As a result, I think developers (or even just normal tech-savvy humans) get discouraged from experimenting with all this cool technology. But with alltalk, you can just... jump right in!

Too many things are overly complicated for your average person and just stick the widget in the dongle makes no sense to them (heck, sometimes to me too). So yes, my goal has partly been to make it simple, accessible, but also flexible. Plus, that also keeps your support issues down if its well documented and easy hah!

Let me tell you my next few item hit list:

I wanted to get SillyTavern working as I've had so many requests for it. I've got the scripts written and they half work. ST pulls down the voices etc, but I cannot for the life of me figure out why ST isn't calling my function to send out the request to AT. Everything looks right, but ATM, I cant figure what gives there. I thought I might contact the ST devs in a couple of days (post NY) see if they could give me a friendly hint! (you're welcome to have a look too if you want).
Id like to figure something about what we mentioned re bulk TTS generation (like my recent addition or your streaming addition). Lots of people really seem to want something like this. I got a little further with this, Im just trying to work on a threading issue with playback when it caches the wavs.... but if you have any ideas if you take a look at what I sent, great!
I was going to give Mac Metal GPU hardware acceleration a shot at some point. It should be pretty simple to implement . I just need to find someone with a Mac and Ill have to code around a couple of things.

I've nailed the narrator function to the wall today, so I'm 99% sure that one is done and dusted now (thank god ....and crap, Ive just tempted fate with that!).

Let me go down your list of bits:

The API has a param autoplay to set whether or not it should play audio on the local device. If I deployed the API to an AWS container, this would have totally undefined behaviour. Instead, I'd propose removing that from the API, but creating a separate script/app/local server that calls the API and plays the result locally.
Debated this, I'm neither for nor against. If Autoplay is false, then the script will do nothing with it and nothing will happen. It was just a concept idea that some people are firing things at the API and wanting their "TTS server" to be the thing making the noise. So I introduced this as the option and bunged a local play from the script in there. Im also happy to make a simpler endpoint if needed that needs less inputs. That may be another route. To be honest, there is some overlap going on with endpoints that I really should tidy up.

The wav outputs are stored on disk, and the API expects arguments like a file name. The API should completely abstract and manage its own storage (could be on disk, in memory, etc), but a higher-level script/app/local server could use the API outputs to save to disk. Improve the load/reload and configuration logic so I can change and test more parameters without restarting the script
Id have to think on this... when I'm more awake. I think I know what youre hitting at.

Remove as much hardcoded stuff as possible (for example, making models/names chooseable -- when I fine-tuned my model, I had to rename xttsv2_2.0.2 to xttsv2_2.0.2_old and rename my model to xttsv2_2.0.2).
Yes I agree and I did make some changes to the finetuning, there's no more manual copying things (if you look at the version from about 2 days ago), though I still think it could be improved further. I would have preferred to make the interface a web one, served from AT perhaps (would have to unload the model so that the training can load it in, but that's easy to do), Gradio was just easier as there were snippets of code around.

Continue to expand the API and improve the UI on top of some of the current scripts. For example, I think you would blow minds if finetune.py had a lightweight API and demo UI to it. I don't know if you realize how much simpler you've already made fine-tuning for noobs (especially abstracting the ASR/sentence-cutting/dataset creation stuff)! People have been looking for "ElevenLabs-level quality" and you basically delivered it in one script! :)
Thanks and yeah, I wanted something simple! I think we all need that as otherwise its like swimming through tar if you dont have a computer science degree. Im not fixed to using Gradio by the way, Im just not a web coder and gradio was simple. Though I did make sure it worked in Gradio 3.5.2 because thats what text-gen uses, and I didnt want to be forcing people to go off installing other requirements.

Just on that subject, one thing I have aimed for is to ensure that I keep the requirements in line with text-gen-webui's. Partly for simplicity as most people wont have to worry about building more Python environments etc, but also because I don't want to be affecting other things running in text-gen-webui by over-writing their requirements. Likewise I guess, I don't want to make it an unwieldly beast that chews all your system memory up. So those have been partly guiding principles for me.... thus far.

Keep building out really cool demos/features on top of the API like the chunking playback and the narrator capability etc etc... stuff people want, that works out of the box, but also helps people learn how to build cool stuff.
Yeah Im good with this!

Decouple everything completely from text-generation-webui. Maybe this is actually done already? I know I've been running standalone for the last week. Integrating with SillyTavern will probably be a good test of this.
It fully runs on its own. I've even written a guide on running it as a standalone. Ive tested the install a few times and I often just run it as a standalone when debugging/testing as its quicker to start up!

Build a Microsoft Word plugin for that person above (kidding. mostly. 😆)
Well, I dont think that would be too hard to be honest.... but its not high on the list right at the moment 😆

Apologies I'm rambling here and its late.

I guess the summary is, I'm happy to have your involvement, ideas, criticisms. I think its important to keep the base functionality there. I don't want to turn it into a unwieldly behemoth that people struggle to run on their systems, so ultimately I've thought about making it more like a plug-in option thing e.g. you want a different TTS engine, there's a script/plugin for that (or whatever). Ive not quite got my head around it all yet as Its been busy for a month and my main goal was to make a solid base.

Not sure how you want to co-ordinate on anything. I figured I would throw this Dev area here, at least it makes it kind of easy. Open to other ideas, but if you want to stick on here, happy if you want to open individual topics rather than have one long chain.

Ill leave it there, as I'm writing chapter and verse here.!

7 replies

agnosticlines Jan 10, 2024

Hey, been following this for a little bit, there's a commit/PR that may interest you here: pytorch/pytorch#116630 as currently the mps backend was unsuitable due to no FFT operations :)

erew123 Jan 11, 2024
Maintainer Author

@agnosticlines pffff... AMD ROCm support seems to be in the same area at the moment too. Thanks for the heads up on that one. Its one of those things I want to enable, but I want to spend hours trying to make something work if its very problematic.

agnosticlines Jan 11, 2024

@agnosticlines pffff... AMD ROCm support seems to be in the same area at the moment too. Thanks for the heads up on that one. Its one of those things I want to enable, but I want to spend hours trying to make something work if its very problematic.

No worries, I suspect once the FFT stuff is merged it will be a case of s/cpu/mps/g, i don't think there's any other unimplemented features needed, unless I'm mistaken. There's someone actively working on the PR it seems, so won't need any work from you directly. I dunno when it'll make it through the PR stage and merged as it's open source volunteer work but development seems to be active :)

erew123 Jan 12, 2024
Maintainer Author

@agnosticlines well, Im happy to give it a go when its supposedly working (and I have access to a mac)

agnosticlines Feb 22, 2024

Hey just wanted to let you know that FFT now works in pytorch nightly :)

erew123 · 2024-01-02T10:56:31Z

erew123
Jan 2, 2024
Maintainer Author

OK... maybe I side tracked a little this morning.....

2 replies

rbruels Jan 2, 2024

🤯

erew123 Jan 3, 2024
Maintainer Author

Its in the /templates/word_addin/ folder if you want to give it a look/spin. A little setup needed. Im thinking that on streaming audio etc, when its processing and until it gets a reply, it probably would be good to put a lock in the code on the generate buttons, as trying to send an additional request while its streaming makes things complain.... so Im just leaving a note/thought here.

erew123 · 2024-01-04T21:14:00Z

erew123
Jan 4, 2024
Maintainer Author

@rbruels Hope you're keeping well. Just wanted to say, I think I've about got this nailed..... so you may not want to spend any time on it. I've got a few bits to finish yet, but its almost there.

It plays back very smoothly.. you would think you were streaming. Generates a few, caches them, then starts playing while the others are generated in the background. Queues them up, no gaps. You will be able to export the whole lot out, or if you want to re-gen any sections, you can, then export after.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trials of ideas. #38

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 7 comments 9 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Trials of ideas. #38

erew123 Jan 1, 2024 Maintainer

Replies: 7 comments · 9 replies

erew123 Jan 1, 2024 Maintainer Author

erew123 Jan 1, 2024 Maintainer Author

erew123 Jan 1, 2024 Maintainer Author

rbruels Jan 1, 2024

erew123 Jan 2, 2024 Maintainer Author

agnosticlines Jan 10, 2024

erew123 Jan 11, 2024 Maintainer Author

agnosticlines Jan 11, 2024

erew123 Jan 12, 2024 Maintainer Author

agnosticlines Feb 22, 2024

erew123 Jan 2, 2024 Maintainer Author

rbruels Jan 2, 2024

erew123 Jan 3, 2024 Maintainer Author

erew123 Jan 4, 2024 Maintainer Author

erew123
Jan 1, 2024
Maintainer

Replies: 7 comments 9 replies

erew123
Jan 1, 2024
Maintainer Author

erew123
Jan 1, 2024
Maintainer Author

erew123
Jan 1, 2024
Maintainer Author

rbruels
Jan 1, 2024

erew123
Jan 2, 2024
Maintainer Author

erew123 Jan 11, 2024
Maintainer Author

erew123 Jan 12, 2024
Maintainer Author

erew123
Jan 2, 2024
Maintainer Author

erew123 Jan 3, 2024
Maintainer Author

erew123
Jan 4, 2024
Maintainer Author