Enhancing Accessibility and Support for AllTalk Docker Integration #150

gboross · 2024-03-29T12:03:19Z

gboross
Mar 29, 2024

Hello,

Firstly, I apologize for any seemingly naive questions. Despite numerous efforts, the complexity of the AllTalk installation process has proven quite daunting. Is there a way to simplify and make this process more intuitive for users like us?

We're exploring the possibility of housing the entire setup within an updated Docker container, complemented by detailed, foolproof tutorials in both written and video formats, covering all aspects of installation and operation.

We currently utilize AllTalk with Docker on the cloud for STT real-time streaming. However, fine-tuning the model has become a primary goal, a task we've been unable to embark on due to not only unsuccessful standalone installations but also a lack of guidance on accessing certain functionalities showcased in the provided images.

Main issue with the model in Hungarian laguage:
For example, when using Coqui TTS in Hungarian, we encounter persistent issues such as the omission of the last letter in sentences, the erroneous reading of "dot" at sentence endings, excessive pauses, and peculiar noises. Although fine-tuning is frequently recommended as a solution, the root cause of these issues eludes us.

We recognize that both AllTalk and Coqui's original Google Colab-based Gradio UI version offer a largely automated fine-tuning procedure. Nonetheless, given the inaccuracies of tools like Whisper in recognizing Hungarian—where Whisper large_v3 shows improvement but remains flawed—we deem manual intervention necessary. This may involve reviewing and correcting transcriptions ourselves or generating and subsequently reviewing transcriptions elsewhere to ensure the accuracy and correctness of training materials.

Initially, we plan to compile our dataset with 10-15 hours of content to observe any potential improvements. If successful, we aim to create and train a more extensive dataset. Our methodology (same as on Coqui TTS dataset preparation readme) entails segmenting audio and preparing CSV and JSON files accordingly, with the goal of achieving at least 98% accuracy in our dataset.

Would it be feasible, after assembling a dataset with segmented audio files in a 'wavs' folder (e.g., wavs_00002000.wav, wavs_00002001.wav, wavs_00002002.wav), and accompanying CSV and JSON documents detailing segments such as:

wavs_00002000.wav|The cunning bait.
wavs_00002001.wav|Summer is great for its warmth, the cool and fish-filled river water.
wavs_00002002.wav|A fisherman requires nothing more for happiness.

And a JSON file listing entries like:

{"audio_file": "wavs_00002000.wav", "text": "The cunning bait."}
{"audio_file": "wavs_00002001.wav", "text": "Summer is great for its warmth, the cool and fish-filled river water."}
{"audio_file": "wavs_00002002.wav", "text": "A fisherman requires nothing more for happiness."}

...to be uploaded via a web UI interface launched from Docker? Following the upload, a security check would assess the audio quality, noise, etc., removing any unsuitable files from the dataset and updating the CSV and JSON accordingly before commencing training. This setup should smoothly handle large datasets (up to 10 hours), facilitating an easy integration (one-click solution) with AllTalk for streaming TTS and other applications, fully compatible with DeepSpeed.

Additionally, I'm curious if you engage in freelance projects or if there's a mechanism to financially support your team to expedite the development of this Docker setup within 1-2 weeks?

Ideally, starting Docker, accessing the localhost port, and managing everything through a web UI would be simple and intuitive, eliminating the need for manual execution of various Python scripts. A cloud-based streaming Colab notebook might also offer a viable alternative.

Crucially, we should be able to retrain a model multiple times, with the training phase yielding insights and analyses on the model's progress. Facilitating the selection of which base model to train, complete with automatic model download and management within the Docker container, is also a key requirement.

Are you open to this proposal? If so, please could you advise us on how and where we might contribute financially to make this vision a reality as soon as possible?

Thank you in advance for your assistance. I would greatly appreciate a personal contact for further discussions.

Thank you.

erew123 · 2024-03-29T14:56:30Z

erew123
Mar 29, 2024
Maintainer

Hi @gboross

That's quite a long mail you have sent, so I will try to break down my reply into sections.

I would like to say that within the last 24 hours, I have released an updated version of AllTalk, refreshed the documentation and worked on simplifying the installation requirements files (ill explain that later on).

Installation process (Standalone, not Docker or Google Colab)

If you are installing a standalone version on your own computer then Quick setup - Standalone Installation should be pretty simple and is also shown in this video

Installation Google Colab

Prior to today, google colab installations have been impossible to install due to requirements issues. However, just before your email arrived, I was co-incidentally looking at google colab to see if I could get it working. I had some successes and some issues. One of the successes looks like it might be Finetuning, though I have quite a bit to look through to understand any quirks, issues and build up a colab. If I get this working, I'm happy to provide you a link to the colab.

Installation Docker

The Docker build that exists, is based on an older version of AllTalk and was not built by myself. I personally am not an expert on Docker and have had my frustrations with it. Saying that, I did build an automated process to build/update docker images every time Github was updated, however, it started failing due to requirements installation issues, hence the build being out of date.

Though, with the release I sent out yesterday, my hope, is that now I have managed to clean up the installation requirements files, and with my work on Google colab, I will be able to get a working version of the installation routine for Docker.

I know there will be issues that exist because of things like tunnelling redirects to access AllTalk. e.g. internally, on the docker/colab, AllTalk thinks it runs on 127.0.0.1, but externally its running on x.x.x.x IP address, which isn't an issue as far as some portions of the interface go, but some bits of code don't like it. As I say, this is yet to be explored by myself.

Issues with the with the model in Hungarian language

I'm not an XTTS model expert and I also don't speak all the languages that the model is capable of. My anecdotal understanding and also experiences with the XTTS model show that extra punctuation can cause issues, outside of your standard comma, semi-comma, period etc. I detail this a little here TTS Generation Issues & Questions. I can say that the more I filtered out characters like asterisks *, double spaces , carriage returns etc, the better the generated TTS output was (less strange noises being produced). So with AllTalk (depending on how you are interacting with it) I filter out quite a few of these "other" items.

As for it pronouncing "dot", by which I assume you mean the period . at the end of sentences, that's a new one on me and perhaps its even specific to the text filtering the Coqui TTS scripts does or doesn't perform for the Hungarian language. I couldn't personally say, as I don't speak/read Hungarian or know the Coqui TTS scripts that intimately.

What I can say though is its possible that the XTTS v2.0.3 may be better with Hungarian (check here for details of using that) TTS Generation Issues & Questions and that broadly speaking, the better the audio quality samples, along with finetuning does seem to improve the quality of the reproduced voice along with less skips, repeats and mispronunciations. Though you may wish to hunt the Coqui discussions area https://github.com/coqui-ai/TTS/discussions

CSV & JSON files

Currently the finetuning (Step 1) generates a Eval file and a Training file, both in CSV format, which are stored in \alltalk_tts\finetune\tmp-trn.

There is also a lang.txt file, which is literally the 2x digit country code e.g.

The CSV files are broken into the format audio_file|text|speaker_name, very similar to how you describe them above.

Its fully possible to setup your own CSV file training set, copy them onto your disk somewhere and manually edit the path to these files in Step 2, and completely bypass Step 1 (whisper).

However, I assume your reference to these is that you are wanting:

To have a "upload" button for each of the CSV files, that is available in the web page, that will just put them on the local disk somewhere, then set them to be used for training?

And also

You mention this JSON file. Are you proposing this as an alternative to the CSV files, is that what you mean? If not you will have to explain that further.

Security check would assess the audio quality, noise,

No idea how easy this is to do. I think I know what you mean, however you are suggesting something like what on this link that subjectively assess audio quality of speech?

As I say, I've never done anything on this area, so have no idea how successful it would/could be or if its even capable of multilingual speech.

Freelance projects etc

I made AllTalk as a fun project that I do in my spare time as/when I want. I'm not a business and there's no team to speak of. What started out as a fun little project to improve memory management on my GPU when generating TTS with LLM's, turned into other people wanting to use it, support questions, writing documentation, doing a few bits of code for "any chance you could make it do this", more support requests and more documentation, so here we are now! Its been quite the journey and support alone has been consuming much of my free time.

Currently, I do have a link to a Ko-fi, though obviously I'm not currently running this as a business as I cannot guarantee to allocate my time to coding/development (unless it turns out I can make a living from that one day).

So, I very much appreciate your offer, but, I would not be able to commit to a timeline to give you a fully working environment with code.

That's not to say I wont look at & consider your requests.

I imagine that if I can get the colab working, then we are 60-70% of the way towards your requirements, and as I say, I think resolving that will probably unlock the docker build issues as the backend of Docker and Colab seem pretty similar in the issues they will present.

I think have covered everything there? Hopefully I have understood everything you asked correctly.

Thanks

1 reply

erew123 Mar 29, 2024
Maintainer

As a side note, I had a quick shot at making some code that would check and convert a JSON:

[
    {
        "audio_file": "wavs_00002000.wav",
        "text": "The cunning bait."
    },
    {
        "audio_file": "wavs_00002001.wav",
        "text": "Summer is great for its warmth, the cool and fish-filled river water."
    },
    {
        "audio_file": "wavs_00002002.wav",
        "text": "A fisherman requires nothing more for happiness."
    }
]

to a CSV (with some basic error checking)

I guess my theory is that the existing scripts already work ok csv files. So, rather than mess about with all the existing code, the simplest method would be to convert a JSON file in the format you mentioned, to the CSV file and error check it along the way.

As you have to have the format audio_file|text|speaker_name, the script will accept input names for speaker_name to build the CSV.

I also built in some basic validation for CSV files....

So, what Im getting at, as far as uploading JSON/CSV files through the interface, it should be possible to do relatively simply.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancing Accessibility and Support for AllTalk Docker Integration #150

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Enhancing Accessibility and Support for AllTalk Docker Integration #150

gboross Mar 29, 2024

Replies: 1 comment · 1 reply

erew123 Mar 29, 2024 Maintainer

Installation process (Standalone, not Docker or Google Colab)

Installation Google Colab

Installation Docker

Issues with the with the model in Hungarian language

CSV & JSON files

Security check would assess the audio quality, noise,

Freelance projects etc

erew123 Mar 29, 2024 Maintainer

gboross
Mar 29, 2024

Replies: 1 comment 1 reply

erew123
Mar 29, 2024
Maintainer

erew123 Mar 29, 2024
Maintainer