Hi @gboross That's quite a long mail you have sent, so I will try to break my reply down into sections. I would like to say that within the last 24 hours, I have released an updated version of AllTalk, refreshed the documentation and worked on simplifying the installation requirements files (I'll explain that later on).

**Installation process (Standalone, not Docker or Google Colab)**

If you are installing a standalone version on your own computer, then Quick setup - Standalone Installation should be pretty simple, and is also shown in this video.

**Installation: Google Colab**

Prior to today, Google Colab installations have been impossible due to requirements issues. However, just before your email arrived, I was coincidentally looking at Google Colab to see if I could get it working. I had some successes and some issues. One of the successes looks like it might be finetuning, though I have quite a bit to look through to understand any quirks and issues, and to build up a Colab. If I get this working, I'm happy to provide you a link to the Colab.

**Installation: Docker**

The Docker build that exists is based on an older version of AllTalk and was not built by myself. I am personally not an expert on Docker and have had my frustrations with it. That said, I did build an automated process to build/update Docker images every time GitHub was updated; however, it started failing due to requirements installation issues, hence the build being out of date. With the release I sent out yesterday, my hope is that, now I have managed to clean up the installation requirements files, and with my work on Google Colab, I will be able to get a working version of the installation routine for Docker. I know there will be issues because of things like tunnelling redirects to access AllTalk. E.g. internally, on Docker/Colab, AllTalk thinks it runs on 127.0.0.1, but externally it's running on an x.x.x.x IP address. That isn't an issue for some portions of the interface, but some bits of code don't like it. As I say, this is yet to be explored by myself.

**Issues with the model in the Hungarian language**

I'm not an XTTS model expert and I don't speak all the languages the model is capable of. My anecdotal understanding, and also my experience with the XTTS model, is that extra punctuation can cause issues, outside of your standard comma, semicolon, period etc. I detail this a little here: TTS Generation Issues & Questions. I can say that the more I filtered out characters like asterisks, the fewer problems I saw. As for it pronouncing "dot", by which I assume you mean the period: what I can say is that it's possible that XTTS v2.0.3 may be better with Hungarian (check here for details of using that: TTS Generation Issues & Questions), and that, broadly speaking, better-quality audio samples, along with finetuning, do seem to improve the quality of the reproduced voice, with fewer skips, repeats and mispronunciations. You may also wish to hunt through the Coqui discussions area: https://github.com/coqui-ai/TTS/discussions

**CSV & JSON files**

Currently the finetuning (Step 1) generates an Eval file and a Training file, both in CSV format, which are stored in the finetuning folder. There is also a JSON file. The CSV files are broken into a simple delimited format. It's fully possible to set up your own CSV file training set, copy the files onto your disk somewhere, manually edit the path to those files in Step 2, and completely bypass Step 1 (Whisper). However, I assume your reference to these is that you are wanting:

> ...to be uploaded via a web UI interface launched from Docker?
And also
> a security check would assess the audio quality, noise, etc.

No idea how easy this is to do. I think I know what you mean; are you suggesting something like what's on this link, which subjectively assesses the audio quality of speech? As I say, I've never done anything in this area, so I have no idea how successful it would be, or whether it's even capable of handling multilingual speech.

**Freelance projects etc.**

I made AllTalk as a fun project that I do in my spare time as/when I want. I'm not a business and there's no team to speak of. What started out as a fun little project to improve memory management on my GPU when generating TTS with LLMs turned into other people wanting to use it, support questions, writing documentation, doing a few bits of code for "any chance you could make it do this", more support requests and more documentation, so here we are now! It's been quite the journey, and support alone has been consuming much of my free time. Currently, I do have a link to a Ko-fi, though obviously I'm not running this as a business, as I cannot guarantee to allocate my time to coding/development (unless it turns out I can make a living from that one day). So, I very much appreciate your offer, but I would not be able to commit to a timeline to give you a fully working environment with code. That's not to say I won't look at and consider your requests. I imagine that if I can get the Colab working, then we are 60-70% of the way towards your requirements, and as I say, I think resolving that will probably unlock the Docker build issues, as the backends of Docker and Colab seem pretty similar in the issues they will present.

I think I have covered everything there? Hopefully I have understood everything you asked correctly. Thanks
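Since filtering out characters like asterisks seemed to reduce XTTS glitches, a pre-filter along these lines could be run over text before generation. This is only a sketch: the exact character set and normalisations are assumptions that would need tuning per language.

```python
import re

def clean_for_tts(text: str) -> str:
    """Strip characters that commonly trip up XTTS, keeping standard
    sentence punctuation (comma, period, question mark, etc.).
    The character set below is an assumption -- tune it per language."""
    # Remove markdown-style emphasis and other decoration characters
    text = re.sub(r"[*_#`~^|<>\[\]{}]", "", text)
    # Collapse runs of dots / ellipses down to a single period
    text = re.sub(r"\.{2,}", ".", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_for_tts("Hello *world*... this is **bold** text!"))
```

A filter like this would sit between the LLM output and the TTS call, so the model never sees the decoration characters at all.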
Hello,
Firstly, I apologize for any seemingly naive questions. Despite numerous efforts, the complexity of the AllTalk installation process has proven quite daunting. Is there a way to simplify and make this process more intuitive for users like us?
We're exploring the possibility of housing the entire setup within an updated Docker container, complemented by detailed, foolproof tutorials in both written and video formats, covering all aspects of installation and operation.
We currently utilize AllTalk with Docker in the cloud for real-time TTS streaming. However, fine-tuning the model has become a primary goal, a task we've been unable to embark on due not only to unsuccessful standalone installations but also to a lack of guidance on accessing certain functionalities showcased in the provided images.
Main issue with the model in the Hungarian language:
For example, when using Coqui TTS in Hungarian, we encounter persistent issues such as the omission of the last letter in sentences, the erroneous reading of "dot" at sentence endings, excessive pauses, and peculiar noises. Although fine-tuning is frequently recommended as a solution, the root cause of these issues eludes us.
We recognize that both AllTalk and Coqui's original Google Colab-based Gradio UI version offer a largely automated fine-tuning procedure. Nonetheless, given the inaccuracies of tools like Whisper in recognizing Hungarian—where Whisper large_v3 shows improvement but remains flawed—we deem manual intervention necessary. This may involve reviewing and correcting transcriptions ourselves or generating and subsequently reviewing transcriptions elsewhere to ensure the accuracy and correctness of training materials.
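One lightweight way to do that manual pass is to keep the hand-corrected transcripts in a separate mapping and overlay them onto the Whisper-generated metadata. This sketch assumes pipe-delimited `audio_file|text` rows, which is only a guess at what Step 1 writes; adapt it to the actual file layout.

```python
import csv

def apply_corrections(metadata_in, corrections, metadata_out):
    """Overlay human-corrected transcripts onto Whisper output.

    Both files are assumed to be pipe-delimited 'audio_file|text' rows
    (an assumption -- check what the finetuning step actually emits).
    'corrections' maps an audio filename to its corrected text."""
    with open(metadata_in, newline="", encoding="utf-8") as fin, \
         open(metadata_out, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="|")
        writer = csv.writer(fout, delimiter="|")
        for row in reader:
            if len(row) >= 2 and row[0] in corrections:
                row[1] = corrections[row[0]]  # replace the auto transcript
            writer.writerow(row)
```

Keeping corrections separate from the generated file means a re-run of Whisper never silently discards the manual review work.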
Initially, we plan to compile our dataset with 10-15 hours of content to observe any potential improvements. If successful, we aim to create and train a more extensive dataset. Our methodology (the same as in the Coqui TTS dataset preparation readme) entails segmenting audio and preparing CSV and JSON files accordingly, with the goal of achieving at least 98% accuracy in our dataset.
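For reference, a hand-built training CSV along those lines might be written as below. The pipe-delimited `audio_file|text|speaker_name` layout follows the LJSpeech-style convention described in the Coqui dataset docs, but the header row and the speaker column value are assumptions to verify against the finetuning scripts actually in use.

```python
import csv

def write_metadata(pairs, out_path):
    """Write (wav_filename, transcript) pairs as a pipe-delimited
    metadata CSV. The 'audio_file|text|speaker_name' layout and the
    fixed 'speaker' value are assumptions -- XTTS finetuning scripts
    vary, so check against the tool you are feeding this into."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text", "speaker_name"])
        for wav, text in pairs:
            writer.writerow([f"wavs/{wav}", text.strip(), "speaker"])
```

A file produced this way can then be pointed at directly in Step 2, bypassing the Whisper step entirely.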
Would it be feasible, after assembling a dataset with segmented audio files in a 'wavs' folder (e.g., wavs_00002000.wav, wavs_00002001.wav, wavs_00002002.wav), and accompanying CSV and JSON documents detailing segments such as:
And a JSON file listing entries like:
...to be uploaded via a web UI interface launched from Docker? Following the upload, a security check would assess the audio quality, noise, etc., removing any unsuitable files from the dataset and updating the CSV and JSON accordingly before commencing training. This setup should smoothly handle large datasets (up to 10 hours), facilitating an easy integration (one-click solution) with AllTalk for streaming TTS and other applications, fully compatible with DeepSpeed.
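A full quality gate of the kind described would need a learned speech-quality predictor, but a crude first pass over 16-bit PCM files can at least reject near-silent or clipped clips before they reach the dataset. The thresholds below are illustrative assumptions, not calibrated values.

```python
import wave, array, math

def screen_wav(path, min_rms=500, clip_level=32600):
    """Crude quality gate for 16-bit PCM WAV files: reject near-silent
    or clipped clips. Thresholds are assumptions -- a real pipeline
    would use a learned quality/MOS predictor instead."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            return False, "expected 16-bit PCM"
        samples = array.array("h", w.readframes(w.getnframes()))
    if not samples:
        return False, "empty file"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    if rms < min_rms:
        return False, "too quiet / mostly silence"
    if peak >= clip_level:
        return False, "likely clipping"
    return True, "ok"
```

Files failing the check could be dropped from the `wavs` folder, with the CSV and JSON regenerated afterwards so the three stay in sync.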
Additionally, I'm curious if you engage in freelance projects or if there's a mechanism to financially support your team to expedite the development of this Docker setup within 1-2 weeks?
Ideally, starting Docker, accessing the localhost port, and managing everything through a web UI would be simple and intuitive, eliminating the need for manual execution of various Python scripts. A cloud-based streaming Colab notebook might also offer a viable alternative.
Crucially, we should be able to retrain a model multiple times, with the training phase yielding insights and analyses on the model's progress. Facilitating the selection of which base model to train, complete with automatic model download and management within the Docker container, is also a key requirement.
Are you open to this proposal? If so, please could you advise us on how and where we might contribute financially to make this vision a reality as soon as possible?
Thank you in advance for your assistance. I would greatly appreciate a personal contact for further discussions.
Thank you.