Improve audio splitting in dataset generation #419

Draft · wants to merge 4 commits into alltalkbeta

Conversation

@Yohrog commented Nov 24, 2024

This still has a lot of things that I changed during testing and might revert. These are the settings I used to get it to work locally.

Feel free to comment on any changes and I'll try to explain them.

@erew123 (Owner) commented Nov 24, 2024

@Yohrog will have to look at this tomorrow. It's both late here and I'm a little caught up dealing with people on other support bits! Thanks for sending it over though, will get back to you later.

@Yohrog (Author) commented Nov 24, 2024

@erew123 No worries. I mean, there's no rush with this anyway, and as I said, there are still some changes I'm unhappy with. Take your time and I'll update it over time. Once I'm happy, I'll mark it as ready. No need to review it before that if you don't find the time :)

@Yohrog (Author) commented Nov 25, 2024

@erew123 A quick question regarding the final saved files. During dataset generation, just before we save the snippets, we separate out the sentences and trim them, resulting in a lot of files that are shorter than two seconds. Is that sentence splitting and trimming of the files really necessary? It results in a lot of lost training data, and in the functions before that, we go the extra mile to try to EXTEND those segments past two seconds. It seems like we're doing and undoing the same thing.

Trimming happens here:

def process_transcription_result(
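
(For illustration, a minimal, hypothetical sketch of the split-then-trim pattern being questioned here; the function, bounds, and thresholds are stand-ins, not the actual finetune code.)

```python
MIN_LEN_S = 2.0  # minimum clip length the earlier merge step tries to guarantee

def split_and_trim(sentence_bounds_s):
    """sentence_bounds_s: (start, end) pairs inside one segment, in seconds."""
    kept, dropped = [], []
    for start, end in sentence_bounds_s:
        clip_len = end - start  # trimming each clip to its sentence edges
        (kept if clip_len >= MIN_LEN_S else dropped).append(round(clip_len, 2))
    return kept, dropped

# A 6 s segment that the earlier merge step built to satisfy the minimum...
kept, dropped = split_and_trim([(0.2, 1.8), (2.0, 3.1), (3.4, 5.9)])
print(kept)     # [2.5]       -- only one sentence survives as a file
print(dropped)  # [1.6, 1.1]  -- the rest is discarded training data
```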

@erew123 (Owner) commented Nov 26, 2024

Hi @Yohrog Sorry it's taken a while to get back to you; I've just been busy with bug fixing, support tickets, etc. Every time I tried to make a moment to reply, there would be another email/bug/something going on.

So the idea was that if something got chopped up too small, you would whack it back together with something else, but then of course you may need to find the boundaries/edges of the newly merged audio. Which I guess is what you are getting at: that process can chop away data there. But that was the principle of it.

So I guess we would say that the original code did the following:

  • Find speech segments
  • Try to extend short segments
  • Then split them again on sentences (where possible)
  • Discard the result if what was left was too short

And I think your proposed code does this (I don't know where you are at now with it, of course):

  • Find speech segments
  • Merge/extend segments as appropriate
  • Keep sentence information without physically splitting the audio

And you're storing extra information about the sentence boundaries throughout the whole process? That's my best guess/rough take at what you're proposing. Sorry if I've got that wrong; I think my head is spinning with all sorts of code after the last 16 hours :/
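
(A hedged sketch of what "keep sentence information without physically splitting the audio" could look like; the `Segment` class and buffer value are illustrative assumptions, not the PR's implementation.)

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_s: float   # segment position in the source audio, in seconds
    end_s: float
    sentences: list = field(default_factory=list)  # (start_s, end_s, text)

seg = Segment(start_s=12.0, end_s=19.5)
seg.sentences.append((12.2, 15.0, "First sentence."))
seg.sentences.append((15.3, 19.2, "Second sentence."))

# Trim only the outer silence, to the first/last sentence edge plus a buffer,
# rather than writing each sentence out as its own (often too-short) file.
BUFFER_S = 0.2  # hypothetical buffer size
trim_start = max(seg.start_s, seg.sentences[0][0] - BUFFER_S)
trim_end = min(seg.end_s, seg.sentences[-1][1] + BUFFER_S)
print(round(trim_start, 2), round(trim_end, 2))  # 12.0 19.4
```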

@erew123 (Owner) commented Nov 30, 2024

@Yohrog Just to be doubly sure of things, I have been through all the Coqui training scripts with two AIs (too much for me to comprehend and pull together in my head).

TLDR 1: 3 to 11 seconds is a good audio clip length and will pass through ALL the Coqui scripts, Huggingface scripts, etc.
TLDR 2: If you want to use audio longer than 11.6 seconds, you should increase this setting on Step 2 Training (or we could set it higher by default):

[screenshot: the max audio length setting on Step 2 Training]

As such, a 3-second minimum to an 11-second maximum would appear to be a very good spot to aim for with audio clip size.

Here are some snippets of the AI responses:


Based on the provided files and configurations, here’s the detailed analysis and conclusion regarding the minimum and maximum lengths of audio clips for training and how the model handles clips longer than its defined limits:


Minimum and Maximum Audio Clip Length for Training

  1. Minimum Audio Length:

    • Defined in the dataset preparation:
      • min_audio_len is often set explicitly in the TTSDataset class or through configurations like GPTArgs.
    • Value: The minimum length is 3 seconds (defined as min_conditioning_length=66150 at 22050 Hz).
  2. Maximum Audio Length:

    • Multiple caps exist:
      • max_conditioning_length=132300 (6 seconds at 22050 Hz) governs the audio used for conditioning.
      • max_wav_length=255995 (~11.6 seconds) sets the absolute hard cap for raw audio.
    • Value:
      • Audio clips up to ~11.6 seconds can be included in training.
      • However, only up to 6 seconds will be used for conditioning at any given time.

Handling Audio Longer Than 11.6 Seconds

  1. Training-Level Constraints:

    • If an audio clip exceeds the max_wav_length (~11.6 seconds), it will likely be filtered out by dataset preprocessing (filter_by_length in TTSDataset).
    • Thus, audio exceeding 11.6 seconds won’t even be included in the dataset for training.
  2. Conditioning-Level Constraints:

    • During training, the model only considers up to max_conditioning_length=6 seconds of audio for conditioning.
    • If an audio clip is longer than 6 seconds (but less than 11.6 seconds), it will be chopped into segments of up to 6 seconds for use in each epoch.

Epoch-Wise Processing of Longer Audio

  • If an audio clip is longer than 6 seconds (but ≤11.6 seconds):

    • The model extracts segments of 6 seconds (max_conditioning_length) for conditioning.
    • Each epoch might use a different segment of the same audio clip if implemented to randomize segments.
  • If an audio clip is longer than 11.6 seconds:

    • The dataset preparation (TTSDataset) will reject the audio during filtering (filter_by_length).
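
(An illustrative sketch of the 6-second conditioning crop described above, using the sample counts quoted in this comment; this is an assumption about the behaviour, not the actual Coqui code.)

```python
import random

SR = 22050
MAX_COND = 132300  # 6 s at 22050 Hz (max_conditioning_length)

def conditioning_crop(wav):
    """Take a 6 s window from a longer clip; a new offset each call/epoch."""
    if len(wav) <= MAX_COND:
        return wav
    offset = random.randint(0, len(wav) - MAX_COND)
    return wav[offset:offset + MAX_COND]

wav = [0.0] * (9 * SR)                   # a 9-second clip
print(len(conditioning_crop(wav)) / SR)  # 6.0
```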

Final Constraints

  • Audio Training Constraints:
    • Minimum audio length: 3 seconds.
    • Maximum audio length: ~11.6 seconds.
  • Conditioning Length:
    • Fixed to 6 seconds per segment.
  • Excess Audio Handling:
    • Audio longer than 6 seconds is chopped into 6-second segments during training.
    • Audio longer than ~11.6 seconds is excluded entirely.
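
(A minimal sketch of the resulting length gate, using the constants quoted above; an assumed stand-in for filter_by_length-style preprocessing, not the Coqui source.)

```python
SR = 22050
MIN_WAV = 66150   # 3 s    (min_conditioning_length)
MAX_WAV = 255995  # ~11.6 s (max_wav_length)

def keep_for_training(num_samples: int) -> bool:
    """True if a clip survives the assumed length filter."""
    return MIN_WAV <= num_samples <= MAX_WAV

for secs in (2.0, 3.0, 8.0, 12.0):
    print(secs, keep_for_training(int(secs * SR)))
# 2.0 False / 3.0 True / 8.0 True / 12.0 False
```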

Max audio length can be overridden with this setting ... obviously at the expense of memory. Finetuning is set at 11.6 seconds, as that is the Coqui-suggested default...

[screenshot: the max audio length training setting]

Re: will it skip anything over 11.6 seconds, or use only the first 11.6 seconds of a longer file?

When I say, "Audio longer than ~11.6 seconds is excluded entirely," it means that the entire file would likely be skipped during training. This happens because the dataset preprocessing pipeline (TTSDataset or similar) explicitly filters out audio clips that exceed the max_audio_len or equivalent threshold.

Here’s why:

  1. Preprocessing Filters Out Long Clips:

    • The TTSDataset class, or similar preprocessing logic, includes parameters like max_audio_len or max_wav_length.
    • If an audio file exceeds this threshold, it is ignored during dataset creation. The remaining portion of the audio (beyond 11.6 seconds) would not even be considered for segmentation or further processing.
  2. Segmenting Audio into Chunks:

    • If the logic allowed segmenting long clips into smaller chunks, a 20-second file, for example, could be broken into multiple 6-second or shorter segments for training.
    • However, in the current configuration (based on the provided code and descriptions), this does not appear to be the case—the file is entirely skipped if it exceeds the maximum allowed length.
  3. Behavior to Confirm:

    • If you want the model to use the first portion (e.g., the first 11.6 seconds) and disregard the rest of the clip, you would need to explicitly modify the dataset preprocessing logic to allow truncation rather than skipping.

How to Test or Modify This Behavior

  1. Test Existing Behavior:

    • Include an audio file longer than 11.6 seconds in your dataset and observe:
      • Is the file included at all in training?
      • Are portions of the file (e.g., the first 11.6 seconds) used, or is the entire file skipped?
  2. Modify Preprocessing to Allow Truncation:

    • In the TTSDataset or equivalent class, instead of skipping files longer than max_audio_len, truncate them to the maximum allowed length.
    • For example, modify the logic where lengths are checked (e.g., in filter_by_length) to retain and truncate files instead of discarding them.
  3. Reconfigure Max Lengths:

    • Increase max_audio_len or max_wav_length to allow longer files and test whether they are processed as expected.

By default, the current implementation appears to skip files entirely if they exceed the length cap. Modifications would be required to enable partial usage of longer files. Let me know if you’d like help pinpointing where to make these changes!
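
(A hedged sketch of the truncate-instead-of-skip modification suggested above; names and structure are illustrative, and the real change would live in the dataset's length-filtering logic.)

```python
MAX_WAV = 255995  # ~11.6 s at 22050 Hz

def filter_or_truncate(samples, truncate=True):
    """Keep short clips; either truncate or skip (None) over-long ones."""
    if len(samples) <= MAX_WAV:
        return samples
    return samples[:MAX_WAV] if truncate else None

long_clip = [0.0] * 441000  # a 20-second file
print(filter_or_truncate(long_clip, truncate=False))  # None (skipped, the assumed default)
print(len(filter_or_truncate(long_clip)) / 22050)     # ~11.61 (truncated head kept)
```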

@Yohrog (Author) commented Dec 1, 2024

Hi @erew123,
sorry for going silent for a week. I've been busy with other stuff, unfortunately.

I'm back on it now and will finish it today (and test for bugs on my end).
Thanks for all the input! I'm gonna make sure the following behavior, based on your comments, is included.

  • Audio files will be split at pauses in speech longer than 400 ms, up to max_audio_length. I'm gonna pass the parameter to Silero VAD, which will then try to split at 100 ms pauses if the length exceeds the limit. If there are no pauses at all, it will aggressively split the segment at the limit.
  • Whisper transcriptions will be used to find the beginnings and ends of sentences. However, the sentence chunks will no longer be split into separate training files, since that causes too many short audio segments. Instead, segments will be trimmed to ensure they begin and end with complete sentences, and silence on both ends will be reduced down to a buffer size.
  • Segments will be saved at whatever length they end up being, since previous checks guarantee it is between min_audio_len and max_audio_len.
  • There was also a bug in the old code that would overwrite previously saved segments, effectively deleting data. Fixed now.
  • I'll test all of this on a finetune myself and mark the PR as ready once I'm sure there are no new bugs on my end. Before merging, it would be great if you could test it on your end as well, just to make sure.

If you'd like anything different let me know.
I'll spend the rest of the evening testing and fixing. The changes mentioned here are only local on my machine for now. I'll get the commit ready by the end of the night.
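
(A rough sketch of the splitting policy described in the list above; my paraphrase with hypothetical names, not the PR's actual code.)

```python
def split_segment(seg_len_s, pauses_400ms, pauses_100ms, max_len_s=11.6):
    """pauses_*: candidate split points in seconds from the segment start."""
    if seg_len_s <= max_len_s:
        return [seg_len_s]
    for candidates in (pauses_400ms, pauses_100ms):  # prefer the longer pauses
        cut = max((p for p in candidates if 0 < p <= max_len_s), default=None)
        if cut is not None:
            rest_400 = [p - cut for p in pauses_400ms if p > cut]
            rest_100 = [p - cut for p in pauses_100ms if p > cut]
            return [cut] + split_segment(seg_len_s - cut, rest_400, rest_100, max_len_s)
    # no usable pause at all: split aggressively at the limit
    return [max_len_s] + split_segment(seg_len_s - max_len_s, [], [], max_len_s)

# A 20 s segment with one 400 ms pause at 9.5 s and 100 ms pauses elsewhere:
print(split_segment(20.0, [9.5], [4.0, 14.0]))  # [9.5, 10.5]
```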

@erew123 (Owner) commented Dec 1, 2024

Hi @Yohrog Thanks for your reply and I completely understand! We all have life to get on with and I certainly have my own fair share of life going on! Thanks so much though, what you have managed to achieve sounds awesome and I look forward to testing it! And of course, no rush! I've got plenty to be on with myself, but I will test it whenever you send it over!

The only thing I did for someone in the last few days was add "hi" (Hindi) as an option, as the 2.0.3 model supports Hindi... but Whisper didn't appear to work! Not sure if you want to include "hi" in your code...

#424 (comment)

That's my conversation with them, and this is Whisper saying it supports Hindi: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L28

And just as I type this... it hits me... I bet it's Silero that doesn't support certain languages!! Damn, haha: https://github.com/snakers4/silero-models?tab=readme-ov-file#further-reading

I guess I will have to put a note in the help and auto-disable Silero if the language isn't en, ru, or zh.
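
(A sketch of the auto-disable idea; the supported-language set follows the comment above and is hypothetically hard-coded here, not AllTalk's actual code. The log prefix mirrors the console output shown later in this thread.)

```python
SILERO_LANGS = {"en", "ru", "zh"}  # assumed supported set, per the link above

def use_silero_vad(lang: str) -> bool:
    """Return False (and warn) when Silero shouldn't be used for `lang`."""
    if lang not in SILERO_LANGS:
        print(f"[FINETUNE] [WARNING] Silero disabled: '{lang}' not supported")
        return False
    return True

use_silero_vad("hi")  # Hindi -> Silero skipped, fall back to plain splitting
```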

As I say, no panic! Enjoy your weekend and it gets here when it gets here :)

Thanks again!

@erew123 (Owner) commented Dec 4, 2024

Hi @Yohrog Sorry it's taken me a day or so to get back to you! Thanks for working on this; I know how much of a pain it is to have to make a code change, run a dataset, look at it, go back to the code, and repeat. Trust me, I'm building the RVC training at the moment and that's not been a happy time, hah.

Anyway, I used your build from the last update: [Fix negative segment start] [bde6e7]. Used default settings, but also tried with a custom project name, just in case.

[screenshot: dataset generation settings]

Unfortunately, that threw a bug. I'm guessing maybe you uploaded an in-progress version? I did re-download the file, just to be double sure I hadn't downloaded the wrong one.

[FINETUNE] [INFO] Initializing output directory: D:\testingalltalk\alltalk_tts\finetune\tmp-trn
[FINETUNE] [MODEL] Using device: cuda
[FINETUNE] [MODEL] Loading Whisper model: large-v3
[FINETUNE] [MODEL] Using mixed precision
[FINETUNE] [MODEL] Initializing Silero VAD
Using cache found in C:\Users\useraccount/.cache\torch\hub\snakers4_silero-vad_master
[FINETUNE] [INFO] Updated language to: en
[FINETUNE] [INFO] Processing: interview
Traceback (most recent call last):
  File "D:\testingalltalk\alltalk_tts\finetunenew.py", line 3827, in preprocess_dataset
    pd_train_meta, pd_eval_meta, pd_audio_total_size = format_audio_list(
                                                       ^^^^^^^^^^^^^^^^^^
  File "D:\testingalltalk\alltalk_tts\finetunenew.py", line 1241, in format_audio_list
    process_transcription_result(
TypeError: process_transcription_result() takes 12 positional arguments but 14 were given
Traceback (most recent call last):
  File "D:\testingalltalk\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\testingalltalk\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\testingalltalk\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\blocks.py", line 1945, in process_api
    data = await self.postprocess_data(block_fn, result["prediction"], state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\testingalltalk\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\blocks.py", line 1717, in postprocess_data
    self.validate_outputs(block_fn, predictions)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\testingalltalk\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\blocks.py", line 1691, in validate_outputs
    raise ValueError(
ValueError: An event handler (preprocess_dataset) didn't receive enough output values (needed: 6, received: 3).
Wanted outputs:
    [<gradio.components.label.Label object at 0x0000026CC26A34D0>, <gradio.components.textbox.Textbox object at 0x0000026CC26F9D10>, <gradio.components.textbox.Textbox object at 0x0000026CC2720790>, <gradio.components.textbox.Textbox object at 0x0000026CBED2EE10>, <gradio.components.textbox.Textbox object at 0x0000026CC26893D0>, <gradio.components.textbox.Textbox object at 0x0000026CC3A1A4D0>]
Received outputs:
    ["The data processing was interrupted due to an error!! Please check the console to verify the full error message!
 Error summary: Traceback (most recent call last):
  File "D:\testingalltalk\alltalk_tts\finetunenew.py", line 3827, in preprocess_dataset
    pd_train_meta, pd_eval_meta, pd_audio_total_size = format_audio_list(
                                                       ^^^^^^^^^^^^^^^^^^
  File "D:\testingalltalk\alltalk_tts\finetunenew.py", line 1241, in format_audio_list
    process_transcription_result(
TypeError: process_transcription_result() takes 12 positional arguments but 14 were given
", "", ""]

Happy to take a look if you're too busy.

Let me know.

Thanks

@Yohrog (Author) commented Dec 4, 2024

@erew123 Yeah, it started throwing more bugs than I would've liked, and I'm still weeding them out. Thanks for testing it; I appreciate the stack trace!
It is still a WIP build atm. As I said, as long as the PR is marked as a draft, you don't really need to bother testing it, since the bugs are still endless. I'll be back working on it today, but I don't wanna make promises I can't keep, so I hope I'll get the problems figured out, and I'll mention you in another comment here once it's ready for testing.

@erew123 (Owner) commented Dec 4, 2024

@Yohrog I have to travel for 5-8 days anyway, so I wouldn't be able to test etc., so as always, no rush! I wasn't sure with your last upload if I should test or not, but as I have to travel, I thought I'd at least try it and let you know.
