
Trimming the files #1

Open
enlyth opened this issue Feb 17, 2023 · 1 comment

enlyth commented Feb 17, 2023

Has anyone had any luck trimming the files? I haven't had much time to look into it yet, but I messed around with some sox settings and couldn't find a configuration that consistently removes the click at the beginning.

I think we may need to write a custom script for this. I might give it a shot over the weekend when I have more time.

enlyth commented Feb 20, 2023

Okay, I've figured out how to trim this decently. The script still isn't perfect, but it's the best result I could achieve. It might be even better to tweak the padding and run the output through sox one more time afterwards, but this gets you close enough to train on the data.

After looking at the waveforms, I've also noticed that loudness normalization will be required before training. Personally, I would run another OpenAI Whisper transcription pass after trimming and normalization and discard outliers based on length, because trimming can occasionally go wrong over the course of such a huge dataset.
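As a rough sketch of the length-based outlier discard (not part of the script below; the function name, the MAD-based criterion, and the cutoff `k` are all illustrative choices), one could filter clip durations by their distance from the median:

```python
import statistics


def filter_outliers(durations, k=3.0):
    """Return indices of clips whose duration is within k median-absolute-
    deviations of the median duration.

    `durations` is a list of clip lengths in seconds; `k` is the cutoff
    (both the name and the default are illustrative, not from the script).
    """
    median = statistics.median(durations)
    mad = statistics.median(abs(d - median) for d in durations)
    if mad == 0:
        # All durations (nearly) identical: keep everything.
        return list(range(len(durations)))
    return [i for i, d in enumerate(durations) if abs(d - median) / mad <= k]


# Example: the 30-second clip is flagged as an outlier and dropped.
keep = filter_outliers([2.1, 2.4, 2.3, 30.0, 2.2])  # [0, 1, 2, 4]
```

The same idea applies to transcript length in characters; either way, a handful of discarded clips is cheap insurance against diarization or transcription failures on a large dataset.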

import os
import glob
import torchaudio
from pyannote.audio import Pipeline
from tqdm import tqdm

####################
# Config
dataset_dir = "/home/username/path/to/wav/files/"
output_dir = "/home/username/path/to/wav/output/"
padding_left_ms = 100
padding_right_ms = 25
huggingface_token = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
save_sliced = True
####################


# Load the pretrained pyannote speaker-diarization pipeline
# (requires a Hugging Face access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=huggingface_token,
)


def pad(num, zeroes):
    # Zero-pad a number to a fixed width (currently unused helper).
    return str(num).zfill(zeroes + 1)


def get_wav_files():
    # Collect all .wav files in the dataset directory.
    return glob.glob(os.path.join(dataset_dir, "*.wav"))


def trim(wav_file):
    # Diarize the file, then trim it to the span between the start of the
    # first detected speech segment and the end of the last one.
    output = pipeline(wav_file)
    waveform, sampling_rate = torchaudio.load(wav_file)
    num_channels, num_frames = waveform.shape

    start = output.get_timeline()[0].start
    end = output.get_timeline()[-1].end

    int_start = int(start * sampling_rate)
    int_end = int(end * sampling_rate)
    int_start_padded_clamped = max(
        0, int_start - padding_left_ms * sampling_rate // 1000
    )
    int_end_padded_clamped = min(
        num_frames, int_end + padding_right_ms * sampling_rate // 1000
    )
    tqdm.write(f"Start: {start}, End: {end}")
    tqdm.write(
        f"Start padded: {int_start_padded_clamped / sampling_rate}, End padded: {int_end_padded_clamped / sampling_rate}"
    )

    new_file = wav_file.replace(dataset_dir, output_dir)
    new_file_filename = os.path.basename(new_file)

    tqdm.write(f"New file: {new_file_filename}")

    # Append the padded start/end times (in seconds) so the trim points can
    # be reapplied later without re-saving the audio.
    with open(f"{output_dir}/voice_activity.txt", "a") as f:
        f.write(
            f"{new_file_filename}|{int_start_padded_clamped / sampling_rate}|{int_end_padded_clamped / sampling_rate}\n"
        )

    if save_sliced:
        sliced_waveform = waveform[:, int_start_padded_clamped:int_end_padded_clamped]
        torchaudio.save(new_file, sliced_waveform, sampling_rate)


def main():
    os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists
    wav_files = get_wav_files()
    print(f"Found {len(wav_files)} wav files.")
    for wav_file in tqdm(wav_files, desc="Trimming"):
        trim(wav_file)


if __name__ == "__main__":
    main()
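For the loudness normalization mentioned above, here is a minimal sketch using simple RMS normalization with NumPy (the function name and the -20 dBFS target are illustrative; a perceptual loudness measure such as ITU-R BS.1770, e.g. via the pyloudnorm package, would track perceived loudness more closely):

```python
import numpy as np


def rms_normalize(waveform, target_dbfs=-20.0):
    """Scale a float waveform (samples in [-1, 1]) to a target RMS level.

    `target_dbfs` is the desired RMS level in dB relative to full scale;
    both the name and the -20 dBFS default are illustrative choices.
    """
    rms = np.sqrt(np.mean(waveform ** 2))
    if rms == 0:
        # Silent input: nothing to scale.
        return waveform
    target_rms = 10 ** (target_dbfs / 20.0)
    return waveform * (target_rms / rms)


# Example: a square-ish wave with RMS 0.5 is scaled down to RMS 0.1.
audio = np.array([0.5, -0.5, 0.5, -0.5])
normalized = rms_normalize(audio)
```

This would run on each trimmed file before the final Whisper pass; clipping is worth checking afterwards, since scaling up quiet clips can push peaks past full scale.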
