
Trimming the files #1

Open
enlyth opened this issue Feb 17, 2023 · 1 comment

enlyth commented Feb 17, 2023

Has anyone had any luck trimming the files? I haven't had much time to look into it yet, but I messed around with some sox settings and couldn't find a configuration that consistently removes the click at the beginning.

I think we may need to write a custom script for this. I might give it a shot over the weekend when I have more time.

enlyth commented Feb 20, 2023

Okay, I've figured out how to trim this decently. The script still isn't perfect, but it's the best result I could achieve. It might be even better to tweak the padding and run the output through sox one more time afterwards, but this gets you close enough to train on the data.

After looking at the waveforms, I've also noticed that loudness normalization will be required before training. Personally, I would run another OpenAI Whisper transcription pass after trimming and normalization and discard outliers based on length, because trimming can occasionally go wrong over the course of such a huge dataset.
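As a rough sketch of the length-based outlier discard (not part of the script below; the function name, the MAD-based criterion, and the cutoff `k` are all illustrative choices), one could filter clip durations by their distance from the median:

```python
import statistics


def filter_outliers(durations, k=3.0):
    """Return indices of clips whose duration is within k median-absolute-
    deviations of the median duration.

    `durations` is a list of clip lengths in seconds; `k` is the cutoff
    (both the name and the default are illustrative, not from the script).
    """
    median = statistics.median(durations)
    mad = statistics.median(abs(d - median) for d in durations)
    if mad == 0:
        # All durations (nearly) identical: keep everything.
        return list(range(len(durations)))
    return [i for i, d in enumerate(durations) if abs(d - median) / mad <= k]


# Example: the 30-second clip is flagged as an outlier and dropped.
keep = filter_outliers([2.1, 2.4, 2.3, 30.0, 2.2])  # [0, 1, 2, 4]
```

The same idea applies to transcript length in characters; either way, a handful of discarded clips is cheap insurance against diarization or transcription failures on a large dataset.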

import os
import glob
import torchaudio
from pyannote.audio import Pipeline
from tqdm import tqdm

####################
# Config
dataset_dir = "/home/username/path/to/wav/files/"
output_dir = "/home/username/path/to/wav/output/"
padding_left_ms = 100
padding_right_ms = 25
huggingface_token = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
save_sliced = True
####################


# Load the pretrained pyannote speaker-diarization pipeline
# (requires a Hugging Face access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=huggingface_token,
)


def pad(num, zeroes):
    # Zero-pad a number to a fixed width (currently unused helper).
    return str(num).zfill(zeroes + 1)


def get_wav_files():
    # Collect all .wav files in the dataset directory.
    return glob.glob(os.path.join(dataset_dir, "*.wav"))


def trim(wav_file):
    # Diarize the file, then trim it to the span between the start of the
    # first detected speech segment and the end of the last one.
    output = pipeline(wav_file)
    waveform, sampling_rate = torchaudio.load(wav_file)
    num_channels, num_frames = waveform.shape

    start = output.get_timeline()[0].start
    end = output.get_timeline()[-1].end

    int_start = int(start * sampling_rate)
    int_end = int(end * sampling_rate)
    int_start_padded_clamped = max(
        0, int_start - padding_left_ms * sampling_rate // 1000
    )
    int_end_padded_clamped = min(
        num_frames, int_end + padding_right_ms * sampling_rate // 1000
    )
    tqdm.write(f"Start: {start}, End: {end}")
    tqdm.write(
        f"Start padded: {int_start_padded_clamped / sampling_rate}, End padded: {int_end_padded_clamped / sampling_rate}"
    )

    new_file = wav_file.replace(dataset_dir, output_dir)
    new_file_filename = os.path.basename(new_file)

    tqdm.write(f"New file: {new_file_filename}")

    # Append the padded start/end times (in seconds) so the trim points can
    # be reapplied later without re-saving the audio.
    with open(f"{output_dir}/voice_activity.txt", "a") as f:
        f.write(
            f"{new_file_filename}|{int_start_padded_clamped / sampling_rate}|{int_end_padded_clamped / sampling_rate}\n"
        )

    if save_sliced:
        sliced_waveform = waveform[:, int_start_padded_clamped:int_end_padded_clamped]
        torchaudio.save(new_file, sliced_waveform, sampling_rate)


def main():
    os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists
    wav_files = get_wav_files()
    print(f"Found {len(wav_files)} wav files.")
    for wav_file in tqdm(wav_files, desc="Trimming"):
        trim(wav_file)


if __name__ == "__main__":
    main()
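For the loudness normalization mentioned above, here is a minimal sketch using simple RMS normalization with NumPy (the function name and the -20 dBFS target are illustrative; a perceptual loudness measure such as ITU-R BS.1770, e.g. via the pyloudnorm package, would track perceived loudness more closely):

```python
import numpy as np


def rms_normalize(waveform, target_dbfs=-20.0):
    """Scale a float waveform (samples in [-1, 1]) to a target RMS level.

    `target_dbfs` is the desired RMS level in dB relative to full scale;
    both the name and the -20 dBFS default are illustrative choices.
    """
    rms = np.sqrt(np.mean(waveform ** 2))
    if rms == 0:
        # Silent input: nothing to scale.
        return waveform
    target_rms = 10 ** (target_dbfs / 20.0)
    return waveform * (target_rms / rms)


# Example: a square-ish wave with RMS 0.5 is scaled down to RMS 0.1.
audio = np.array([0.5, -0.5, 0.5, -0.5])
normalized = rms_normalize(audio)
```

This would run on each trimmed file before the final Whisper pass; clipping is worth checking afterwards, since scaling up quiet clips can push peaks past full scale.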
