Trimming the files #1
Okay, I have figured out how to trim this decently. The script below still isn't perfect, but it's the best result I could achieve. It might be even better to play around with the padding and run the files through one more time. Looking at the waveforms, I've also noticed that loudness normalization will be required before training. Personally, I would do another OpenAI Whisper transcription pass after trimming and normalization, and discard outliers based on length, because trimming can sometimes go wrong over the course of such a huge dataset.

```python
import os
import glob

import torchaudio
from pyannote.audio import Pipeline
from tqdm import tqdm

####################
# Config
dataset_dir = "/home/username/path/to/wav/files/"
output_dir = "/home/username/path/to/wav/output/"
padding_left_ms = 100
padding_right_ms = 25
huggingface_token = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
save_sliced = True
####################

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=huggingface_token,
)


def pad(num, zeroes):
    return str(num).zfill(zeroes + 1)


def get_wav_files():
    return glob.glob(os.path.join(dataset_dir, "*.wav"))


def trim(wav_file):
    # Run speaker diarization to find where speech starts and ends.
    output = pipeline(wav_file)
    waveform, sampling_rate = torchaudio.load(wav_file)
    num_channels, num_frames = waveform.shape

    # Start of the first speech segment and end of the last one, in seconds.
    start = output.get_timeline()[0].start
    end = output.get_timeline()[-1].end
    int_start = int(start * sampling_rate)
    int_end = int(end * sampling_rate)

    # Apply padding and clamp to the valid sample range.
    int_start_padded_clamped = max(
        0, int_start - padding_left_ms * sampling_rate // 1000
    )
    int_end_padded_clamped = min(
        num_frames, int_end + padding_right_ms * sampling_rate // 1000
    )

    tqdm.write(f"Start: {start}, End: {end}")
    tqdm.write(
        f"Start padded: {int_start_padded_clamped / sampling_rate}, "
        f"End padded: {int_end_padded_clamped / sampling_rate}"
    )

    new_file = wav_file.replace(dataset_dir, output_dir)
    new_file_filename = os.path.basename(new_file)
    tqdm.write(f"New file: {new_file_filename}")

    # Log the detected voice activity window for every file.
    with open(os.path.join(output_dir, "voice_activity.txt"), "a") as f:
        f.write(
            f"{new_file_filename}|{int_start_padded_clamped / sampling_rate}|"
            f"{int_end_padded_clamped / sampling_rate}\n"
        )

    if save_sliced:
        sliced_waveform = waveform[:, int_start_padded_clamped:int_end_padded_clamped]
        torchaudio.save(new_file, sliced_waveform, sampling_rate)


def main():
    os.makedirs(output_dir, exist_ok=True)
    wav_files = get_wav_files()
    print(f"Found {len(wav_files)} wav files.")
    for wav_file in tqdm(wav_files, desc="Trimming"):
        trim(wav_file)


if __name__ == "__main__":
    main()
```
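On the loudness point, here is a minimal sketch of what a normalization pass could look like. It is RMS-based and pure NumPy; the `normalize_rms` helper and the -23 dBFS target are my own choices, not anything from this repo. Proper perceptual loudness normalization would measure LUFS instead (e.g. with pyloudnorm), but the RMS version shows the idea:

```python
import numpy as np


def normalize_rms(waveform: np.ndarray, target_dbfs: float = -23.0) -> np.ndarray:
    """Scale a waveform so its RMS level hits target_dbfs (hypothetical helper)."""
    rms = np.sqrt(np.mean(waveform ** 2))
    if rms == 0:
        return waveform  # silent clip, nothing to scale
    target_rms = 10 ** (target_dbfs / 20)
    return waveform * (target_rms / rms)


# Example: a quiet 440 Hz tone gets boosted to the target level.
t = np.linspace(0, 1, 16000, endpoint=False)
quiet = 0.01 * np.sin(2 * np.pi * 440 * t)
normalized = normalize_rms(quiet)
```

Running this over every trimmed file before training would put all clips at a consistent level; clipping checks (peak > 1.0 after scaling) would still be worth adding for very quiet inputs.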
Anyone had any luck trimming the files? I haven't had much time to look into it yet, but I messed around with some `sox` settings and couldn't find a configuration that consistently gets rid of the click at the beginning everywhere. I think we might need to write a custom script to do this. I might give it a shot over the weekend when I have more time.
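One angle such a custom script could take, instead of `sox`: energy-based onset detection that requires the signal to *stay* loud for several consecutive windows, so a brief click doesn't count as speech. A rough NumPy sketch; the window size, threshold, and `min_windows` values are guesses that would need tuning on the real data:

```python
import numpy as np


def speech_onset(waveform, sr, window_ms=10, threshold=0.01, min_windows=5):
    """Return the sample index where sustained speech begins (assumed parameters).

    A lone loud window (e.g. a click) is ignored; only min_windows
    consecutive windows above threshold count as speech.
    """
    win = max(int(sr * window_ms / 1000), 1)
    n = len(waveform) // win
    if n == 0:
        return 0
    frames = waveform[: n * win].reshape(n, win)
    loud = np.sqrt(np.mean(frames ** 2, axis=1)) > threshold
    run = 0
    for i, is_loud in enumerate(loud):
        run = run + 1 if is_loud else 0
        if run == min_windows:
            # Back up to the first window of the sustained run.
            return (i - min_windows + 1) * win
    return 0


# Synthetic clip: a 5 ms click, ~0.2 s of silence, then a 1 s tone.
sr = 16000
click = 0.5 * np.ones(80)
gap = np.zeros(3120)
speech = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
audio = np.concatenate([click, gap, speech])
cut = speech_onset(audio, sr)  # 3200: the click and the gap are skipped
trimmed = audio[cut:]
```

The diarization script in the first comment sidesteps this by trusting pyannote's speech timeline, but a lightweight check like this could catch files where the click fools the model.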