This script uses a modified pydub package (0.24.1) to segment and normalize raw conversational or speech audio files for machine learning. Compared with the standard pydub library, it removes the need for multiple loops when segmenting data.
Please be aware that this script may break if any other pydub version is used.
This script assumes that there is only one speaker. To find the optimal silence threshold and length for your audio, please use 'parameter_tester.ipynb'.
What this script does:
- Removes unnecessarily long pauses/silences, while retaining natural silences that indicate the speaker's thinking or use of fillers.
- Splits the audio files into 5-second intervals. Files that are too short are kept but labelled as "leftover".
- Normalizes amplitude, channel, and sampling rate.
- [Future Feature] Removes background noise if applicable
- [Future Feature] Generates and adds a unique ID for each file
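
The splitting and normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation (which relies on the modified pydub package). The names `split_fixed` and `load_normalized` are hypothetical, and the 16 kHz mono defaults are assumptions chosen only for the example. `split_fixed` relies only on `len()` and slicing, which a pydub `AudioSegment` provides in milliseconds.

```python
def split_fixed(audio, chunk_ms=5000):
    """Cut `audio` into consecutive pieces of `chunk_ms` each.

    `audio` can be anything that supports len() and slicing; a pydub
    AudioSegment reports its length and slices in milliseconds.
    Returns a list of (label, piece) pairs. A final piece shorter than
    `chunk_ms` is kept and labelled "leftover".
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    pieces = []
    for start in range(0, len(audio), chunk_ms):
        piece = audio[start:start + chunk_ms]
        label = "chunk" if len(piece) == chunk_ms else "leftover"
        pieces.append((label, piece))
    return pieces


def load_normalized(path, frame_rate=16000, channels=1):
    """Hypothetical helper: load a file with stock pydub and normalize
    channel count, sampling rate, and peak amplitude.

    The default frame rate and channel count are assumptions.
    """
    # Imported lazily so split_fixed can be used without pydub installed.
    from pydub import AudioSegment, effects

    sound = AudioSegment.from_file(path)
    sound = sound.set_channels(channels).set_frame_rate(frame_rate)
    return effects.normalize(sound)
```

With pydub installed, `split_fixed(load_normalized("speech.wav"))` would yield labelled 5-second segments, where `speech.wav` stands in for one of your own files.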
To find the optimal parameters for removing silence in your audio:
- Open parameter_tester.ipynb. This notebook takes a sample of your original file, which you can use to test and find the optimal silence length and threshold.
- Run the first cell to splice a sample from your original raw audio data.
- Adjust the parameters in `nonsilent_data = detect_nonsilent(normalized_sound, min_silence_len=4000, silence_thresh=-32, seek_step=1)`, then run the cell. It should output a series of time frames, for example: