This is the repo for the SPADE AudioBNC cleaning script, which produces a subset of high-quality utterances from the corpus, split into speaker tiers.
To reproduce the dataset, place the requested textgrids in the input directory. You must also have a symbolic link (or just a directory) for both the wav and textgrid directories, named wavs and textgrid respectively. Finally, change the directory names at the top of the following scripts: do_pipeline.sh, aligner_difference.py, speaker_data.py.

- PREFIX: should be changed to wherever you have put the pipeline folder.
- MFA_DIR: should point to a directory containing the mfa_align binary for MFA.
- AUDIO_BNC_DIR: should point to wherever the Texts directory of the BNC is located.

Then run the do_pipeline.sh script. This will take a considerable amount of time (upwards of 24 hours) to run on the overall corpus.
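
Since a full run is long, it may be worth confirming the directory layout before starting. The following is a minimal sketch, not part of the pipeline: the function name `missing_dirs` is hypothetical, and it assumes that input, wavs and textgrid all sit directly inside the pipeline folder (the PREFIX value).

```python
from pathlib import Path

# Directory names taken from the setup notes above.
# Assumption: all three live directly under the pipeline folder.
REQUIRED_DIRS = ("input", "wavs", "textgrid")


def missing_dirs(prefix):
    """Return the required directories that are absent under prefix.

    Path.is_dir() follows symbolic links, so a symlink that points to
    an existing directory counts as present.
    """
    root = Path(prefix)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
```

Calling `missing_dirs` with the pipeline folder path returns an empty list when everything is in place, and otherwise lists the names that still need to be created or linked.
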
- output_dictionary.py: Runs over all textgrids and generates pronunciation.txt, containing all words and their pronunciations for MFA.
- output_mfa_formatted.py: Runs over all textgrids and replaces them with labeled utterances for use in MFA. Also cuts each wav file down to just the part used in a given textgrid, again for MFA.
- aligner-difference.py: Calculates HNR and aligner-difference for all textgrids in output.
- classify.py: Goes over textgrids output by aligner-difference.py and decides whether to classify them as good or bad based on the previously described classifier.
- speaker_data.py: Splits output from classify.py into speaker tiers based on the XML transcripts.
- reduced_data_set.py: Goes over output from speaker_data.py and deletes all utterances not labeled "good". Additionally deletes tiers containing feature values.
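
The scripts above form a chain in which each one consumes the previous one's output. The sketch below shows one way to express that chain in Python. The run order is an assumption inferred from the descriptions, not taken from do_pipeline.sh, and the MFA alignment itself (which presumably runs between the two groups) is deliberately left out. The names `build_commands` and `run_stages` are hypothetical.

```python
import subprocess
import sys

# Script names copied from the list above. The order is an assumption
# inferred from each script's stated input and output.
STAGES_BEFORE_MFA = ["output_dictionary.py", "output_mfa_formatted.py"]
STAGES_AFTER_MFA = [
    "aligner-difference.py",
    "classify.py",
    "speaker_data.py",
    "reduced_data_set.py",
]


def build_commands(python=sys.executable):
    """Return the Python stages as argument lists, in the assumed order."""
    stages = STAGES_BEFORE_MFA + STAGES_AFTER_MFA
    return [[python, script] for script in stages]


def run_stages(commands, dry_run=True):
    """Print each command, and execute it as well when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the run at the first failing stage.
            subprocess.run(command, check=True)
```

With the default `dry_run=True`, `run_stages(build_commands())` only prints the commands, which is a cheap way to review the order before committing to a long run.
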

- requirements.txt: List of required pip packages in Python. To install, run pip install -r requirements.txt
- pronunciation.txt: List of pronunciations for all words in AudioBNC, for MFA.
- input: A directory with all the AudioBNC textgrids you wish to clean.
- wavs: Directory or symlink to directory of all the AudioBNC wavs.
- textgrid: Directory or symlink to directory of all the AudioBNC TextGrids.
- output: Output from MFA.
- classify_grids: Output from aligner-difference.py, TextGrids which have yet to be classified.
- corpus_for_mfa: Directory containing TextGrids to be used by MFA.
- out_with_labels: Classified textgrids which have not yet been speakerised.
- speakered_textgrids_chunked: Cleaned textgrids with labels describing quality of utterances, split into speaker tiers. Still contains all data, including feature tiers.
- cleaned_textgrids: Final product of pipeline, to be used in SPADE. Includes "good" utterances split into speaker tiers.
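
The last step, from speakered_textgrids_chunked to cleaned_textgrids, amounts to a filter: keep the utterances labeled "good" and drop the feature tiers. The sketch below illustrates that idea on a plain dictionary. It is not the code in reduced_data_set.py, and the data layout, the tier names and the function name `clean_tiers` are all hypothetical.

```python
def clean_tiers(tiers, feature_tiers=()):
    """Keep only intervals labeled "good" and drop feature tiers.

    tiers maps a tier name to a list of (start, end, text, label) tuples.
    This layout is hypothetical; it is not the TextGrid format itself.
    """
    cleaned = {}
    for name, intervals in tiers.items():
        if name in feature_tiers:
            continue  # feature-value tiers are removed entirely
        cleaned[name] = [iv for iv in intervals if iv[3] == "good"]
    return cleaned


# Made-up example data, for illustration only.
example = {
    "speaker_A": [(0.0, 1.2, "hello there", "good"), (1.2, 2.0, "um", "bad")],
    "speaker_B": [(2.0, 3.5, "how are you", "good")],
    "hnr": [(0.0, 1.2, "14.1", "")],
}
result = clean_tiers(example, feature_tiers=("hnr",))
```

Here `result` keeps one interval for each speaker tier and no longer contains the `hnr` tier.
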