Insertion of numbers in transcriptions #7

Open
woojayjeon opened this issue Feb 26, 2024 · 3 comments

Comments

@woojayjeon

In the Kaldi-format transcription files, numbers that are not present in the audio are sometimes inserted. For example:

medium/2149/mark_wnt_mp_0808_librivox_64kb_mp3/mark_2_weymouth_64kb_39 
008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH AND OTHERS THAT IT IS 
ONE OF THE PROPHETS 008 029 THEN HE ASKED THEM POINTEDLY BUT YOU YOURSELVES 
WHO DO YOU SAY THAT I AM

Above, 008 028 and 008 029 seem to be Bible verse numbers that are in the original text but not actually read. I have verified this by listening to the audio sample.
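
A minimal sketch for finding other affected utterances, assuming each line of the Kaldi-format text file is `<utterance-id> <transcript words...>` and that genuinely spoken numbers would be spelled out as words (I have not confirmed either for the whole corpus). The function name `find_digit_tokens` and the utterance ids in the sample are made up for illustration:

```python
import re

# A token made up only of digits, e.g. "008" or "028".
DIGIT_TOKEN = re.compile(r"\d+")


def find_digit_tokens(kaldi_text_lines):
    """Yield (utt_id, digit_tokens) for every line that has pure-digit tokens.

    Assumes the first whitespace-separated field is the utterance id and
    the rest is the transcript.
    """
    for line in kaldi_text_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty lines and lines without a transcript
        utt_id, words = parts[0], parts[1:]
        hits = [w for w in words if DIGIT_TOKEN.fullmatch(w)]
        if hits:
            yield utt_id, hits


# Hypothetical utterance ids; the transcripts are taken from the sample above.
sample = [
    "utt1 008 028 JOHN THE BAPTIST THEY REPLIED",
    "utt2 WHO DO YOU SAY THAT I AM",
]
flagged = list(find_digit_tokens(sample))
```

This only flags candidates. Each hit would still need to be checked against the audio, since some digit tokens could be legitimate.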

@pkufool
Collaborator

pkufool commented Mar 7, 2024

Hmm, that is possible. We will check whether there is a bug in the alignment tools.

@ex3ndr

ex3ndr commented Jul 28, 2024

They are all over the place. I think these are page numbers or something similar. Sometimes they are even Roman numerals, sometimes they are in brackets, and sometimes they appear as-is.

@pkufool
Collaborator

pkufool commented Aug 2, 2024

Hmm, we filter the segments according to the Levenshtein distance between the original text and the transcript text. When a segment is long, this kind of insertion may not change the overall distance much, so the distance is still below the given threshold. We have not yet figured out how to fix this bug.
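
To illustrate the effect described above, here is a rough sketch using the segment from the original report. It assumes a word-level Levenshtein distance normalized by the length of the original text, and a threshold of 0.15. Both the normalization and the threshold value are hypothetical and chosen only for illustration, not taken from the actual filtering code:

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


# Original text, including the verse numbers that are never read aloud.
original = (
    "008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH AND OTHERS "
    "THAT IT IS ONE OF THE PROPHETS 008 029 THEN HE ASKED THEM POINTEDLY "
    "BUT YOU YOURSELVES WHO DO YOU SAY THAT I AM"
).split()

# What is actually spoken: the same words without the digit tokens.
spoken = [w for w in original if not w.isdigit()]

THRESHOLD = 0.15  # hypothetical value, for illustration only

dist = levenshtein(original, spoken)
ratio = dist / len(original)
passes = ratio <= THRESHOLD  # True here: the segment is kept despite the insertions
```

With four unread tokens in a 37-token segment, the normalized distance is about 0.11, so the segment passes a threshold of this size. The same four tokens in a much shorter segment would push the ratio above it.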
