Associate multiple version of reference #54

Open
pkufool opened this issue Sep 11, 2023 · 0 comments

pkufool commented Sep 11, 2023

Some audio may have multiple versions of the reference text (for example, YouTube videos can have both automatic and manual subtitles). To associate both references with the audio segments, I think we can first align the audio with one of the references, which gives us the "begin_byte" and "end_byte" of the chosen reference for each segment. We can then associate the second reference by doing a Levenshtein alignment between the segment's text from the first reference and the substring of the second reference delimited by "begin_byte" and "end_byte" (extended somewhat on both sides, of course). This assumes the two references are very close.
Done this way, we don't have to change the core part of our code: we just store the second reference in a custom field in the cut and associate it when writing out the results. The segments are short (from 2 to 30 seconds), so the Levenshtein alignment would be very fast, and we can run it in parallel on multiple CPU cores.
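
A rough sketch of the second step, to make the idea concrete. It is not code from the project: the function name `align_to_second_reference`, the `margin` parameter, and the use of character offsets (instead of the byte offsets mentioned above) are all assumptions made for illustration. It runs a semi-global Levenshtein alignment, so the segment text may start and end anywhere inside the extended window of the second reference.

```python
def align_to_second_reference(segment, second_ref, begin, end, margin=20):
    """Find the span of `second_ref` that best matches `segment`.

    `begin`/`end` are the segment's offsets in the FIRST reference; they are
    reused as a rough guess for the second one and widened by `margin`.
    Offsets are character offsets here (an assumption for simplicity).
    Returns (start, end, edit_cost) with offsets into `second_ref`.
    """
    lo = max(0, begin - margin)
    hi = min(len(second_ref), end + margin)
    window = second_ref[lo:hi]
    m, n = len(segment), len(window)

    # Row 0: matching an empty prefix costs nothing wherever we start,
    # which is what lets the match begin anywhere in the window.
    prev_cost = [0] * (n + 1)
    prev_start = list(range(n + 1))

    for i in range(1, m + 1):
        cur_cost = [i] + [0] * n   # segment[:i] vs. empty text: i deletions
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = prev_cost[j - 1] + (segment[i - 1] != window[j - 1])
            skip_seg = prev_cost[j] + 1      # segment char has no counterpart
            skip_win = cur_cost[j - 1] + 1   # window char has no counterpart
            best = min(sub, skip_seg, skip_win)
            cur_cost[j] = best
            # Carry along where the best-matching substring started.
            if best == sub:
                cur_start[j] = prev_start[j - 1]
            elif best == skip_seg:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
        prev_cost, prev_start = cur_cost, cur_start

    # The match may also end anywhere: take the cheapest end position.
    best_end = min(range(n + 1), key=lambda j: prev_cost[j])
    return lo + prev_start[best_end], lo + best_end, prev_cost[best_end]


# Toy example: the second reference has extra punctuation and casing.
second_ref = "Well, the quick, brown fox jumps over the lazy dog today."
start, end, cost = align_to_second_reference(
    "the quick brown fox", second_ref, begin=5, end=24, margin=8
)
print(second_ref[start:end], cost)  # the quick, brown fox 1
```

Each segment is aligned independently, so the calls could be spread over several worker processes. The cost is proportional to the segment length times the window length, which stays small for segments of this size.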

@danpovey @npovey If you need help with this, it would be good if you could share a small subset of the data with me so I can add an example recipe to the project.
