Skip to content

lang-uk/semesyuk-to-text

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

POC that can cut audiobooks into individual sentences using ebook and produce training datasets

Requires vosk and ukrainian vosk model.

Copyright on all the texts and audio samples in this project belongs to Ivan Semesyuk Audiobook is available for free at Слухай

To try on your own examples:

  • Download the model for your desired language from vosk website
  • Download publicly available audiobook and extract the fragment (a small chapter for example)
  • Download publicly available ebook, convert it to txt, extract the same chapter and remove unwanted fragments (page numbers, etc if any)
  • Tokenize the extracted chapter using your favorite tokenizer (each sentence on a separate line, tokens separated with and pre-defined character like space or pipe). For Ukrainian language I recommend nlp-uk tokenizer e.g groovy TokenizeText.groovy -w -i texts/raw/prologue.txt -o texts/tokenized/prologue.txt
  • Amend the notebook to point it to the right model, text and audio file
  • Run it
  • Pray
  • Review the results

Known limitations:

  • Unfortunatelly my experiments with dynamic vosk model that can use the specially crafted vocabulary to improve the recognition didn't work. It might be due the incorrectly trained model for Ukrainian I had at hands
  • On my data it gave me a happy user path, matching all the sentences properly. Your mileage might vary, especially if the quality of your model is mediocre or the audio input is bad

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published