Releases: bricky149/tlap-rs
A Minor Update
Silence is Golden (with hotfix)
This release fixes the known issue where reading back files for streaming could result in the thread doing the reading not finishing before the next call. Audio files are now split based on assumed silence, instead of every 64k samples (four seconds at 16 kHz). For now, this only applies to recorded audio, not real-time captures.
This also contains a hotfix for a regression where poor-quality recordings were incorrectly marked as 'done' with no subtitles written, because the detected silence was not 'true' silence.
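As a rough illustration of the idea (not tlap's actual code), silence-based splitting could look something like the sketch below. The threshold and minimum run length are made-up values, and real recordings would need tuning:

```rust
/// Hypothetical values, not tlap's: amplitude below which a sample is
/// assumed to be silence, and how long a quiet run must be to split on.
const SILENCE_THRESHOLD: u16 = 200;
const MIN_SILENCE_SAMPLES: usize = 8_000; // half a second at 16 kHz

fn split_on_silence(samples: &[i16]) -> Vec<Vec<i16>> {
    let mut lines = Vec::new();
    let mut current = Vec::new();
    let mut quiet_run = 0;

    for &sample in samples {
        current.push(sample);
        // unsigned_abs() avoids the overflow abs() would hit on i16::MIN
        if sample.unsigned_abs() <= SILENCE_THRESHOLD {
            quiet_run += 1;
            // A long enough quiet run is assumed to be a gap between lines
            if quiet_run >= MIN_SILENCE_SAMPLES {
                lines.push(std::mem::take(&mut current));
                quiet_run = 0;
            }
        } else {
            quiet_run = 0;
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```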
TL;DR
- Refactor streaming code to read only data written since the last read (see the sketch after this list)
- Split audio lines where we think there is silence, rather than every 64k samples
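A minimal sketch of that incremental read, using only the standard library; the polling interval and offset bookkeeping here are assumptions rather than tlap's actual implementation:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: poll a file being written to elsewhere and
/// hand back only the bytes appended since the last read.
fn stream_new_data(path: &str) -> std::io::Result<()> {
    let mut file = File::open(path)?;
    let mut offset = 0u64;

    loop {
        // Resume from wherever the previous read stopped
        file.seek(SeekFrom::Start(offset))?;
        let mut new_bytes = Vec::new();
        let read = file.read_to_end(&mut new_bytes)? as u64;
        offset += read;

        if read > 0 {
            // Transcribe only the newly written samples here
        }
        thread::sleep(Duration::from_secs(4));
    }
}
```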
Strong and Stable
This release focuses on clearing up unwrap() calls and handling Err() cases. Despite additional code (and nesting), the resulting binary is smaller, thanks to refactoring the subtitle-writing calls into a single function and changing how timestamps are generated from durations.
There may be a future release that standardises the code style to match idiomatic Rust patterns. It depends on whether I would rather spend that time on other features I would appreciate, like punctuation and detecting silence instead of splitting a file's samples into four-second lines.
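For instance, generating SRT-style timestamps from a Duration fits in one small function. This is an illustrative sketch, not necessarily how tlap does it, with a small test of the kind mentioned below:

```rust
use std::time::Duration;

/// Illustrative helper: format a Duration as an SRT timestamp (HH:MM:SS,mmm).
fn srt_timestamp(t: Duration) -> String {
    let ms = t.as_millis();
    format!(
        "{:02}:{:02}:{:02},{:03}",
        ms / 3_600_000,
        (ms / 60_000) % 60,
        (ms / 1_000) % 60,
        ms % 1_000
    )
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn timestamp_format() {
        assert_eq!(srt_timestamp(Duration::from_millis(4_000)), "00:00:04,000");
    }
}
```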
TL;DR
- Add a test for subtitle output
- Change Options to Results as all data is needed during execution
- Reduce chance of panicking by covering every Err() case (see the sketch after this list)
- Miscellaneous SemVer-breaking changes, despite no new features
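As promised above, a sketch of the pattern; this is hypothetical code rather than tlap's, but it shows the shape of the change: subtitle writing funnels through one fallible function and callers handle the Err() case instead of unwrapping.

```rust
use std::fs::File;
use std::io::Write;

/// Hypothetical single entry point for subtitle writing, returning a
/// Result so callers can recover instead of panicking.
fn write_subtitle(path: &str, entry: &str) -> std::io::Result<()> {
    let mut file = File::options().create(true).append(true).open(path)?;
    file.write_all(entry.as_bytes())
}

fn main() {
    // Handle the Err() case rather than calling unwrap()
    if let Err(e) = write_subtitle("out.srt", "1\n00:00:00,000 --> 00:00:04,000\nHello\n\n") {
        eprintln!("Could not write subtitle: {e}");
    }
}
```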
Known issues
- Given a long enough file, the thread that reads input back might run longer than the sleep period
The (inevitable) Coqui Rewrite
When this program was first written in December 2021, DeepSpeech was the only library I could use as it had Rust bindings. In February 2022, Rust bindings appeared for Coqui. It only seemed right to switch to it.
As I was doing that, I found I could no longer feed an input stream into Coqui's intermediate_decode() due to an API change somewhere in either the library itself or its bindings. Instead, I opted to write the input to a file first and then read it back on another thread for Coqui's speech_to_text() to do its magic. This removed the eager looping code (thus saving CPU cycles) and the latency I was getting before. I had to switch from PortAudio to cpal to make that happen, a positive side effect being that this feature may now work on non-Linux platforms.
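For context, the capture side with cpal might look roughly like the following. This is a sketch against cpal 0.15's API that assumes the device supports i16 input, with the file handoff simplified; real code would avoid blocking I/O inside the audio callback:

```rust
use std::fs::File;
use std::io::Write;

use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let host = cpal::default_host();
    let device = host.default_input_device().ok_or("no input device")?;
    let config = device.default_input_config()?;

    let mut file = File::create("capture.raw")?;
    let stream = device.build_input_stream(
        &config.into(),
        move |data: &[i16], _: &cpal::InputCallbackInfo| {
            // Simplified: dump raw little-endian samples straight to disk
            for &sample in data {
                let _ = file.write_all(&sample.to_le_bytes());
            }
        },
        |err| eprintln!("stream error: {err}"),
        None, // no timeout
    )?;
    stream.play()?;
    std::thread::sleep(std::time::Duration::from_secs(10));
    Ok(())
}
```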
I cut out code that wasn't part of the core functionality. For example, input would be resampled if it didn't match the 16kHz rate the model was using. ffmpeg does a better job of this, so I removed the related code and crates, cutting dependencies by around half. I then split the code into 'speech' and 'subtitle' domains, which forced me to refactor it somewhat.
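If you need the resampling that was removed, something like `ffmpeg -i input.mkv -ar 16000 -ac 1 output.wav` (an illustrative invocation, not a documented tlap workflow) should produce the 16 kHz mono audio the model expects.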
Unfortunately, Coqui doesn't link their library to a CUDA runtime. This means it'll only run on the CPU. Every CPU cycle that can be saved counts!
TL;DR
- Major codebase rewrite
- Migrated from deprecated DeepSpeech dependencies to coqui-stt
- Migrated from PortAudio to cpal, allowing for cross-platform feature parity
- Removed resampling functionality, cutting external crates almost in half
- Separated speech-related and subtitle-related code into their own files
- Reworked sub streaming so as not to pin CPU usage at 100%
- Threaded sub streaming to reduce transcription latency
- As Coqui does not offer CUDA binaries, CUDA support has been removed
Known issues
- Given a long enough file, the thread that reads input back might run longer than the sleep period