
Better captioning #8

Open
veltman opened this issue Jul 21, 2016 · 8 comments

@veltman
Contributor

veltman commented Jul 21, 2016

Have a mostly-working branch that allows for entering and positioning multiple captions, but the manual entry/interface is a real drag, especially for a long video. Worth exploring some improvements.

Forced aligners?

Using a forced aligner like Gentle to take a bulk transcript and automatically time it to the audio would help - then you could type in the whole thing (or paste from a transcript) and it could automatically break it into chunks.

Pros: Much faster if you have a full transcript already (paste the whole thing rather than pasting line-by-line and tweaking the timing).
Cons: Not much faster if you don't have a transcript. A lot more code complexity (all the OSS aligners seem to be Python). Would probably still need to tweak the captions into sensible breaks (e.g. avoid orphan words).
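That "sensible breaks" tweak could plausibly be automated too. A minimal Python sketch, assuming word-level `(text, start, end)` timings have already come back from an aligner; the `chunk_captions` name and the 32-character budget are made up for illustration:

```python
MAX_CHARS = 32  # rough per-caption character budget (arbitrary)

def chunk_captions(words, max_chars=MAX_CHARS):
    """words: list of (text, start_sec, end_sec) tuples from an aligner.
    Greedily groups words into caption chunks by character budget,
    then merges a trailing one-word chunk to avoid orphan words."""
    chunks, current = [], []
    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(current)
    # Merge an orphan (single-word) final chunk into the previous one.
    if len(chunks) > 1 and len(chunks[-1]) == 1:
        chunks[-2].extend(chunks.pop())
    return [
        {"text": " ".join(w[0] for w in c), "start": c[0][1], "end": c[-1][2]}
        for c in chunks
    ]
```

A real version would also want to break at punctuation and pauses, but even this greedy pass would cut down the manual tweaking.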

Auto transcribe

Use some sort of speech-to-text to take a first pass at transcribing the audio. In-browser options include PocketSphinx and the Web Speech API in certain browsers. Server-side options include normal Sphinx or the Watson API.

Pros: Great when it works.
Cons: Doesn't always work, especially for non-English languages or clips with music, background noise, etc. Still doesn't work out timing. If it's server-side, would require a second round-trip before the form submission. Could take a long time for long pieces of audio.

Parse timestamped transcripts?

Could allow people to upload an SRT or some other timecoded transcript format in the editor. The parsing wouldn't be that hard, but it's unclear how often audio orgs use these.
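For reference, the parsing really wouldn't be hard. A minimal stdlib Python sketch of an SRT cue parser (`parse_srt` is a hypothetical name, and this skips edge cases like styling tags and cue positioning):

```python
import re

# SRT cues are blank-line-separated blocks of:
# index / "HH:MM:SS,mmm --> HH:MM:SS,mmm" / one or more text lines.
TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
CUE_RE = re.compile(TIME + r"\s*-->\s*" + TIME)

def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(srt_text):
    """Return a list of {"start", "end", "text"} cues, times in seconds."""
    cues = []
    for block in re.split(r"\r?\n\r?\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = CUE_RE.search(line)
            if m:
                start = _seconds(*m.groups()[:4])
                end = _seconds(*m.groups()[4:])
                text = " ".join(lines[i + 1:]).strip()
                cues.append({"start": start, "end": end, "text": text})
                break
    return cues
```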

@veltman veltman changed the title Support SRT files for closed captioning Better closed captioning Jul 29, 2016
@veltman
Contributor Author

veltman commented Jul 30, 2016

Looks like the Web Speech API doesn't provide any way to connect it to a non-mic source, but PocketSphinx does (with some fiddling).

@veltman veltman changed the title Better closed captioning Better captioning Aug 1, 2016
@veltman veltman mentioned this issue Aug 1, 2016
Closed
@veltman veltman added this to the v1.0 milestone Aug 1, 2016
@kookster

kookster commented Aug 1, 2016

You could also use other APIs like Speechmatics (https://speechmatics.com/) or Google Cloud Speech (https://cloud.google.com/speech/)?

@veltman
Contributor Author

veltman commented Aug 1, 2016

Yup, true - though I'm a little reluctant to rely on an external API rather than something that can be bundled (ditto Watson).

@pietrop

pietrop commented Nov 30, 2016

Hey @veltman,
Gentle could be modified to generate a transcription when the text is not available. This already works in the REST API (see the curl example: if you don't pass the text file, it returns a transcription), but it doesn't work in the Python command-line tool. The code would need to be modified accordingly, which is something I'm looking into.
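For anyone wanting to script that REST call, a rough Python stdlib sketch; it assumes a Gentle server running locally on its default port (8765) and the `/transcriptions?async=false` endpoint from the curl example, and the helper names here are made up:

```python
import json
import urllib.request

def build_multipart(fields, boundary="gentle-sketch-boundary"):
    """Build a multipart/form-data body.
    fields: list of (name, filename, content_type, payload_bytes)."""
    chunks = []
    for name, filename, ctype, payload in fields:
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"; '
             f'filename="{filename}"\r\n'
             f"Content-Type: {ctype}\r\n\r\n").encode() + payload + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"

def align_with_gentle(audio_bytes, transcript=None,
                      url="http://localhost:8765/transcriptions?async=false"):
    """POST audio (and optionally a transcript) to a local Gentle server;
    without a transcript it falls back to plain recognition."""
    fields = [("audio", "audio.wav", "application/octet-stream", audio_bytes)]
    if transcript is not None:
        fields.append(("transcript", "words.txt", "text/plain",
                       transcript.encode("utf-8")))
    body, content_type = build_multipart(fields)
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```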

I also played around with PocketSphinx, packaging it as a node module: https://github.com/OpenNewsLabs/offline_speech_to_text. I extracted it from the videogrep Electron app.

@iankevinmcdonald

Considering that the effective maximum on social media is 30s, I think that expecting users to supply a transcript is absolutely fine.

It doesn't scale to generating complete videos from long-form shows, but I think that's acceptable - it's still a big benefit for most uses.

I'm a one-person band working on my own community/radio niche narrative history series, and I've used SRT files created with a free online manual transcription tool (called, originally enough, "Transcriber"). Though I'm about as unrepresentative a user as you can get.

@pietrop

pietrop commented Jan 10, 2017

For the SRT option I've written an SRT parser/composer that is also on npm.

It can be used to parse the SRT into word-accurate JSON (the original code to make it word-accurate is from the popcorn.js SRT parsing module, also on GitHub). With that it's possible to make a "hyper transcript" where the user can make word-accurate selections. I've done something similar in quickQuote (now refactored in Node in autoEdit), inspired by the Hyperaudio project.
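The naive version of that word-accurate step, spreading each cue's duration evenly across its words, can be sketched in a few lines of Python (`words_from_cue` is a hypothetical name; a forced aligner would give real per-word times instead):

```python
def words_from_cue(cue):
    """Spread a cue's duration evenly across its words, yielding
    word-level {"word", "start", "end"} entries (times in seconds).
    cue: {"start": float, "end": float, "text": str}."""
    words = cue["text"].split()
    if not words:
        return []
    step = (cue["end"] - cue["start"]) / len(words)
    return [
        {"word": w,
         "start": round(cue["start"] + i * step, 3),
         "end": round(cue["start"] + (i + 1) * step, 3)}
        for i, w in enumerate(words)
    ]
```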

@pettarin

pettarin commented Mar 6, 2017

Shameless plug, I hope you find it informative.

I maintain a Python/C forced aligner called aeneas (http://www.readbeyond.it/aeneas/ and https://github.com/readbeyond/aeneas/). Unlike Gentle and basically all other forced aligners out there, its approach is not based on speech recognition but on an older technique known as Dynamic Time Warping. It works decently well (and much faster) when aligning text at the sentence/phrase level, but it is worse at the word level. Its real time factor (the ratio between processing time and audio length) is between 0.005 and 0.02, depending on the parameters and the machine's CPU, since all the computationally heavy parts are written in C.
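For anyone unfamiliar with it, the core Dynamic Time Warping recurrence is small. A Python sketch on plain numeric sequences (aeneas applies the same idea to per-frame audio features such as MFCCs, not raw numbers like these):

```python
def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between
    two numeric sequences, with absolute difference as the local cost.
    Warping lets one sequence stretch or compress against the other."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]
```

The alignment itself falls out of backtracking the minimal-cost path through the `cost` table, which maps positions in one sequence to positions in the other.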

(In theory, one can port the core of aeneas to C, and from there to JS, via emscripten. It is a huge task, but it would enable decently fast alignment in JS land. Unfortunately, I have not had time/resources to do it.)

BTW, I maintain a list of forced aligners here: https://github.com/pettarin/forced-alignment-tools

@pietrop

pietrop commented Sep 28, 2018

In case anyone is still looking into this: it turns out @martymcguire did a write-up describing how he modified the BBC News Labs fork of Audiogram to work with output from the Gentle speech-to-text forced aligner; see his repo here.
