sed script (Linux, Mac, maybe Windows) to fix missing capitalization of proper nouns and punctuation trouble #957
mrfragger
started this conversation in
Show and tell
When whisper stops producing punctuation, the subtitles also go all lowercase. This script searches for non-capitalized proper nouns and capitalizes them.
200 first and last names
Continents
Oceans
Countries
U.S. States
U.S. Cities
World Capitals
World Popular Cities
Religion
7 Planets (not earth)
Days of Week
Months (not may)
mr mrs mr. mrs. dr. phd cia fbi
Capitalizes the name that follows Mr., Mrs., or Dr.
i i've i'm i'll i'd (lowercase versions of these are horrendous to read)
A few Historical Figures
Languages and Nationalities
Holidays
Removes immediately repeated words (many many, that that)
Doesn't capitalize a proper noun that is split across two subtitle lines
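To give a feel for the kind of rules listed above, here is a minimal sketch using GNU sed. It is not taken from the script in the zip: the function name `fix_caps` and the tiny word lists are made up for illustration, and the real script covers far more names and places. It assumes GNU sed, since `\b`, `\u`, and `\U` are GNU extensions.

```shell
#!/bin/sh
# Illustrative sketch only: these rules and word lists are assumptions,
# not the contents of the author's script. Requires GNU sed.

# Prefer gsed (the usual name for GNU sed on macOS) when it is installed.
SED=sed
if command -v gsed >/dev/null 2>&1; then SED=gsed; fi

# fix_caps (hypothetical name): reads subtitle text on stdin, writes fixed text.
# Rules, in order:
#   1. collapse an immediately repeated lowercase word ("that that" -> "that")
#   2. standalone i, including i've / i'm / i'll / i'd -> I
#   3. days of the week -> first letter uppercase
#   4. mr / mrs / dr (with or without a period) -> "Mr. " etc., and
#      uppercase the first letter of the name that follows
#   5. cia / fbi -> all caps
fix_caps() {
  "$SED" -E \
    -e 's/\b([a-z]+) \1\b/\1/g' \
    -e 's/\bi\b/I/g' \
    -e 's/\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b/\u\1/g' \
    -e 's/\b(mr|mrs|dr)\.? ([a-z])/\u\1. \u\2/g' \
    -e 's/\b(cia|fbi)\b/\U\1/g'
}

# Demo: prints "I think Mr. Smith said that Monday is fine"
printf '%s\n' "i think mr smith said that that monday is fine" | fix_caps
```

Because sed works on one line at a time by default, a rule for a multi-word name such as `new york` would only match when both words sit on the same subtitle line, which is the limitation noted above.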
Just run the script in a folder of vtt subtitles
If you have srt files, just rename them all to the vtt extension
Keep a copy of your original subtitles, as the script will overwrite them
Tested on Linux and Mac (requires GNU sed, i.e. gsed, not BSD sed)
Includes a PDF showing all the proper nouns it searches for
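Since the script overwrites files in place, it may help to back up the folder first. Below is a small sketch of one way to do that and to rename srt files to the vtt extension. The function name `backup_and_rename` and the `originals` subfolder are my own assumptions and are not part of the zip.

```shell
#!/bin/sh
# Hypothetical helper, not part of the author's zip: copies every .srt/.vtt
# file into an "originals" subfolder, then renames .srt files to .vtt.
# Renaming only changes the extension; it does not convert the file format.
backup_and_rename() {
  dir=$1
  mkdir -p "$dir/originals"

  # Back up subtitles before anything overwrites them.
  for f in "$dir"/*.srt "$dir"/*.vtt; do
    [ -e "$f" ] || continue          # skip unmatched glob patterns
    cp -p "$f" "$dir/originals/"
  done

  # Rename .srt to .vtt, skipping any file whose .vtt name is already taken.
  for f in "$dir"/*.srt; do
    [ -e "$f" ] || continue
    target="${f%.srt}.vtt"
    if [ -e "$target" ]; then
      echo "skipping $f: $target already exists" >&2
      continue
    fi
    mv "$f" "$target"
  done
}
```

Calling `backup_and_rename .` inside the subtitle folder before running the sed script leaves untouched copies in `./originals`.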
There is a good thread, 194, on this issue. It's also referred to as hallucinations in some other threads.
What prompted me to write the script was a 16h 30min audiobook where whisper lowercased everything for the last 6 hours or so. One option was to run it again, which takes a little over 5 1/2 hours at 3x realtime on the medium.en model and would most likely give the same result. Another was to perhaps use large (aka large-v2), though I've read that large-v1 doesn't do this as often, but that's around 1x realtime, so 16 hours. This script fixes it in a few seconds, and then I can run aspell -c audiobook.vtt to fix any remaining issues. Initially I was just using aspell, but it took 20 to 30 minutes even with replace all because there were so many issues, so I don't see that as an option when processing hundreds of audiobooks.
Updated June 1, 2023: added more nationalities, October, and a few other fixes
propernouns-sed2023Jun01.zip