RTS video channels have subtitles but ytdl does not support them #21438
Comments
hey @goggle, the only reason i can think of why they wouldn't pull your fix is due to the fact you may have not maintained the backwards compatibility with the original extractor per your own notes in #14725? is your #14725 fix still currently working for rts subtitles? i'm thinking of just cloning the repo locally and pulling your fix to run a separate copy for rts subtitles. it will also allow me to learn more about the extractors so i can fix some issues with subtitles present on other sites that ytdl doesn't see but are there on playback. not being familiar with streaming video, it appears the key is to
ie. ytdl can be used to both display subs (--list-subs) as in step 2 above and actually do the download (--write-srts, etc) as in step 3, meaning you have to identify the underlying subtitle file so that you can later download it. that is to say they are two different atomic ops within ytdl. does that sound correct? for me the difficult part is finding the subtitle metadata file / delivery mechanism since i'm not a streaming expert. once that part is discovered (step one above), extracting the data should be pretty straightforward (i think?). is it fair to say that many sites use their own proprietary methods to provide this subtitle metadata, or is it fairly consistent? i'm trying to figure out the best way to approach this going forward b/c there are other sites where ytdl fails to find the subs and i'd just as soon fix them myself vs waiting for someone else to do it and/or, if someone else fixes them, waiting for the fixes to be pulled back into main. thanks for any assistance you can provide |
Hey @boulderob Feel free to use the code from #14725. I haven't tested it recently, but I guess it should still work. Extracting subtitles is a fairly easy task:
That's it. The rest does To improve the SRG SSR extractor, IMO the following steps are needed:
|
hey @goggle thx for the quick reply. what i mean is how do you specifically go about finding the json file names used in the regex that define the subtitles? unless i missed stg, i didn't see them embedded in the rts web pages and i couldn't find them by inspecting the network output in the firefox dev tools last time i tried to view the videos to troubleshoot this on my own. you already identified them and supplied the regexs, so it's easy to just point to them and say "they're the json files". but identifying what those urls are in the first place is the hard part. once you know what they are everything becomes easier :) obviously the original developer who created the original extractor for RTS wasn't smart enough to find those subtitle json files either which is why they didn't include them in the extractor. you were. so my question is how do you go about determining where the subtitle info (json or other metadata file) is in the first place? that way i can leverage this work to other sites that lack the same functionality when i need to. this part about how you discover what the subtitle metadata file actually is is the missing link / mystery. also, once you know what the json url is, you obviously just pass it to the stock ytdl thx |
cont'd : for example on a supported german tv site i'm running into the same problem. streamed videos have subs but ytdl reports none. if you go to the actual extractor for ard (the german site): https://github.com/ytdl-org/youtube-dl/blob/master/youtube_dl/extractor/ard.py it tries to look for subtitles based on a var here's an example video link
|
Regarding the
To figure out how things exactly work, you need to try to get some understanding of the SRG SSR API. You are on the right track when you use the Firefox or Chrome developer tools in the browser! Let's have a look at this video: https://www.rts.ch/play/tv/couleurs-dete/video/couleurs-dete?id=10550033 |
ah.. super thx. i can see the json file. more importantly, knowing this, i don't necessarily even need ytdl to access the vtt file but can view the vtt file directly in the browser or manually download it once i know the path. awesome! part of the problem was that i was looking for the information after i clicked the video to actually stream, thinking that the subtitle info wouldn't be available in the browser dev network console until i did so. in fact, the link to the metadata file is available on the initial page load.
ok so on the initial page load, tons of resources are downloaded. unless of course the metadata file is embedded and easily available on the initial page (which today is probably seldom the case), should i just assume it's getting loaded via json / xhr, so that i can decrease the amount of resources i have to check in the dev console (grep on json / xhr requests) on different sites to find the subtitle metadata file, or are a multitude of schemes possible? i guess what i'm asking is how much digging you normally have to do to find the metadata info in the dev console and whether it's trial and error clicking on things or you can narrow the search down quicker via some means.. perhaps even a grep search on the resources in the dev console.
update: since the video id aka vid is important, i was definitely able to narrow down the dev console resources by grepping on the vid, which dramatically decreases what i have to inspect in the dev console. i guess after that it's going to be trial and error clicking on those links until you pull up the metadata you're looking for, correct?
one final js question for you. since the json metadata is already retrieved via xhr on initial page load, the json data must already exist and be loaded into an internal js data structure. is there any way to inspect that data structure via the dev console (ie inspect the live data on the current page in browser memory) vs just reloading the json file in a separate browser tab to see what's in it? if you have a link showing how to best do this that's fine (preferably firefox vs chrome dev tools). regardless, thx. you helped me have a much better understanding of how to go about this. i'm going to see if i can decipher the ard stream now |
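Once a candidate metadata response has been saved from the network tab, the trial-and-error clicking described above can be partly replaced by a recursive search for values that look like subtitle URLs. The following is only a minimal sketch: the function name `find_subtitle_urls`, the key names, and the sample document are made up for illustration and do not reflect the actual SRG SSR API or anything in youtube-dl.

```python
import json


def find_subtitle_urls(node, extensions=(".vtt", ".ttml", ".srt")):
    """Recursively collect string values that look like subtitle file URLs."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_subtitle_urls(value, extensions))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_subtitle_urls(item, extensions))
    elif isinstance(node, str):
        # Ignore any query string when checking the file extension.
        path = node.split("?", 1)[0].lower()
        if node.startswith(("http://", "https://")) and path.endswith(extensions):
            found.append(node)
    return found


# Hypothetical sample: the keys and URLs are invented, not a real API response.
sample = json.loads("""
{
  "items": [
    {
      "id": "0000000",
      "subtitles": [
        {"lang": "fr", "url": "https://example.invalid/subs/0000000.vtt?x=1"}
      ],
      "streams": [
        {"url": "https://example.invalid/video/master.m3u8"}
      ]
    }
  ]
}
""")

subtitle_urls = find_subtitle_urls(sample)
print(subtitle_urls)
```

Because the search does not depend on key names, the same helper could be pointed at JSON saved from a different site, although sites that deliver subtitles through a manifest rather than a direct file link would not be caught by an extension check like this.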
ok thx for the tip on inspecting via the response tab. it's almost the same thing as simply retrieving the json file itself in a tab. i need to get much better with exploring the live js stack as it exists in the browser memory / dev console so i can understand fully what the js is doing in real time. i've searched in the past but didn't find much good info on the web. fyi, i was indeed able to find the vtt file for the ard stream as well! it does indeed look like this will be highly variable from one site to another and is likely to always require a little clicking and exploring as you suggest but my guess is you get a feel for it as this only took me a few minutes to find what i was looking for. i think they may have already tried to support this in ard but maybe the underlying url regex changed and so it isn't picking it up anymore. based upon your relative lack of success pushing your rts changes back to public master, i'm not sure trying to fork and push is the way i want to go. i might just clone public master and pull to keep it up to date. then just make my own changes locally and merge with master. if i want to pull your forked / downstream rts fix on a onetime basis into this local repo, what's the best way to do that? thx as usual for the help |
I'm not a
Change inside that repository and add my fork as a remote:
Pull the
Note that this will lead to conflicts (since the SRG SSR information extractor got some updates meanwhile). You need to resolve these conflicts manually by editing Alternatively, you can directly use my forked repository:
Note that by doing it this way you have a very old |
i was pretty much planning on something similar to the first method with a merge and then i'll just continue with the upstream repo pulls to keep it up to date. if i just pull yours i'm going to be out of date with the upstream. last thing for you is whether you use / recommend a python debugger in standalone mode or combined with an ide / editor for any of this |
I've never really used a Python debugger or sophisticated IDE to work with Python... I just use an editor (vim or VS Code). To test if things work when developing on
I usually use the caveman's method (adding some But I can recommend using |
ok great. @goggle thx for the assistance today. i really appreciate it. i can finally start to work with certain subtitles that i haven't been able to for quite some time. cheers! |
@goggle i've been pretty successful getting the subtitles with your help but i now have an instance where ytdl is actually not finding the RTS video itself for download on a page. i've narrowed down the link to just the video itself that was embedded in a parent page but ytdl still can't recognize the link. i'm trying to find a way to get the final streamed link and then just manually download it outside of ytdl with a linux utility of some kind, perhaps ffmpeg or stg else. would you have any insights on how best to go about doing that for this particular link: https://player.rts.ch/p/rts/embed?urn=urn:rts:video:10468814 as i said, ytdl doesn't recognize the url so it can't process it for video dl, so i'm wondering how to do it manually from the cmd line. ie nothing i see in the firefox dev console is leading me to anything (useful video links) that i can download using curl, etc, or perhaps i'm using the wrong utility for a stream. thx |
The subtitles of this video seem to be hard-coded into the video, so there is no way to extract the subtitles from it.
|
i know the subs are hard-coded. i don't want to extract them, i just want to download the video itself, but ytdl wasn't doing that based on the links i was providing it. what you recommended does work but you did not supply an actual web url to ytdl. for instance, this was the original rts site url i had https://pages.rts.ch/docs/10446655-harry-dean-stanton---partly-fiction.html ytdl doesn't find any formats with that link so it can't download it. so i narrowed it down to: https://player.rts.ch/p/rts/embed?urn=urn:rts:video:10468814 thinking ytdl might be able to handle this but it couldn't do that either. basically you just took the vid id and plugged it into a known format ie BUT.. for grins and other sites potentially not handled by ytdl.. which js / xhr is actually creating the full video url in the firefox dev console for my example above? the full url has to exist there for the ytdl code to find the actual video for download. AND... ytdl is just a wrapper for something else (probably ffmpeg???) to handle the download and reassembly of vid stream packets .. so if i know how to obtain the actual vid download url from the step above, can't i just skip ytdl altogether and run that command directly to get the video file? if so what is that command? i guess where i'm at is that sometimes a one-off download is going to make more sense than modifying or adding a new feature to the ytdl code for 1 or 2 videos. it's kind of like knowing how to fish directly when i want vs always having to go to the ytdl fish market to get my video if that makes sense. thx |
ok i can see that it definitely is using ffmpeg and i can even see the link being used for this file when i run ytdl from the command line :) so the key is finding in the firefox dev console the vid url / file (for a given resolution) for any streamed video and just plugging it into the right incantation of ffmpeg on the command line |
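As a rough illustration of the "right incantation" idea, a stream copy with ffmpeg needs only an input URL, `-c copy`, and an output file. The sketch below builds that command as an argument list; the helper name `build_ffmpeg_copy_command` and the URL are placeholders, and this is not necessarily the exact command line that youtube-dl itself generates.

```python
import shlex


def build_ffmpeg_copy_command(stream_url, output_path):
    """Return the argv list for copying a stream into a local file without re-encoding."""
    return [
        "ffmpeg",
        "-i", stream_url,  # input: the manifest or media URL found in the dev tools
        "-c", "copy",      # copy the audio/video streams as-is, no re-encoding
        output_path,
    ]


command = build_ffmpeg_copy_command(
    "https://example.invalid/path/master.m3u8",  # placeholder, not a real stream
    "video.mp4",
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

To actually run it, the list can be passed to `subprocess.run(command, check=True)`. Depending on the stream and the ffmpeg version, extra options may be needed, so treat this as a starting point only.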
The URLs that you mention are simply not supported by the SRG SSR extractor in I cannot answer your question about other sites. It really differs from site to site, which is also the reason why so many information extractors exist in Yes, |
hey @goggle, i think we're repeating the same thing here. obviously someone else figured out the rts "master url" in the past for downloading vids and all anyone needs to do now is supply new regex's to |
@boulderob Did you have any success in what you were trying to achieve? |
hey @goggle, it's been awhile and i'm sorry to say i didn't get around to doing this in terms of making any necessary changes for a pull request or anything. i have been able to write some basic scripts that allow me to pull subs when i need them which is quicker than trying to figure out the requirements for a full ytdl fix. two things though
i can find the vtt subs for this video. it's via 2 degrees of separation though because they are segmented and stream from an m3u8 file! based on research it looks like the subtitles for videos on this site all behave the same way. with some ffmpeg foo, i can remux the segments from the m3u8 file into a single vtt file that looks legit. the problem is that, unlike every other downloaded video where the mp4 file and the subtitle vtt file are separate and vlc is able to recognize the vtt file if it's named the same, vlc is recognizing my muxed vtt file but it never displays any actual subtitles in vlc mac! all my other downloaded mp4s work great with separate vtt files though, so it's only these muxed vtt files that don't work! i've spent a lot of time on this and am getting nowhere. i am not finding any other muxed vtt / subtitle files in the codebase of ytdl*. they appear to be mostly single media files that you just download and everything works. note that if i use ffmpeg to burn the muxed vtt file into the mp4 the subtitles work! however, i don't want to do that, and if i get around to making this work and want to check in a fix, i have to have a solution that works natively with ytdl. my guess is it has to do with the muxed format. if you're busy no worries. i just thought you might have some insights on this. thx |
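For reference, one way to join already-downloaded WebVTT segments without ffmpeg is to keep a single `WEBVTT` header, drop the per-segment header blocks (which in HLS typically carry an `X-TIMESTAMP-MAP` line), and skip cues that are repeated across segment boundaries. This is only a sketch under assumptions: the function name `merge_webvtt_segments` is made up, the sample segments are invented, and it assumes the cue timestamps are already on one common timeline, so it ignores any offset implied by `X-TIMESTAMP-MAP`. Whether the per-segment headers have anything to do with the VLC behaviour described above is not known.

```python
import re


def merge_webvtt_segments(segments):
    """Merge WebVTT segment texts into one document with a single header."""
    merged_blocks = []
    seen = set()
    for text in segments:
        # Normalise line endings and strip a leading byte order mark.
        normalised = text.replace("\r\n", "\n").replace("\r", "\n").strip("\ufeff\n ")
        # Blocks (header, cues, notes) are separated by blank lines.
        for block in re.split(r"\n\s*\n", normalised):
            block = block.strip()
            if not block:
                continue
            # Drop each segment's header block (WEBVTT line plus X-TIMESTAMP-MAP etc.).
            if block.startswith("WEBVTT"):
                continue
            # A cue spanning a segment boundary may appear in both segments.
            if block in seen:
                continue
            seen.add(block)
            merged_blocks.append(block)
    return "WEBVTT\n\n" + "\n\n".join(merged_blocks) + "\n"


# Invented sample segments for illustration.
segment_1 = (
    "WEBVTT\n"
    "X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000\n\n"
    "00:00:01.000 --> 00:00:03.000\nBonjour\n\n"
    "00:00:09.000 --> 00:00:11.000\ntout le monde\n"
)
segment_2 = (
    "WEBVTT\n"
    "X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000\n\n"
    "00:00:09.000 --> 00:00:11.000\ntout le monde\n\n"
    "00:00:12.000 --> 00:00:14.000\nMerci\n"
)

merged = merge_webvtt_segments([segment_1, segment_2])
print(merged)
```

Exact-match deduplication is deliberately conservative: two cues with the same text but different timings are both kept.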
@boulderob Can you share your basic script to download subs? If possible I would like to download the subs from the daily news, like https://www.rts.ch/play/tv/popupvideoplayer?id=11287161 |
@boulderob: I have the same ask, could you please share the method of downloading the subs from RTS.CH, thanks in advance |
Authored by fstirlitz
Modified from: ytdl-org/youtube-dl#6144
Closes: #73
Fixes: ytdl-org/youtube-dl#6106 ytdl-org/youtube-dl#14977 ytdl-org/youtube-dl#21438 ytdl-org/youtube-dl#23609 ytdl-org/youtube-dl#28132
Might also fix (untested): ytdl-org/youtube-dl#15424 ytdl-org/youtube-dl#18267 ytdl-org/youtube-dl#23899 ytdl-org/youtube-dl#24375 ytdl-org/youtube-dl#24595 ytdl-org/youtube-dl#27899
Related: ytdl-org/youtube-dl#22379 ytdl-org/youtube-dl#24517 ytdl-org/youtube-dl#24886 ytdl-org/youtube-dl#27215
Notes:
* The functions `extractor.common._extract_..._formats` are still kept for compatibility
* Only some extractors have currently been moved to using `_extract_..._formats_and_subtitles`
* Direct subtitle manifests (without a master) are not supported and are wrongly identified as containing video formats
* AES support is untested
* The fragmented TTML subtitles extracted from DASH/ISM are valid, but are unsupported by `ffmpeg` and most video players
    * Their XML fragments can be dumped using `ffmpeg -i in.mp4 -f data -map 0 -c copy out.ttml`. Once the unnecessary headers are stripped out of this, it becomes a valid self-contained ttml file
    * The ttml subs downloaded from DASH manifests can also be directly opened with <https://github.com/SubtitleEdit>
* Fragmented WebVTT files extracted from DASH/ISM are also unsupported by most tools
    * Unlike the ttml files, the XML fragments of these cannot be dumped using `ffmpeg`
    * The webvtt subs extracted from DASH can be parsed by <https://github.com/gpac/gpac>
    * But the validity of those extracted from ISM is untested
Checklist
Description
the rts.ch website has numerous video show channels and most of them have support for subtitles. for example if you play either of these two videos (from two different program channels on rts) in the browser you can see that you can enable and disable subtitles on the fly and that they are not hard coded / embedded in the video itself.
https://www.rts.ch/play/tv/passe-moi-les-jumelles/video/teddy-des-papillons-dans-les-yeux--bernard-sauveur-de-greniers?id=9524098
https://www.rts.ch/play/tv/temps-present/video/50-ans-les-romands-dans-loeil-de-temps-present-45-il-etait-une-fois-les-migrants-italiens?id=10421935
the problem is that when i try to use youtube-dl to get the subtitle files it can't find them. is there a way to update the code to locate and retrieve the subtitle files b/c they are indeed there?
i'm using the latest version
here's the output from the command to list the subs for one of the videos above:
subtitle support has been lacking for rts.ch for at least 2 years. i'm just finally submitting a report