This repository has been archived by the owner on Oct 18, 2024. It is now read-only.

Subsync integration #644

Open
alashow opened this issue May 8, 2019 · 3 comments

Comments

@alashow

alashow commented May 8, 2019

I found this tool called subsync that "automagically synchronizes subtitles with the video."

It would be awesome if it was integrated into Sub-Zero. Or maybe just a way to have a callback after subtitle is downloaded (calling the specified script with the paths of the media file and downloaded sub).
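A rough sketch of what such a post-download callback could look like, assuming a user-configured command that receives the media path and the subtitle path as its last two arguments. The function name `run_post_download_hook`, the `hook_argv` parameter, and the example paths are all hypothetical and not part of Sub-Zero. It is written for Python 3 (`subprocess.run`); a Python 2.7 environment would need `subprocess.call` instead.

```python
import subprocess
import sys


def run_post_download_hook(hook_argv, media_path, subtitle_path, timeout_s=600):
    """Run the user's hook with the two paths appended; return its exit code.

    hook_argv is the configured command as a list, e.g. ["/path/to/script.sh"].
    """
    argv = list(hook_argv) + [media_path, subtitle_path]
    completed = subprocess.run(argv, timeout=timeout_s)
    return completed.returncode


# Stand-in hook for illustration only: a tiny Python one-liner that succeeds
# when it receives exactly two arguments (sys.argv is ['-c', media, sub]).
demo_hook = [sys.executable, "-c",
             "import sys; sys.exit(0 if len(sys.argv) == 3 else 1)"]
exit_code = run_post_download_hook(demo_hook, "/media/show.mkv", "/media/show.en.srt")
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.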

@pannal
Owner

pannal commented May 12, 2019

This has been discussed at length before, in Slack and elsewhere. It would be a great addition, but when I last checked, it wasn't accurate enough.

@smacke

smacke commented May 14, 2019

I have a suggestion for how subsync might be integrated with Sub-Zero such that it could still be useful despite not handling all cases with perfect accuracy (e.g. commercial breaks). This idea is based on my understanding of how Sub-Zero works, which could be wrong, so first I'll describe this understanding and then how the idea builds on that.

My understanding is that, given a video, Sub-Zero searches several different websites for subtitles that constitute a high-quality "match". Currently, this matching process is performed by attempting to match pieces of the subtitles filename against known metadata associated with the video. Or, at the very least, subtitles filenames that contain more information are considered likely to be higher-quality. (Perhaps the rationale is that whoever curated the subtitles initially took the time to carefully incorporate the additional metadata in the filename, so they probably also took the time to provide careful, high-quality subtitles.)

I believe that Sub-Zero's algorithm for choosing which subtitles to use could benefit from subsync's algorithm. This is because, at least conceptually, subsync's synchronization algorithm incorporates a content-based scoring algorithm as a subroutine. The video is transformed into a binary stream for speech and non-speech, respectively. The subtitles are likewise transformed. The idea of subsync is, if video 1's match up with subtitle 1's (resp. 0's), then we can be confident that we have a good synchronization between video and subtitles. We can score the match as (# matching bits minus # mismatched bits) or any other sensible algorithm.
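To make the scoring idea concrete, here is a minimal sketch, not subsync's actual code. It assumes the speech detector has already produced a 0/1 list for the video at a fixed sample rate; the helper names `subtitles_to_bits` and `content_score`, the 10 Hz rate, and the sample data are all made up for illustration.

```python
def subtitles_to_bits(cues, duration_s, rate_hz=10):
    """Turn (start_s, end_s) subtitle cues into a 0/1 list sampled at rate_hz."""
    n = int(round(duration_s * rate_hz))
    bits = [0] * n
    for start_s, end_s in cues:
        lo = max(0, int(round(start_s * rate_hz)))
        hi = min(n, int(round(end_s * rate_hz)))
        for i in range(lo, hi):
            bits[i] = 1  # a subtitle is on screen during this sample
    return bits


def content_score(video_bits, subtitle_bits):
    """Matching samples minus mismatched samples over the shared length."""
    pairs = list(zip(video_bits, subtitle_bits))
    matches = sum(1 for v, s in pairs if v == s)
    return matches - (len(pairs) - matches)


# Hypothetical data: a 1-second clip sampled at 10 Hz.
video_bits = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]  # assumed speech-detector output
subtitle_bits = subtitles_to_bits([(0.0, 0.2), (0.5, 0.7)], duration_s=1.0)
score = content_score(video_bits, subtitle_bits)  # 9 matches, 1 mismatch
```

A higher score means speech in the audio and subtitles on screen coincide more often.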

There is nothing specific to synchronization about this content-based scoring, so I think Sub-Zero could also utilize it as follows:

  1. From the various websites Sub-Zero searches, download several candidate subtitles files.
  2. Apply subsync's scoring algorithm to all of them.
  3. Select the subtitles with the highest content-based score.
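The three steps above could be sketched roughly as follows, assuming each candidate has already been downloaded and converted to a 0/1 speech stream. The names `pick_best` and `content_score` and the provider labels are hypothetical. The video stream is computed once and reused for every candidate.

```python
def content_score(video_bits, subtitle_bits):
    """Matching samples minus mismatched samples over the shared length."""
    pairs = list(zip(video_bits, subtitle_bits))
    matches = sum(1 for v, s in pairs if v == s)
    return matches - (len(pairs) - matches)


def pick_best(video_bits, candidates):
    """candidates maps a label to its 0/1 stream; returns (label, score)."""
    if not candidates:
        raise ValueError("no candidate subtitles to score")
    scores = {label: content_score(video_bits, bits)
              for label, bits in candidates.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label]


# Hypothetical streams for one video and three downloaded candidates.
video_bits = [1, 1, 0, 0, 1, 1, 0, 0]
candidates = {
    "provider_a": [1, 1, 0, 0, 1, 1, 0, 0],  # lines up with the speech
    "provider_b": [0, 0, 1, 1, 0, 0, 1, 1],  # inverted, wrong release
    "provider_c": [1, 1, 0, 0, 0, 0, 0, 0],  # second half missing
}
best = pick_best(video_bits, candidates)
```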

Of course, if the downloaded candidates are all low-quality (e.g. if every single candidate corresponds to a completely different video, or if every candidate excludes commercial breaks present in the video), then this will not work. I think this is unlikely, however -- if the movie or TV episode was correctly identified, I think at least one of the candidates should be a good match. Furthermore, we can easily apply subsync's offset correction to handle the common case of extra material present (or missing) at the beginning / end of the video.
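One way to picture the offset correction is a brute-force search over a small window of shifts, keeping the shift with the highest score. This is only an illustrative sketch under that assumption; it makes no claim about how subsync implements the search, and `score_at_offset`, `best_offset`, and the sample streams are hypothetical.

```python
def score_at_offset(video_bits, sub_bits, offset):
    """Score the subtitles shifted by `offset` samples (positive = later)."""
    score = 0
    for i, s in enumerate(sub_bits):
        j = i + offset
        if 0 <= j < len(video_bits):  # only count samples that overlap
            score += 1 if video_bits[j] == s else -1
    return score


def best_offset(video_bits, sub_bits, max_offset):
    """Try every shift in [-max_offset, max_offset]; return (offset, score)."""
    def key(offset):
        # Prefer the higher score, then the smaller shift on ties.
        return (score_at_offset(video_bits, sub_bits, offset), -abs(offset))
    offset = max(range(-max_offset, max_offset + 1), key=key)
    return offset, score_at_offset(video_bits, sub_bits, offset)


# Hypothetical case: the video has 3 extra samples of intro before the speech.
video_bits = [0, 0, 0, 1, 1, 0, 1, 0, 0, 0]
sub_bits = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
result = best_offset(video_bits, sub_bits, max_offset=3)
```

Because only overlapping samples are counted, large shifts have fewer samples to score, so the search window should stay small relative to the stream length.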

As with subsync, the bottleneck (in terms of processing speed) should be speech extraction performed on the video. The good thing is that this only needs to be done once, after which the content-based scoring can be done on several sets of subtitles.

I think this approach has an advantage of conceptual simplicity and is likely to work on most cases with much less effort compared with a 100% accurate synchronization algorithm. If this is something the Sub-Zero authors think could be useful, I would be happy to provide something in subsync's API for scoring multiple subtitle files and selecting the best one, either at the command line level or in Python (whichever would be easier). I'll defer to @pannal for whether this would make sense -- as the Sub-Zero maintainer I imagine you'll have the best sense for whether the marginal benefit (if any) from this approach is worth the investment on both our ends.

@pannal
Owner

pannal commented Jul 5, 2019

Sorry for the extremely late answer.

> My understanding is that, given a video, Sub-Zero searches several different websites for subtitles that constitute a high-quality "match". Currently, this matching process is performed by attempting to match pieces of the subtitles filename against known metadata associated with the video. Or, at the very least, subtitles filenames that contain more information are considered likely to be higher-quality. (Perhaps the rationale is that whoever curated the subtitles initially took the time to carefully incorporate the additional metadata in the filename, so they probably also took the time to provide careful, high-quality subtitles.)

Most of the information is gathered from the filename, that's correct. In Sub-Zero's case, further media information is gathered from Plex's file analysis.

> I believe that Sub-Zero's algorithm for choosing which subtitles to use could benefit from subsync's algorithm. This is because, at least conceptually, subsync's synchronization algorithm incorporates a content-based scoring algorithm as a subroutine. The video is transformed into a binary stream for speech and non-speech, respectively. The subtitles are likewise transformed. The idea of subsync is, if video 1's match up with subtitle 1's (resp. 0's), then we can be confident that we have a good synchronization between video and subtitles. We can score the match as (# matching bits minus # mismatched bits) or any other sensible algorithm.

This would make sense, as it would let SZ somewhat verify the subtitles it downloads. SZ runs as part of the Plex Server's Python 2.7 environment, though. Which static/dynamic binaries does subsync require to run?

I don't know how Plex will handle the deprecation of 2.7, but I suspect they will ditch the full Python interface rather than adapt to Python 3. So we might be looking at a hard dead end.

> As with subsync, the bottleneck (in terms of processing speed) should be speech extraction performed on the video. The good thing is that this only needs to be done once, after which the content-based scoring can be done on several sets of subtitles.

Yes. This would be a great addition to the analysis - I'd expect that to happen during the Plex Media Analysis, naturally. It might be done by SZ, but I'd see that as a core feature of the already-running analysis.

> I think this approach has an advantage of conceptual simplicity and is likely to work on most cases with much less effort compared with a 100% accurate synchronization algorithm. If this is something the Sub-Zero authors think could be useful, I would be happy to provide something in subsync's API for scoring multiple subtitle files and selecting the best one, either at the command line level or in Python (whichever would be easier). I'll defer to @pannal for whether this would make sense -- as the Sub-Zero maintainer I imagine you'll have the best sense for whether the marginal benefit (if any) from this approach is worth the investment on both our ends.

We'd be looking at HDTV vs. HDTV (different sources) and WEB/WEB-DL vs WEBRip. It would be a great addition. There's a looming issue left, though: Subtitle providers don't like us. We tend to spam their APIs and drive their HTTP costs through the roof, so downloading multiple subtitles to match against one another isn't really feasible.

The beauty of SZ (and subliminal for that matter) is that we don't need to actually download any subtitle until it has been matched.

There's no reason not to do that when the subtitles have already been fetched, though (comparing extracted embedded subtitles against ones coming from a subtitle provider).
