meaning of media element currentTime, playback position (etc) with hardware media decoder pipeline #3041
cc @whatwg/media
I'm rather confused by your use of terms. Firefox's currentTime is the time of the audio playing now, as reported by the audio card on what can currently be heard; latency is taken into account. Whether video is hardware-decoded or not is totally irrelevant.

If there's no audio track, an audio clock is simulated. In that case, decoded pictures are sent to the compositor when the start time of the picture matches the clock. A frame is displayed until there's a new one ready to be displayed at the right time. Any latency between the time a picture is sent to the graphics card and when it's actually displayed isn't taken into account (there's typically no way to retrieve that information anyway).

In any case, the decoding pipeline is of no relevance; it's always the time at which a sample is composited (either audio or video) that matters. As such, you can't receive the "ended" event until the video has actually finished playing, the last audio frames have been heard, and there's now silence. I'm fairly sure it's the same behaviour in all other user agents.
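As a rough JavaScript sketch of the model described above (all names here are hypothetical, not real Firefox internals), the presentation loop amounts to something like:

```js
// Hypothetical sketch of the audio-master-clock model described above:
// video follows the audio clock. Called once per display refresh.
function onVsync(audioClock, frameQueue, compositor) {
  // Assumed: audioClock.now() returns the media time of the audio that is
  // audible *right now*, i.e. with output-device latency already subtracted.
  // With no audio track, audioClock would be a simulated clock instead.
  const now = audioClock.now();

  // The current frame stays on screen until the next frame's start time
  // is reached; then the next frame is handed to the compositor.
  while (frameQueue.length > 0 && frameQueue[0].startTime <= now) {
    compositor.display(frameQueue.shift());
  }
}
```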
@jyavenard Thanks for the quick reply.
Sorry if I caused confusion.
Yes, but it all contributes to the delay/latency between the media time measured in the UA and the media time of the video the user is seeing and the audio they're hearing.
Thank you for that useful piece of information. What does "playing now" mean? Do you mean what the user should actually be hearing?
Please forgive me if this is a stupid question but how is audio/video sync achieved here if the delay in the video pipeline is longer than the delay in the audio?
It's not remotely obvious from the spec that currentTime and the current playback position relate to the content that is actually being heard/seen by the user.
yes.
Those are unheard of (pun intended)... video typically has no delay. There may be some exceptions, like video outputs that re-encode and send to a remote device, but there's typically no API to determine what the latency on a video output would be. A/V sync would be broken if the audio were taking a different playback path.
But currentTime isn't the current playback position; it is the official playback position, which is an approximation of the current playback position designed to be stable within the current event loop (https://html.spec.whatwg.org/multipage/media.html#official-playback-position). So it's not meant to be 100% accurate.

If ended were to be fired before the user had seen or heard the last audio/video, I would consider it a bug. But of course, that is subject to opinion, and I'm not sure it can be described in a better way in the text.
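A minimal JavaScript illustration of that stability property, assuming a UA that implements the official playback position as the spec describes:

```js
// currentTime reports the official playback position, which is kept
// stable while scripts are running: two reads in the same task should
// return the same value even though playback advances underneath.
const video = document.querySelector('video');

function sample() {
  const a = video.currentTime;
  for (let i = 0; i < 1e6; i++) {} // arbitrary synchronous work
  const b = video.currentTime;
  console.log(a === b); // true in a spec-conforming UA: stable, not "live"
}

video.addEventListener('playing', () => setInterval(sample, 1000));
```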
Checking with people who understand this better than me: in TV sets there would typically be a delay of between 20 and 160 ms between the output of the video decoder and the video reaching the display. The size of the delay depends on user preferences, the content type and other factors. The TV set will delay the audio to match this video delay when rendering the audio on its speakers or headphones. We believe media devices without an integrated display will also have (perhaps slightly smaller?) delays between the output of the media decoder(s) and outputting the video via something like HDMI. Note that the HDMI spec allows a display to report a video latency back to a media device. It's also possible for a display to report a different audio latency to the media device, but apparently nobody does.
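A worked example of that compensation, with made-up numbers drawn from the 20-160 ms range above:

```js
// Illustrative numbers only; nothing here is a real API.
const videoPipelineDelayMs = 120; // decoder output -> pixels on the panel
const audioOutputDelayMs = 20;    // decoder output -> sound from the speakers

// To preserve lip sync, the TV delays the audio by the difference:
const extraAudioDelayMs = videoPipelineDelayMs - audioOutputDelayMs; // 100

// If the playback position is measured at the decoder output, the media
// time the user is actually seeing/hearing lags it by the video delay:
const decoderPositionS = 10.0;
const onScreenPositionS = decoderPositionS - videoPipelineDelayMs / 1000; // 9.88
console.log({ extraAudioDelayMs, onScreenPositionS });
```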
In the context of a UA integrated into a media device, I can see at least the following reasonable interpretations of "current playback position" as well as "what the user is seeing and hearing":
I can also see other possible interpretations of "current playback position" which others might find reasonable but I would not. Many people integrating a UA onto a media device will choose whatever interpretation is easiest for them to implement unless something is clearly and unambiguously specified.
If "current playback position" is open to interpretations that may differ by up to 160ms then the timing of when ended is generated may be different by up to 160ms between different devices - potentially between different devices using the same UA. |
In the HTML5 media element, the definitions of currentTime, official playback position, current playback position, (etc) are unclear about which point in the media decoder pipeline is represented by these values. In the case of devices with a hardware media decoder, the media decoder pipeline may consist of a buffer (perhaps in addition to any MSE SourceBuffer), the media decoder, video enhancement logic, video/graphics composition, more video enhancement logic and in some cases an HDMI interface to a display. At any moment in time, there can be a significant difference between the position on the media timeline of a video frame entering the pipeline at the video decoder buffer and the position on the media timeline of the video being output on the display.

If some user agents integrated into some media devices take currentTime (etc) as the media timeline value for the input to the media pipeline, and other user agents in other media devices take currentTime (etc) as the opposite extreme (e.g. an estimate of what an HDMI-connected display is showing), then this will result in a very inconsistent experience.
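To make the two extremes concrete, here is a hypothetical snapshot of a single instant; the names and numbers are illustrative, not a real API:

```js
// Hypothetical snapshot of one instant in a hardware pipeline.
const snapshot = {
  decoderInputPositionS: 10.50,  // media time of the frame entering the decoder buffer
  displayOutputPositionS: 10.34, // media time of the frame the user is seeing
};

// Interpretation A: currentTime reflects the input to the pipeline.
const currentTimeA = snapshot.decoderInputPositionS;  // 10.50

// Interpretation B: currentTime estimates what the display is showing.
const currentTimeB = snapshot.displayOutputPositionS; // 10.34

// Two UAs choosing different interpretations disagree by 160 ms here.
console.log(((currentTimeA - currentTimeB) * 1000).toFixed(0) + ' ms');
```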
Closely related to this is the definition of "ended playback" (https://html.spec.whatwg.org/multipage/media.html#ended-playback) and the corresponding algorithm (https://html.spec.whatwg.org/multipage/media.html#reaches-the-end). Both of these are defined in terms of the current playback position reaching the end of the media resource. If the current playback position represents the input to the media pipeline then "ended playback" and the algorithm will apply well before the last video and audio have been output.
One place where this can cause problems is with video ads. These are typically played to the end before playing the next ad (or the content). Pages will typically wait for an ended event and then play the next item. If the current playback position represents the input to the media pipeline then it's possible that triggering off the ended event might even truncate the end of the ad. If the current playback position represents the output from the media pipeline then it's possible there could be a significant gap before the next ad or the content starts playing.
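For reference, the usual ended-driven playlist advance looks like this (the URLs are illustrative); if ended reflects the pipeline input, the tail of the current ad may still be in flight when the next src is set:

```js
// Advance through a playlist of ads and content on each 'ended' event.
const player = document.querySelector('video');
const playlist = ['ad1.mp4', 'ad2.mp4', 'content.mp4']; // illustrative URLs
let index = 0;

player.addEventListener('ended', () => {
  index += 1;
  if (index < playlist.length) {
    player.src = playlist[index];
    player.play();
  }
});

player.src = playlist[0];
player.play();
```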
Where in a media decoder pipeline are currentTime, current playback position (etc) supposed to be measured?
If it's intentionally ambiguous, is there a reason why a spec addressing the integration of user agents onto media devices could not be more precise?