
meaning of media element currentTime, playback position (etc) with hardware media decoder pipeline #3041

jpiesing opened this issue Sep 14, 2017 · 5 comments


@jpiesing

In the HTML5 media element, the definitions of currentTime, official playback position, current playback position (etc.) are unclear about which point in the media decoder pipeline this value represents. On devices with a hardware media decoder, the pipeline may consist of a buffer (perhaps in addition to any MSE SourceBuffer), the media decoder, video enhancement logic, video/graphics composition, more video enhancement logic and, in some cases, an HDMI interface to a display. At any moment there can be a significant difference between the position on the media timeline of a video frame entering the pipeline at the video decoder buffer and the position on the media timeline of the video being output on the display. If some user agents integrated into media devices take currentTime (etc.) as the media timeline value at the input to the media pipeline, while other user agents in other media devices take the opposite extreme (e.g. an estimate of what an HDMI-connected display is showing), the result will be a very inconsistent experience.
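As a minimal sketch of the kind of page logic this ambiguity affects (the element IDs and cue times below are purely illustrative), consider a page that renders its own overlay keyed off currentTime:

```ts
// If currentTime reflects the input to a hardware pipeline rather than what
// is actually on screen, an overlay driven by it will lead the picture by
// roughly the pipeline delay.
const video = document.querySelector('video')!;
const caption = document.getElementById('caption')!; // hypothetical overlay element

// Illustrative cue times; a real page would get these from a subtitle file.
const cues = [
  { start: 5.0, end: 8.0, text: 'Hello' },
  { start: 8.0, end: 11.0, text: 'World' },
];

function renderOverlay(): void {
  const t = video.currentTime;
  const cue = cues.find(c => t >= c.start && t < c.end);
  caption.textContent = cue ? cue.text : '';
  requestAnimationFrame(renderOverlay);
}

requestAnimationFrame(renderOverlay);
```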

Closely related to this is the definition of "ended playback" (https://html.spec.whatwg.org/multipage/media.html#ended-playback) and the corresponding algorithm (https://html.spec.whatwg.org/multipage/media.html#reaches-the-end). Both are defined in terms of the current playback position reaching the end of the media resource. If the current playback position represents the input to the media pipeline, then "ended playback" and the algorithm will apply well before the last video and audio have been output.

One place where this can cause problems is with video ads. These are typically played to the end before the next ad (or the content) starts. Pages will typically wait for an ended event and then play the next content. If the current playback position represents the input to the media pipeline, then it's possible that triggering off the ended event might even truncate the end of the ad. If the current playback position represents the output from the media pipeline, then it's possible there could be a significant gap before the next content starts.
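A minimal sketch of that ad-chaining pattern (the playlist URLs are illustrative, not from any particular ad SDK):

```ts
// The page simply waits for 'ended' before loading the next item. Whether
// 'ended' corresponds to the input or the output of a hardware pipeline
// decides whether this clips the tail of the ad or leaves a gap.
const playlist = ['ad1.mp4', 'ad2.mp4', 'content.mp4']; // illustrative URLs
let index = 0;
const video = document.querySelector('video')!;

function playNext(): void {
  if (index >= playlist.length) return;
  video.src = playlist[index++];
  void video.play();
}

video.addEventListener('ended', playNext);
playNext();
```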

Where in a media decoder pipeline are currentTime, current playback position (etc) supposed to be measured?

If it's intentionally ambiguous, is there a reason why a spec addressing the integration of user agents onto media devices could not be more precise?

@annevk
Member

annevk commented Sep 14, 2017

cc @whatwg/media

@jyavenard
Copy link
Member

I'm rather confused by your use of terms:
To me, the decoding pipeline is about decoding, that is, converting compressed data (video or audio) into something uncompressed (either a PCM audio sample or an RGB/YUV picture frame).
What your machine is connected to, the type of screen or speakers, isn't part of the decoding pipeline.

In Firefox, currentTime is the time of the audio playing now, as reported by the audio card for what can currently be heard; latency is taken into account. Whether the video is hardware decoded or not is totally irrelevant.
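(This isn't the media element's internal clock, but the Web Audio API exposes the same notion of output latency to pages; a tiny sketch of reading it, where the browser supports these attributes:)

```ts
// AudioContext exposes latency estimates: baseLatency for the UA's own
// processing, outputLatency for the gap between handing audio to the device
// and it actually being heard. Support varies by browser.
const ctx = new AudioContext();
console.log('base latency (s):', ctx.baseLatency);
console.log('output latency (s):', ctx.outputLatency);
```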

If there's no audio track, an audio clock is simulated. In that case, decoded pictures are sent to the compositor when the start time of the picture matches the clock. A frame is displayed until a new one is ready to be displayed at the right time. Any latency between the time a picture is sent to the graphics card and when it is actually displayed isn't taken into account (as there's typically no way to retrieve that information anyway).

In any case, the decoding pipeline is of no relevance; it's always the time at which a sample (either audio or video) is composited that matters.

As such, you can't receive the "ended" event until the video has actually finished playing and the last audio frames have been heard and there's now silence.
The condition you describe cannot happen.

I'm fairly sure the behaviour is the same in all other user agents.

@jpiesing
Author

@jyavenard Thanks for the quick reply.

I'm rather confused by your use of terms:

Sorry if I caused confusion.

To me, the decoding pipeline is about decoding, that is, converting compressed data (video or audio) into something uncompressed (either a PCM audio sample or an RGB/YUV picture frame).
What your machine is connected to, the type of screen or speakers, isn't part of the decoding pipeline.

Yes, but it all contributes to the delay/latency between the media time measured in the UA and the media time of the video the user is seeing and the audio they're hearing.

In Firefox, currentTime is the time of the audio playing now, as reported by the audio card for what can currently be heard; latency is taken into account. Whether the video is hardware decoded or not is totally irrelevant.

Thank you for that useful piece of information. What does "playing now" mean? Do you mean what the user should actually be hearing?

If there's no audio track, an audio clock is simulated. In that case, decoded pictures are sent to the compositor when the start time of the picture matches the clock. A frame is displayed until a new one is ready to be displayed at the right time. Any latency between the time a picture is sent to the graphics card and when it is actually displayed isn't taken into account (as there's typically no way to retrieve that information anyway).

In any case, the decoding pipeline is of no relevance; it's always the time at which a sample (either audio or video) is composited that matters.

Please forgive me if this is a stupid question but how is audio/video sync achieved here if the delay in the video pipeline is longer than the delay in the audio?

As such, you can't receive the "ended" event until the video has actually finished playing and the last audio frames have been heard and there's now silence.
The condition you describe cannot happen.

It's not remotely obvious from the spec that currentTime and current playback position relate to the content that is actually being heard/seen by the user.

I'm fairly sure the behaviour is the same in all other user agents.

@jyavenard
Member

jyavenard commented Sep 14, 2017

Thank you for that useful piece of information. What does "playing now" mean? Do you mean what the user should actually be hearing?

yes.

Please forgive me if this is a stupid question but how is audio/video sync achieved here if the delay in the video pipeline is longer than the delay in the audio?

those are unheard of (pun intended)... video typically has no delay.

There may be some exceptions, like video outputs that re-encode and send to a remote device. But there's typically no API to determine what the latency on a video output would be.

A/V sync would be broken if the audio were taking a different playback path.
The typical approach for those cases is to also transport the audio at the same time.
If A/V sync is good locally, then it will also be good remotely, assuming the delay between A and V stays the same.

It's not remotely obvious from the spec that currentTime and current playback position relate to the content that is actually being heard/seen by the user.

But currentTime isn't the current playback position; it is the official playback position, which is an approximation of the current playback position, designed to be stable within the current event loop (https://html.spec.whatwg.org/multipage/media.html#official-playback-position).
From the spec:
"The currentTime attribute must, on getting, return the media element's default playback start position, unless that is zero, in which case it must return the element's official playback position."

So it's not meant to be 100% accurate.
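A tiny sketch of what "stable within the current event loop" means in practice (assuming a playing <video> element):

```ts
// The official playback position is kept stable while a script runs, so two
// reads of currentTime in the same task should agree even though the
// underlying current playback position keeps advancing.
const video = document.querySelector('video')!;

const first = video.currentTime;
for (let i = 0; i < 1e7; i++) { /* burn some time synchronously */ }
const second = video.currentTime;

console.assert(first === second, 'currentTime is stable within a single task');
```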
In any case, "ended" should be fired when there's nothing more to play, it is "The current playback position is the end of the media resource"
it seems obvious to me that the current playback position is what the user is currently seeing/hearing. It is current after all.

If ended was to be fired before the user had seen or heard the last audio/video, I would consider it a bug.

Of course, that is subject to opinion, but I'm not sure it can be described in a better way in the text.
My $0.02.

@jpiesing
Author

Please forgive me if this is a stupid question but how is audio/video sync achieved here if the delay in the video pipeline is longer than the delay in the audio?
those are unheard of (pun intended)... video typically has no delay.

Checking with people who understand this better than me: in TV sets there would typically be a delay of between 20 and 160 ms between the output of the video decoder and the video reaching the display. The size of the delay depends on user preferences, the content type and other factors. The TV set will delay the audio to match this video delay when rendering the audio on its speakers or headphones. We believe media devices without an integrated display will also have (perhaps slightly smaller?) delays between the output of the media decoder(s) and outputting the video via something like HDMI. Note that the HDMI spec allows a display to report a video latency back to a media device. It's also possible for a display to report a different audio latency to the media device, but apparently nobody does.

It seems obvious to me that the current playback position is what the user is currently seeing/hearing. It is current, after all.

In the context of a UA integrated into a media device, I can see at least the following reasonable interpretations of "current playback position", in addition to "what the user is seeing and hearing":

  1. The time at the output of the video decoder (i.e. after all picture re-ordering, etc.). This is probably the simplest for someone integrating a UA into a media device.

  2. The time at which video and graphics are combined (i.e. after any video-specific picture processing/improvement). This is easy to test. It's a little more complex for an integrator than interpretation 1, and may only be possible to estimate rather than return precisely. For example, it may be necessary to measure the delay for each set of user preferences/content types (mentioned above) and offset the time at the output of the video decoder (as in interpretation 1) depending on which preferences and content type apply.

I can also see other possible interpretations of "current playback position" which others might find reasonable but I would not.

Many people integrating a UA onto a media device will choose whatever interpretation is easiest for them to implement unless something is clearly and unambiguously specified.

In any case, "ended" should be fired when there's nothing more to play, it is "The current playback position is the end of the media resource"

If "current playback position" is open to interpretations that may differ by up to 160ms then the timing of when ended is generated may be different by up to 160ms between different devices - potentially between different devices using the same UA.
