Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(content): Fix buggy timing metrics #14662

Merged
merged 1 commit into from
Dec 19, 2022
Merged

fix(content): Fix buggy timing metrics #14662

merged 1 commit into from
Dec 19, 2022

Conversation

dschom
Copy link
Contributor

@dschom dschom commented Dec 19, 2022

Because

  • We could see in Sentry that various times provided to metrics endpoint were consistently off.

This pull request

  • Creates an abstraction around the performance API to isolate some of its complexities.
  • Tries to use the performance API where possible.
  • Address discrepancies between performance API timestamps and system time.. The assumption that the clock used by Date and the clock used by the performance API are somehow in sync is likely the reason for the generation of erroneous data. It is very likely that there is a significant clock skew found between the monotonic clock used by the performance API and current state of the system clock. There appears to be a lot of nuance here, and the exact way this plays out depends on the OS, browser, and browser version, and if the machine has been put into sleep mode. One thing is clear, mixing the performance API timestamps and Date timestamps appears to not work very well.
  • Adds support for using L2 timings, and uses these timings when possible.
  • Adds a performance fallback class that can fill in for situations where the performance API is missing.
    • Adds some logic around timing values that should be ignored when set to 0.
    • Prefers the performance API's clock when possible, since it’s resilient to skewed metrics due to a computer being put to sleep.
  • For browser’s that do not support the performance api, we will not produce timing data.
  • For browser’s that do not support the performance api, we will make a best effort to produce timing data; however, if we detect the machine enters sleep mode during data collection, the data will be deemed unreliable and will not be recorded.

Issue that this pull request solves

Closes: FXA-5758

Checklist

Put an x in the boxes that apply

  • My commit is GPG signed.
  • If applicable, I have modified or added tests which pass locally.
  • I have added necessary documentation (if appropriate).
  • I have verified that my changes render correctly in RTL (if appropriate).

Other information (Optional)

This commit partially addresses FXA-2861, but not fully. Perhaps now that this in place that becomes a won't fix since these changes hopefully suffice until the react rewrite is complete. It should be noted that L2 and L1 timings are very similar in structure. It was pretty straight forward to map an L2 timing into the data fields required for the metrics API.

@dschom dschom requested a review from a team as a code owner December 19, 2022 17:28
Copy link
Contributor

@chenba chenba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like the approach. The idea that computer sleep affects this is interesting. The performance timing metrics are immediately flushed upon successful loading of the page, so intuitively it feels unlikely.

Because:
- We could see in Sentry that various times provided to metrics endpoint were consistently off.

This Commit:
- Creates an abstraction around the performance API to isolate some of its complexities.
 - Tries to use the performance API where possible.
- Address discrepancies between performance API timestamps and system time.. The assumption that the clock used by Date and the clock used by the performance API  are somehow in sync is likely the reason for the generation of erroneous data.  It is very likely that there is a significant clock skew found between the monotonic clock used by the performance API and current state of the system clock. There appears to be a lot of nuance here, and the exact way this plays out depends on the OS, browser, and browser version, and if the machine has been put into sleep mode. One thing is clear, mixing the performance API timestamps and Date timestamps appears to not work very well.
- Adds support for using L2 timings, and uses these timings when possible.
- Adds a performance fallback class that can fill in for situations where the performance API is missing.
  - Adds some logic around timing values that should be ignored when set to 0.
  - Prefers the performance API's clock when possible, since it’s resilient to skewed metrics due to a computer being put to sleep.
- For browser’s that do not support the performance api, we will not produce timing data.
- For browser’s that do not support the performance api, we will  make a best effort to produce timing data; however, if we detect the machine enters sleep mode during data collection, the data will be deemed unreliable and will not be recorded.
@dschom
Copy link
Contributor Author

dschom commented Dec 19, 2022

I like the approach. The idea that computer sleep affects this is interesting. The performance timing metrics are immediately flushed upon successful loading of the page, so intuitively it feels unlikely.

@chenba Agreed, sleep metrics being affected between a page load and metrics flush is indeed unlikely and probably results in a tiny fraction of the errors we see.

What I think might be a big source of error, however, is variability in how navigationStart is calculated on different systems. I think as long as we don’t assume navigationStart was based on Date.now(), we might be in the clear. I guess only testing in the wild will really tell since it’s hard to reproduce this locally.

In the jira ticket I cite some very similar reports where people have reported drift in navigationStart values. This issue shows inconsistent behavior for certain systems and the relationships between navigationStart, timeOrigin, and Date.now(). And this post sounds very similar to what we are seeing in our error logs.

By the way, if you want to see the monotonic clock drift (as it should), run this after a fresh restart. Put the machine to sleep and run it again.

performance.now() + performance.timeOrigin - Date.now()

Interestingly enough on my system, navigationStart isn’t affected by this drift, i.e.navigationStart always maps to Date.now() when the event fires, as it should per the L1 spec.

@dschom dschom merged commit 829862c into main Dec 19, 2022
@dschom dschom deleted the FXA-5758 branch December 19, 2022 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants