
Operation Point Reyes (and WPT throttling history) #9887

Closed
3 of 6 tasks
paulirish opened this issue Oct 24, 2019 · 24 comments

@paulirish
Member

paulirish commented Oct 24, 2019

Harkening to the days of Operation Yaquina Bay, we've got a new challenge in front of us...

Point Reyes is the windiest place on the Pacific Coast. And much like wind makes the physical world oscillate, variance makes our numbers vibrate.
image

We have a few questions we need answered to get our Lantern-driven simulation in tip-top shape.

Questions

  • What happened to the error rate?
    • Potential causes: OOPIF, NetworkService, ??
  • Can we collect WPT mobile device data faster?
  • Did the HTTP Archive runs (which we use for scoring-curve calibration) get affected by CPU throttling?

Actions

  • Collect golden WPT data from Moto G4 henceforth [@connorjclark]
  • Change LH's default from Nexus 5X to Moto G4. [@connorjclark]
    • Update UA, Update UA in LR, update device metrics (?)
    • Add device art to DevTools
  • Regather lantern unthrottled-assets traces with oopif disabled [@connorjclark]
    • Investigate error rate (return to level of old traces?) [@patrickhulce helping]
  • Investigate fidelity of WPT recordings [@patrickhulce]
  • Verify CPU throttling story on WPT [@brendankenny ]
  • Determine android memory threshold situation for OOPIF. [@connorjclark]
    • Hard cutoff? Plans? Do the Nexus 5X and Moto G4 have the same characteristics here?

Team, please update this with anything it's missing.

@connorjclark
Collaborator

We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol. We'd only do this for mobile runs. This will allow us to keep lantern the same and still get accurate results for mobile. But first, let's confirm that OOPIFs are a source of error via the "regather traces with OOPIF disabled" task above.
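For context, one common way to get OOPIF-free traces locally (short of a protocol-level switch) is to launch Chrome with site isolation disabled. Below is a minimal, illustrative sketch assuming chrome-launcher and the Lighthouse Node module; this is not necessarily how the collection scripts ended up doing it.

```js
// Illustrative only: gather a run with site isolation (OOPIFs) disabled by
// launching Chrome with the site-per-process feature turned off.
const chromeLauncher = require('chrome-launcher');
const lighthouse = require('lighthouse');

async function runWithoutOopifs(url) {
  const chrome = await chromeLauncher.launch({
    // Forces all frames into the main renderer process, so no OOPIFs appear in the trace.
    chromeFlags: ['--disable-features=site-per-process'],
  });
  try {
    const {lhr} = await lighthouse(url, {port: chrome.port, output: 'json'});
    return lhr;
  } finally {
    await chrome.kill();
  }
}
```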

@connorjclark
Collaborator

connorjclark commented Oct 25, 2019

For comparison, this is the accuracy of the dataset the lantern test has today:

image

With newly corrected traces:

With OOPIFs on local desktop:

image

Without OOPIFs on local desktop:

image

Data sets: https://drive.google.com/file/d/1-3IjtVsllDgcSsY0S49S1ZeTd1p13-IH/view?usp=sharing

@patrickhulce
Collaborator

So definitely better for TTI, but not nearly enough :/ I'm gonna start looking into the FCP traces here.

@brendankenny
Member

this seems implicit in the above, but just to be explicit, is the plan to

  • pick a new mobile device (Moto G4)
  • base mobile emulation on whatever it actually does, e.g. OOPIF behavior

vs simply tune numbers based on the times alone, presumably because the dependency graph would be so completely different as to be useless for predicting the mobile speed? e.g. lots of long tasks from iframes.

That does mean we're going to have to stay on our toes for more site isolation changes. And for

We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol

I wonder if there's a way on desktop we could get the password-triggered site isolation behavior that Android is getting, in case the G4 has it now or gets it someday soon.

Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.

@patrickhulce
Collaborator

Oh wait, @connorjclark these are the same WPT traces from earlier, right?

So all the FCP issues discussed in #9662 (comment) and #9662 (comment) apply here.

@connorjclark
Collaborator

connorjclark commented Oct 25, 2019

Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.

The G4 and N5 are under the threshold regardless, so no need to intervene. But yes, it's possible (WPT has a command-line option for it).

Oh wait, @connorjclark these are the same WPT traces from earlier, right?

Yeah, the WPT numbers are still from the N5 run. I can try getting G4 numbers tomorrow and see if that changes anything.

@brendankenny
Member

brendankenny commented Oct 25, 2019

The G4 and N5 are under the threshold regardless, so no need to intervene

"While we investigate how to bring this support to more devices..."

I meant it for better (near-)future-proofing, but good to know for the present.

@connorjclark
Collaborator

Ran WPT again for a subset of URLs, but for the G4. n=5, chose the median.

Note: I changed the WPT Chrome channel to Stable. But apparently it takes many weeks for Stable to release to Play Store, so I actually got M77. Hence the lack of LCP.

OOPIFs:

image

No OOPIFs:

image

@patrickhulce
Collaborator

patrickhulce commented Oct 28, 2019

Suggested Action Items / tl;dr

  • Use the resolved URL after redirects
  • Compare the complete URL sets
  • Use the minimum observed WPT value for each metric instead of the median-TTI-based one¹
  • Re-adjust expectations: our success metrics will not be quite as good, given that the original set was whittled down from 1000 URLs, we've switched devices, and two years of Android + Chrome updates have changed things underneath us.

Details

Redirects

We have a few URLs that are redirecting. Because of the existing discrepancy between how observed and simulated values are calculated (which I believe is being fixed for 6.0, #8984), we need to use the resolved URL after redirects, as sketched after the list below. Probably my bad for saving the golden set with the clean URL instead of the URL that was audited, sorry :/
- https://www.blogger.com/ https://www.blogger.com/about/?r=1-null_user
- https://www.douban.com/ https://m.douban.com/
- https://www.facebook.com/ https://m.facebook.com/?refsrc=https%3A%2F%2Fwww.facebook.com%2F&_rdr
- https://www.hexun.com/ https://m.hexun.com/
- https://www.iciba.com/ http://m.iciba.com/
- https://www.imageshack.us/ https://www.imageshack.us/login
- https://www.marketgid.com/ https://www.mgid.com/ru
- https://www.mop.com/ https://m.mop.com/
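
A minimal sketch of the redirect-aware matching mentioned above, assuming each record carries the LHR's requestedUrl and finalUrl fields (the helper names are illustrative):

```js
// Key comparisons on the post-redirect URL rather than the URL we asked for.
function indexByFinalUrl(records) {
  const byUrl = new Map();
  for (const record of records) {
    // e.g. requestedUrl: 'https://www.douban.com/', finalUrl: 'https://m.douban.com/'
    byUrl.set(record.finalUrl || record.requestedUrl, record);
  }
  return byUrl;
}

function matchGoldenToNew(goldenRecords, newRecords) {
  const newByUrl = indexByFinalUrl(newRecords);
  return goldenRecords
    .map(golden => ({golden, fresh: newByUrl.get(golden.finalUrl || golden.requestedUrl)}))
    .filter(pair => pair.fresh !== undefined);
}
```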

Subset Comparison

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script; it looks like Connor managed to match up 43, though). When the basket of URLs is significantly smaller than the original, already-small subset, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure if they just weren't run for the sake of time. The same redirect-resolution point above will need to be applied to these too. That being said, after adjusting our script for just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact, for TTFCPUI it was actually lower than normal), so this doesn't totally explain the error-rate differences.

image

- https://flipkart.com not found in new set
- https://vine.co/ not found in new set
- https://weather.com/ not found in new set
- http://www.4399.com/ not found in new set
- http://www.58.com/ not found in new set
- http://www.7k7k.com/ not found in new set
- http://www.amazon.co.jp/ not found in new set
- http://www.blogspot.com/ not found in new set
- http://www.brothersoft.com/ not found in new set
- http://www.china.com.cn/ not found in new set
- http://www.cntv.cn/ not found in new set
- http://www.conduit.com/ not found in new set
- http://www.craigslist.org/ not found in new set
- http://www.dawn.com/ not found in new set
- http://www.dion.ne.jp/ not found in new set
- http://www.ebay.com/ not found in new set
- http://www.espn.com/ not found in new set
- http://www.fc2.com/ not found in new set
- http://www.filestube.com/ not found in new set
- http://www.getpersonas.com/ not found in new set
- http://www.globo.com/ not found in new set
- http://www.hatena.ne.jp/ not found in new set
- http://www.hotfile.com/ not found in new set
- http://www.hp.com/ not found in new set
- http://www.huffingtonpost.com/ not found in new set
- http://www.hulu.com/ not found in new set
- http://www.java.com/ not found in new set
- http://www.livedoor.jp/ not found in new set
- http://www.liveperson.net/ not found in new set
- http://www.maktoob.com/ not found in new set
- http://www.metrolyrics.com/ not found in new set
- http://www.mlb.com/ not found in new set
- http://www.mozilla.org/ not found in new set
- http://www.optmd.com/ not found in new set
- http://www.orange.fr/ not found in new set
- http://www.orkut.com/ not found in new set
- http://www.partypoker.com/ not found in new set
- http://www.pcpop.com/ not found in new set
- http://www.pdfqueen.com/ not found in new set
- http://www.pptv.com/ not found in new set
- http://www.rakuten.co.jp/ not found in new set
- http://www.rakuten.ne.jp/ not found in new set
- http://www.scribd.com/ not found in new set
- http://www.shopping.com/ not found in new set
- http://www.skype.com/ not found in new set
- http://www.so-net.ne.jp/ not found in new set
- http://www.softonic.com/ not found in new set
- http://www.sogou.com/ not found in new set
- http://www.soso.com/ not found in new set
- http://www.symantec.com/ not found in new set
- http://www.t-online.de/ not found in new set
- http://www.tabelog.com/ not found in new set
- http://www.thefreedictionary.com/ not found in new set
- http://www.thepiratebay.org/ not found in new set
- http://www.thestar.com.my not found in new set
- http://www.tianya.cn/ not found in new set
- http://www.torrentz.com/ not found in new set
- http://www.tumblr.com/ not found in new set
- http://www.twitpic.com/ not found in new set
- http://www.typepad.com/ not found in new set
- http://www.verizonwireless.com/ not found in new set
- http://www.vevo.com/ not found in new set
- http://www.weather.com/ not found in new set
- http://www.wikipedia.org/ not found in new set
- http://www.ynet.com/ not found in new set
- http://www.youdao.com/ not found in new set
- http://www.zol.com.cn/ not found in new set
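
For the subset problem above, a minimal sketch of computing error rates over only the shared basket, so the old and new sets are compared on the same sites (the `predicted`/`observed` field names are illustrative, and this is not the actual fixUrl script):

```js
// Restrict the comparison to URLs present in both baskets and compute the
// mean absolute percentage error for each set over those shared URLs.
function errorOnSharedBasket(oldByUrl, newByUrl, metric) {
  const shared = [...oldByUrl.keys()].filter(url => newByUrl.has(url));
  const pctError = run =>
    Math.abs(run[metric].predicted - run[metric].observed) / run[metric].observed;
  const mean = xs => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return {
    sharedUrls: shared,
    oldError: mean(shared.map(url => pctError(oldByUrl.get(url)))),
    newError: mean(shared.map(url => pctError(newByUrl.get(url)))),
  };
}
```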

Unreproducible WPT Variance

This is likely selection bias, because we're looking into sites that performed particularly poorly partly by chance. When we try to reproduce those results, we just get more reasonable results. We probably didn't experience much of this in the first set because we determined the 100-URL golden set based on a much larger 1000-URL set, so runs with unpredictable behavior and high variance were just excluded. AT&T is a good example of this, where our golden value is 18.5s yet the median of 9 runs redone through the WPT UI is 10s. More on this below.

Lantern Intentionally Makes Optimistic Decisions But We Measure Against Median

The worst error rates that aren't redirect-driven are all dramatic underestimations. We've made lots of improvements and decisions over the past two years that make lantern intentionally optimistic. Even in our pessimistic simulations we use optimistic per-origin RTTs, optimistic server response times, and optimistic SSL and HTTP/2 setups. The decision to exclude highly variable runs from the initial golden set is likely responsible for providing a rosier outlook on the difference between predicting the median vs. predicting the minimum than was realistically achievable.

image

As a result, we're systematically underestimating. Explanations for the top couple of errors found below.

  1. https://www.att.com - golden says 18.5s, WPT traces say 10.5s-20s, we say 7s. Here we get unlucky that the median TTI is actually one of the longest FCPs, so it's not a median-median comparison. Perhaps we should be using a different median selector? Median FCP might have more reasonable network characteristics for all metrics and TTI is so variable anyhow.
  2. https://www.56.com - golden says 7.5s, WPT traces say 6-11s, we say 3s. Systematic and intentional lantern error. Here we are bitten by our intentional optimism in per-origin RTTs. The servers are in APAC which result in very high and very variable RTTs. We use the min (and plan to not observe it at all), so we greatly underestimate the time taken to download required resources.
  3. https://www.deviantart.com/ - golden says 8.5s, WPT traces say 5.1-8.5s, we say 3.5s. Here we again get unlucky that the median TTI is actually the max FCP. Extremely variable root document request that sometimes takes ~10x the lantern estimate.
  4. https://www.linkedin.com/ - golden says 4.1s, WPT traces say 2-5.1s, we say 1.9s. Here we again get unlucky that the median TTI is actually the max FCP. Extremely variable root document request that sometimes takes ~5x the lantern estimate.

¹ Either we track our accuracy against what we're explicitly attempting to simulate, or we change our simulation to inject more pessimism. I strongly dislike the latter approach but am happy to discuss if you disagree :)
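
A minimal sketch of the first option in that footnote, i.e. scoring each lantern estimate against the minimum observed WPT value for that metric rather than the value from the median-TTI run (data shapes and metric names are illustrative):

```js
// Compare a lantern estimate against the fastest observed value across runs.
function errorVsMinimum(wptRuns, lanternEstimateMs, metric) {
  const observed = wptRuns.map(run => run[metric]).filter(Number.isFinite);
  const minimumMs = Math.min(...observed);
  return Math.abs(lanternEstimateMs - minimumMs) / minimumMs;
}

// e.g. for att.com above: errorVsMinimum(attRuns, 7000, 'interactive') would
// compare our ~7s estimate against the ~10.5s fastest run instead of the
// 18.5s golden (median-TTI) value.
```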

@connorjclark
Collaborator

connorjclark commented Oct 28, 2019

Re: Redirects

I did correct many of the redirects, but I did not take into account m.* redirects based on UA. Good catch.

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script; it looks like Connor managed to match up 43, though). When the basket of URLs is significantly smaller than the original, already-small subset, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure if they just weren't run for the sake of time. The same redirect-resolution point above will need to be applied to these too. That being said, after adjusting our script for just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact, for TTFCPUI it was actually lower than normal), so this doesn't totally explain the error-rate differences.

I definitely did a subset in the interest of time. I attempted to make sure the sets were equivalent, but I was admittedly a bit cavalier.

When we try to reproduce those results, we just get more reasonable results. We probably didn't experience much of this in the first set because we determined the 100-URL golden set based on a much larger 1000-URL set, so runs with unpredictable behavior and high variance were just excluded.

I didn't know about the larger 1000 URL set. So it was whittled down based on just variance? We should definitely do something similar again, or at least measure the variance in our existing set of URLs and see what bad apples we have.

either we track our accuracy against what we're explicitly attempting to simulate or we change our simulation to inject more pessimism

Taking the minimum seems like maximum optimism. Would you describe lantern with the same level of optimism?

@connorjclark
Collaborator

connorjclark commented Oct 28, 2019

For making collection easier, I would like to transition to a cloud-based operation (it is very time-consuming right now, especially since my corp machine falls asleep, so I have to babysit it). I don't exactly know how to go about that. @patrickhulce any ideas? Probably the same approach you took with DZL (whatever that was).

In addition to convenience, it'd be a necessary step to automating the collection on a somewhat-regular basis.

@patrickhulce
Collaborator

So it was whittled down based on just variance?

More or less randomness + variance. If I had thought ahead I would have been much more methodical about it :) Basically I randomly selected 110% of the URLs we wanted to keep in a basket and threw out the worst 10% that had very high variance.
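
A minimal sketch of that whittling procedure, assuming each candidate URL carries a `runs` array of metric values from the larger collection (names are illustrative):

```js
// Randomly oversample ~110% of the target count, then drop the highest-variance 10%.
function pickLowVarianceBasket(candidates, targetCount) {
  const variance = xs => {
    const mean = xs.reduce((sum, x) => sum + x, 0) / xs.length;
    return xs.reduce((sum, x) => sum + (x - mean) ** 2, 0) / xs.length;
  };
  const shuffled = [...candidates].sort(() => Math.random() - 0.5); // crude shuffle, fine for a sketch
  const oversampled = shuffled.slice(0, Math.ceil(targetCount * 1.1));
  return oversampled
    .sort((a, b) => variance(a.runs) - variance(b.runs))
    .slice(0, targetCount); // keep the targetCount lowest-variance URLs
}
```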

Taking the minimum seems like maximum optimism. Would you describe lantern with the same level of optimism?

Given a fixed graph and a move to totally ignore observed per-origin RTT, yeah, I would say that the characteristics of our simulation produce the maximally optimistic result. The only thing pushing results to be more pessimistic is the pessimistic graph that includes things that potentially shouldn't be included, but in many cases there simply aren't any such things to include, and so we remain maximally optimistic. If min is too extreme, something like the 25th or 10th percentile might make sense. Part of the motivation here is that there's frequently a bimodal distribution, and by choosing the TTI-median we end up comparing ourselves with something that is the absolute worst-case scenario, which lantern is simply never going to try to match.

For making collection easier, I would like to transition to a cloud-based operation.

I agree, though this will likely come with its own subtly different perf characteristics to complicate things 😞

@connorjclark
Collaborator

I agree, though this will likely come with its own subtly different perf characteristics to complicate things

If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful :)

@patrickhulce
Collaborator

@patrickhulce any ideas?

ya, if the lantern collection script is G2G as-is I can put a script together for automating it on there and dumping results to cloud storage
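
A minimal sketch of the "dump results to cloud storage" half, assuming the @google-cloud/storage client; the bucket name is hypothetical:

```js
// Upload collected traces/devtools logs for one run to a GCS bucket.
const path = require('path');
const {Storage} = require('@google-cloud/storage');

async function uploadArtifacts(localPaths, runId) {
  const bucket = new Storage().bucket('lantern-traces-example'); // hypothetical bucket name
  for (const localPath of localPaths) {
    await bucket.upload(localPath, {
      destination: `${runId}/${path.basename(localPath)}`,
    });
  }
}
```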

@patrickhulce
Collaborator

patrickhulce commented Oct 28, 2019

If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful

Ooooh this is a great idea! Any chance of exposing a "get trace and devtools log" option to public PSI?

@connorjclark
Collaborator

Not public, but we have the capability to grab that stuff if we hit LR internally. So it'd amount to running the collection script on borg (+ hitting the LR api, which is easy enough).

@connorjclark
Collaborator

Quick summary of more action items we decided on today:

  • Pick the 75th-percentile (near-best) WPT run based on TTI (of 9 runs, that's the third best); see the sketch after this list
  • @patrickhulce will get collection process working on GCP
  • @connorjclark one day we'll do the same for LR, which will be useful for making sure that env is good too
  • @connorjclark continue developing lantern LCP separate from updating trace collection
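
The sketch referenced in the first item above: sort the 9 runs by TTI (fastest first) and take the run 75% of the way toward the best, i.e. the third fastest. The `interactive` field name is illustrative:

```js
// Pick the "near-best" run: 0.75 toward the best of 9 runs is index 2, the third fastest.
function pickNearBestRun(runs, goodnessPercentile = 0.75) {
  const sorted = [...runs].sort((a, b) => a.interactive - b.interactive);
  const index = Math.round((1 - goodnessPercentile) * (sorted.length - 1));
  return sorted[index];
}
```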

@patrickhulce
Collaborator

Pick the 75th-percentile (near-best) WPT run based on TTI (of 9 runs, that's the third best)

I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.

@connorjclark
Collaborator

I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.

sgtm. let's try varying that next time we do a full collection.

@connorjclark
Collaborator

connorjclark commented Jan 15, 2020

This PR added the collection scripts: #9662

Just realized we are selecting the golden run based on the 75th-percentile FCP. Apparently we want to use TTI instead (#9887 (comment)). Gotta fix that.

@paulirish
Member Author

paulirish commented Jan 31, 2020

CPU throttling on WPT braindump

My understanding of CPU throttling in WPT:

  • There's the desktop_browser class and the devtools_browser class. Both are used in the case of chrome_desktop, which is what HA is using.
  • self.options is global for the WPT instance. A job is a WPT run; within the run there may be a "first view" and a "repeat view", and those individual loads are tasks. The Lighthouse run is a job, though there will then be an LH task within it.
  • Using --throttling on the WPT instance will allow it to use cgroup CPU throttling, but only if each specific job also wants throttling.
    • If --throttling was not set but a mobile-emulation (on desktop host) job wants throttling, WPT uses DevTools CPU throttling (sketched after this list).
  • pat's fix made sure that CPU throttling (via devtools) is flipped on for Lighthouse runs/tasks.
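
For reference, the "DevTools CPU throttling" fallback corresponds to the Emulation.setCPUThrottlingRate protocol command. A minimal sketch of what that looks like, shown via Puppeteer for illustration (this is not WPT's agent code):

```js
// Apply DevTools-protocol CPU throttling before loading a page.
const puppeteer = require('puppeteer');

async function loadWithDevtoolsCpuThrottling(url, rate = 4) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const session = await page.target().createCDPSession();
  await session.send('Emulation.setCPUThrottlingRate', {rate}); // e.g. 4x slowdown
  await page.goto(url, {waitUntil: 'load'});
  await browser.close();
}
```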

status as of jan 2020:

  • Reading the code it seems like LH runs would be double throttled. :/
  • Running WPT now, WPT tasks are still 2.5x longer than LH tasks. This is unexpected. I've asked pat to look into it.

Updated Timeline of changes

Summarizing as good/bad months...

  • before sept 2019: wrong (lacking)
  • from oct 2019 to may 2020: good-ish 👍 (though different than before, also quantized)
  • june + july 2020: wrong (doubled)
  • aug 2020: wrong (bad config)
  • sept 2020: good 👍

@connorjclark
Collaborator

  1. Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?
  2. Paul, did you resolve your things w/ WPT?
  3. I think Moto G4 still has OOPIF disabled, FWIW
  4. "one day we'll do the same for LR, which will be useful for making sure that env is good too" → Should we do this?

@patrickhulce
Collaborator

patrickhulce commented Aug 4, 2020

Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?

There's a document outlining our plans here and we'll be converting it to issues shortly. doc

Paul, did you resolve your things w/ WPT?

Update in #9887 (comment), but tl;dr: some things are fixed, though not everything looks right still.

one day we'll do the same for LR, which will be useful for making sure that env is good too -> Should we do this?

tracked by #10358

@patrickhulce
Collaborator

The variance mission will never end, but we're done with this as a standalone effort for now.

On May 19, 2022, @paulirish changed the title from "Operation Point Reyes (OPR)" to "Operation Point Reyes (and WPT throttling history)".