Operation Point Reyes (and WPT throttling history) #9887
We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol. We'd only do this for mobile runs. This will allow us to keep lantern the same and still get accurate results for mobile. But first, let's confirm that OOPIFs are a source of error.
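For reference, a rough sketch of what a no-OOPIF run could look like with chrome-launcher and the lighthouse node module. This is an assumed shape, not the actual collection script, and the site-per-process flag here just stands in for whatever protocol-level switch we land on:

```js
// Sketch only: launch Chrome with site isolation disabled so cross-origin
// iframes stay in the main renderer (the "no OOPIF" condition), then run
// Lighthouse against that instance.
const chromeLauncher = require('chrome-launcher');
const lighthouse = require('lighthouse');

async function runWithoutOopifs(url) {
  const chrome = await chromeLauncher.launch({
    chromeFlags: ['--headless', '--disable-features=site-per-process'],
  });
  try {
    // Mobile emulation because we'd only do this for mobile runs.
    const result = await lighthouse(url, {port: chrome.port, emulatedFormFactor: 'mobile'});
    return result.lhr;
  } finally {
    await chrome.kill();
  }
}
```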
For comparison, this is the accuracy of the dataset the lantern test has today:

- With newly corrected traces
- With OOPIFs on local desktop
- Without OOPIFs on local desktop

Data sets: https://drive.google.com/file/d/1-3IjtVsllDgcSsY0S49S1ZeTd1p13-IH/view?usp=sharing
So definitely better for TTI, but not nearly enough :/ I'm gonna start looking into the FCP traces here.
this seems implicit in the above, but just to be explicit, is the plan to
vs simply tune numbers based on the times alone, presumably because the dependency graph would be so completely different as to be useless for predicting the mobile speed? e.g. lots of long tasks from iframes. That does mean we're going to have to stay on our toes for more site isolation changes. And for
I wonder if there's a way on desktop we could get the password-triggered mobile behavior that android is getting if the G4 gets it or gets it someday soon. Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.
Oh wait, @connorjclark these are the same WPT traces from earlier, right? So all the FCP issues discussed in #9662 (comment) and #9662 (comment) apply here.
The G4 and N5 are under the threshold regardless, so no need to intervene. But yes, it's possible (wpt has a
Yeah, the WPT numbers are still from the N5 run. I can try getting G4 numbers tomorrow and see if that changes anything.
I meant for better (near) future-proofing, but good to know for the present.
Ran WPT again for a subset of URLs, but for the G4. n=5, chose median. Note: I changed the WPT Chrome channel.

OOPIFs:
No OOPIFs:
Suggested Action Items / tl;dr
Details

Redirects

We have a few URLs that are redirecting. Because of the existing discrepancy between how observed and simulated metrics are calculated (which I believe is being fixed for 6.0, #8984), we need to use the resolved URL after redirects (rough sketch of that step below). Probably my bad for saving the golden set with the clean URL instead of the URL that was audited, sorry :/

Subset Comparison

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, subset, it's difficult to compare error characteristics reliably. Lots of these URLs are in connor's list in the PR though, so I'm not sure if they weren't run just for the sake of time. The same redirect resolution point above will need to be applied to these too. That being said, after adjusting our script for just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact for TTFCPUI it was actually lower than normal), so this doesn't totally explain the error rate differences.
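A rough sketch of that redirect-resolution step, just to pin down the idea of keying both sets by the post-redirect URL (node-fetch assumed; names are made up, and this is not the actual fixUrl script):

```js
// Sketch: map each "clean" golden-set URL to the URL it ultimately resolves to,
// so new runs and golden entries are matched on the same key.
const fetch = require('node-fetch');

async function resolveFinalUrl(url) {
  // node-fetch follows redirects by default; response.url is the final URL.
  const response = await fetch(url, {redirect: 'follow'});
  return response.url;
}

async function keyByFinalUrl(urls) {
  const byFinalUrl = new Map();
  for (const url of urls) {
    byFinalUrl.set(await resolveFinalUrl(url), url);
  }
  return byFinalUrl; // finalUrl -> original golden-set URL
}
```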
Unreproducible WPT Variance

This is likely selection bias, because we're looking into sites that performed particularly poorly partly by chance. When we try to reproduce those results, we just get the more reasonable results. We probably didn't experience much of this in the first set because we determined the 100 golden set based on a much larger 1000 set, so runs with unpredictable behavior and high variance were simply excluded. AT&T is a good example of this, where our golden value is 18.5s yet the median of 9 runs redone through the WPT UI is 10s. More on this below.

Lantern Intentionally Makes Optimistic Decisions But We Measure Against Median

The worst error rates that aren't redirect-driven are all dramatic underestimations. We've made lots of improvements and decisions over the past two years that make lantern intentionally optimistic. Even in our pessimistic simulations we use optimistic per-origin RTTs, optimistic server response times, and optimistic SSL and HTTP/2 setups. The decision to exclude highly variable runs from the initial golden set likely painted a rosier picture of the difference between predicting the median and predicting the minimum than was realistically achievable. As a result, we're systematically underestimating. Explanations for the top couple of errors are below.
1 = either we track our accuracy against what we're explicitly attempting to simulate or we change our simulation to inject more pessimism. I strongly dislike the latter approach but am happy to discuss if you disagree :)
I did correct many of the redirects, but I did not take into account
I definitely did a subset in the interest of time. I attempted to make sure the sets were equivalent, but I was admittedly a bit cavalier.
I didn't know about the larger 1000 URL set. So it was whittled down based on just variance? We should definitely do something similar again, or at least measure the variance in our existing set of URLs and see what bad apples we have.
Taking the minimum seems like maximum optimism. Would you describe lantern with the same level of optimism?
To make collection easier, I would like to transition to a cloud-based operation (it is very time-consuming rn, especially since my corp machine falls asleep, so I have to babysit it). I don't exactly know how to go about that. @patrickhulce any ideas? Probably the same approach you took with DZL (whatever that was). In addition to convenience, it'd be a necessary step toward automating the collection on a somewhat-regular basis.
More or less randomness + variance. If I had thought ahead I would have been much more methodical about it :) Basically I randomly selected 110% of the URLs we wanted to keep in a basket and threw out the worst 10% that had very high variance.
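To make that selection reproducible next time, a small sketch of the "keep the stable ones" filter. The `runsByUrl` shape and the coefficient-of-variation ranking are assumptions, not how the original basket was actually built:

```js
// Sketch: given repeated observed metric values per URL (e.g. TTI in ms),
// rank URLs by coefficient of variation and keep only the most stable ones.
function coefficientOfVariation(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance) / mean;
}

function selectStableUrls(runsByUrl, targetCount) {
  return [...runsByUrl.entries()]
    .map(([url, values]) => ({url, cv: coefficientOfVariation(values)}))
    .sort((a, b) => a.cv - b.cv) // most stable first
    .slice(0, targetCount)
    .map(entry => entry.url);
}
```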
Given a fixed graph and a move to totally ignore observed per-origin RTT, yeah I would say that the characteristics of our simulation produce the maximally optimistic result. The only thing pushing results to be more pessimistic is the pessimistic graph that includes things that potentially shouldn't be included, but in many cases there simply aren't any such things to include and so we remain maximally optimistic. If min is too extreme, something like the 25th or 10th percentile might make sense. Part of the motivation here is that there's frequently a bimodal distribution and by choosing the TTI-median we end up comparing ourselves with something that is absolute worst-case scenario, which lantern is simply never going to try to match.
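If we go the percentile route, the comparison could look something like this (nearest-rank approximation; the exact percentile is the open question):

```js
// Sketch: compare the lantern estimate against a lower percentile of the
// observed runs instead of the median. p = 0 is the minimum.
function percentileOf(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

function relativeError(lanternEstimate, observedRuns, p = 25) {
  const target = percentileOf(observedRuns, p);
  return (lanternEstimate - target) / target; // negative = underestimate
}
```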
I agree, though this will likely come with its own subtly different perf characteristics too, complicating things 😞
If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful :)
ya, if the lantern collection script is G2G as-is I can put a script together for automating it on there and dumping results to cloud storage
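Roughly what the "dump to cloud storage" piece could look like, assuming a GCS bucket and the @google-cloud/storage client; the bucket name and layout here are hypothetical:

```js
// Sketch: upload one run's artifacts (trace, devtools log, lhr, etc.) to a
// results bucket so collection can run unattended on a cloud machine.
const path = require('path');
const {Storage} = require('@google-cloud/storage');

async function uploadRunArtifact(localPath, runId) {
  const storage = new Storage();
  await storage.bucket('lantern-collection-results').upload(localPath, {
    destination: `runs/${runId}/${path.basename(localPath)}`,
  });
}
```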
Ooooh this is a great idea! Any chance of exposing a "get trace and devtools log" to public PSI?
Not public, but we have the capability to grab that stuff if we hit LR internally. So it'd amount to running the collection script on borg (+ hitting the LR api, which is easy enough).
quick summary of more action items we decided today:
I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway, I have a hunch that it's less indicative for the other metrics.
sgtm. let's try varying that next time we do a full collection.
This PR added the collection scripts: #9662

Just realized we are selecting the golden run based on 75pct FCP. Apparently we want to use TTI instead (#9887 (comment)). Gotta fix that.
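For the fix, selecting the golden run by TTI could be as simple as picking the run whose interactive value is closest to the chosen percentile. The run object shape and the percentile here are assumptions, not what the collection script currently does:

```js
// Sketch: pick the "golden" run from n repeated runs based on TTI rather than FCP.
function pickGoldenRun(runs, p = 75) {
  const sorted = [...runs].sort((a, b) => a.tti - b.tti);
  const target = sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))].tti;
  return runs.reduce((best, run) =>
    Math.abs(run.tti - target) < Math.abs(best.tti - target) ? run : best);
}
```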
CPU throttling on WPT braindump
My understanding of CPU throttling in WPT:
status as of jan 2020:
Updated Timeline of changes
Summarizing as good/bad months...
There's a document outlining our plans here, and we'll be converting it to issues shortly: doc
Update in #9887 (comment), but tl;dr: some things are fixed, but not everything looks right yet.
tracked by #10358
The variance mission will never end, but we're done with this as a standalone effort for now.
Harkening to the days of Operation Yaquina Bay, we've got a new challenge in front of us...
Point Reyes is the windiest place on the Pacific Coast. And much like wind makes the physical world oscillate, variance makes our numbers vibrate.
We have a few questions we need answered to get our Lantern-driven simulation in tip-top shape.
Questions
Actions
- unthrottled-assets
- traces with oopif disabled [@connorjclark]

Team, please update this with anything it's missing.