-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up like patch guidelines are not optimal and could be improved #2593
Comments
Could the fishtest framework be extended to allow testing for speed? For example by running a bench (possibly on some new, chosen positions). This way we could get fast performance change measurments on multiple platforms. (noob's machines being similar and in high numbers may have to be accounted for.) This would have 3 effects:
|
Excellent suggestion ! |
I don't like measuring speedups with strict LTC bounds. |
So, we have 3 recent speedup patches that I'd like to refer to in this issue (see references above). I do agree that we need some more clarity, so let's consider some of the arguments:
The first argument leads to an easy rule to follow. Passing our STC + LTC tests is not easy, and doing so solidly shows that there is a benefit. This makes discussion on the merit of the patch a rare thing (only for very invasive patches). It is hard to achieve a speedup that manages this... but this is similar for many of the eval/search patches as well. The second argument can be understood as well. A speedup will be good at any TC, and as such is risk-free, contrary to e.g. an eval/search patch. However, the benefit is likely to become smaller with increasing TC, as Elo grows more slowly with increased speed at higher TC. Also, while the risk wrt. playing strength is small, there is a risk concerning the maintenance of the code, the evolution of the code, or the chance of introducing subtle bugs. So, the latter argument raises two questions:
Having written all this, I feel that our current guidelines do make sense... they give some flexibility to wave 'trivial' speedup patches through (keeping in mind the caveats of speedup measurements), while they apply standard rules in cases the patches are not considered trivial. |
Here's a nice reference to a previous discussion regarding speed up measurement. Nothing conclusive with many question left opened though but instructive. Its a reccuring topic of course. Please note that Marco Costalba itself claimed test succeed with [-4,0] bounds for a .32 ELO gain. Mine was .72 ELO gain with [.25,.1.75] and was labelled as fail. Again, not arguing against the people judgement here nor complaining. But I want to highlight the fact that unappropriate guidelines lead to subjective appreciation that could go either way. Hence I reiterate the importance to try to clear that issue by improving the guidelines to the benefit of a better, stronger engine. |
Fishcooking is full of usefull info. Specially in this one where it is stated that LTC are not required for speed up. Why did that changes? |
One last: Reliable speed comparison: some math required |
Summarizing the actual discussion including previous one found on fishcooking, for speed up like patch, we have: STC [-0.5, 1.50] Could lead to false positive up to 20% of the time Previous guinelines clearly stated: no LTC required. This lead to me to update the actual guidelines to: STC [-0.5, 1.50], then STC [0.25, 1.75] And we are good to go. Fair enough? |
At least to have an information on LTC ELO gain and verify it is still OK, I will suggest Here 2 tests of the same speed up patch with 2 different LTC bounds: The Elo assessment is still good with [-1.5, 0.5] |
I agree. The ELO assessment should give a clue on how the speed up scales. This could be added as well. I still beleive that testing STC with LTC bounds could prevent unwanted false positive though. |
STC [0.25, 1.75] is not easy. Perhaps STC [0, 1.5] is balanced : |
Seems reasonable to me. +1 I'd like to add that having specifics test bounds for speed up patch clearly indicated that thoses are in a class of there own, not to be mixed with other type of patches. A plus. |
I'm still in the process of getting some data to map speedups to Elo... I think this is a first step before we rush to establishing things. No need to rush. |
Agreed, no need to rush. Always best to analyse first. Just curious, what are you using as data set and methodology to map that? Can I help in any way? |
I'll just be using games with time odds. So, tc=11.600+0.116 vs tc=10.000+0.100 is equivalent to a 16% speedup. That's right now the only data point I have, this is 32.4 +- 3.1 Elo. |
I'll be doing that for 16%, 8% and 4% speedup both at STC and LTC (20k games at STC, maybe a bit less at LTC). This should allow to get some interpolation near ~1% speedup / 0.5 - 1 Elo. |
Thanks a lot. |
Just a thought: how about playing 200,000 games at STC? Or maybe even 400,000 games? |
Time odds and speedup is not the same. Speedup only gains time when the engine is running, while extra time applies a benefit in addition to every possible non operating state due to internal and external delays (gui, OS, latencies etc). Im sure that a fast test of speedup vs time odds will show it immediately. Logic and experience also cant believe an 1% speedup to be worth 1.3 LTC elo. |
Thanks @vondele for these interessant results. But how can we conclude now ? |
I propose we essentially stick to the current guidelines (which I'll try to formulate a little more cleanly in the wiki). Very simple speedups (obviously correct, no additional book-keeping, concise, etc) can be proposed on the basis of well done speedup measurements that are verified independently if they are in the range ~0.50%. Larger patches (and the three PRs linked to in this issue fall in this category), require standard testing (and thus speedups >0.7%). As with all patches, even passing that is no guarantee the PR will be merged, but it's a solid argument. Where to draw the line between these cases remains ultimately a decision of the maintainer. |
@vondele This is good work. The LTC curve seems a bit high to me, but without more data, I can't argue. The patch I submitted for instance shows around 1% speed up on my machine (Windows with BMI2 support compile with GCC 9.3.0 on Cygwin). It doesn't make it to fishtest though even with a STC showing up to 6 ELO gain. I can't explain this with your number. Maybe we are in the situation of confirmation bias case, but I have nothing more to add. So lets move forward with this. I'm not sure any speed up will be able to make it though from this day forward. Time will tell.. |
Recently, many decent patch passed STC with honors but didn't make it to LTC with [0.25, 1.75] bounds.
Like some stated earlier, speed up like patch doesn't scale well for LTC test. Speed up enable SF to get to higher depth significantly more on low TC. On higher TC, only logics improvement can really make a difference.
Here are the actual guidelines:
Quoting from https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#non-functional-changes :
We will verify the speed up, for 3 reasons:
The speedup needs to be statistically significant and not just random noise.
The speedup needs to be confirmed on different machines. Sometimes a speedup on one machine is a slowdown on another.
The speedup needs to be put in perspective with the code changes. If the code changes are invasive and/or ugly, only to achieve a small speedup, we will probably not accept the patch. This is subject to our appreciation.
If the above steps are not enough to clarify the opportunity to commit the patch, then the patch will go under normal STC+LTC fishtest tests. The rationale is that a speed-up is totally comparable to a normal patch: it adds complexity with the aim to improve ELO, so it makes sense to test under the same conditions.
So, for me, the guidelines regarding Speed up like patch are not optimal and should be improved. Otherwise, this type of patch won't make it anytime soon and SF will continue to lack of efficiency at low TC which is a very valid use case. Also, maintainer are doing a good job preventing regression in many area, but the lack of appropriate and optimal guidelines for thoses patch leads to unconstructive discussions based on a flawed, not optimal, criteria.
The proposition of some to have:
perfectly make sense and is a step in the right direction for those type of tests. We can also refine the definition of Speed up test to prevent relaxing the LTC rules for test that are not actual speed up to be something like:
"Code change that doesn't change the logic, hence the bench number of must remains the same" Probably stating "non functional" is also enough to clarify. Clearly, that must not be an actual logic improvement nor a simplication or a mix or all of that.
The text was updated successfully, but these errors were encountered: