
testing: consider calling ReadMemStats less during benchmarking #20875

Open
josharian opened this issue Jun 30, 2017 · 20 comments
Labels: NeedsFix (The path to resolution is known, but the work has not been done.), Performance
@josharian
Contributor

josharian commented Jun 30, 2017

https://golang.org/cl/36791 reduced the number of times benchmarks call ReadMemStats. However, the implementation was incorrect (#20590, #20863), and it was rolled back (https://golang.org/cl/46612 and https://golang.org/cl/47350).

The rationale for the rollback is that ReadMemStats is now fast (https://golang.org/cl/34937). However, using tip as of June 30, 2017 (445652f), cpuprofiling of package sort's benchmarks still shows almost half of all execution time in ReadMemStats. See sort-cpu.pdf.

So for 1.10, either ReadMemStats should be made cheaper still, or we should re-roll https://golang.org/cl/36791 with a better implementation; see the initial patchsets of https://golang.org/cl/46612 for improvements.

cc @bradfitz @aclements @meirf @ALTree

@josharian josharian added this to the Go1.10 milestone Jun 30, 2017
@bradfitz
Contributor

Sorry about that. I suspected it might be the case that ReadMemStats would still be noticeably expensive after the rollback, but I also suspected you would file this very bug. I would've been happy rolling forward a few weeks ago, but at this point we're trying to lock down for a release.

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 30, 2017
@josharian
Contributor Author

I also suspected you would file this very bug.

How did you guess? :)

at this point we're trying to lock down for a release.

But of course; not a problem. Happy to fix in 1.10 instead.

@aclements
Member

Interesting. Do you happen to know how many times ReadMemStats is called versus how much time each call takes? After I optimized it I wasn't able to get it to take more than a few microseconds even under stress testing.

@josharian
Contributor Author

Do you happen to know how many times ReadMemStats is called versus how much time each call takes?

ReadMemStats gets called a lot--every iteration, via StartTimer/StopTimer, e.g. at https://github.com/golang/go/blob/master/src/sort/sort_test.go#L640.

@aclements
Member

ReadMemStats gets called a lot--every iteration, via StartTimer/StopTimer, e.g. at https://github.com/golang/go/blob/master/src/sort/sort_test.go#L640.

Oh, geez, okay. :)

If there's a way to safely avoid ReadMemStats, that's fine, but I also don't think we should over-optimize for this one weirdness in this one benchmark

@josharian
Contributor Author

If there's a way to safely avoid ReadMemStats, that's fine, but I also don't think we should over-optimize for this one weirdness in this one benchmark

I think there is, more or less described above. And for better or for worse, calling StartTimer/StopTimer every iteration is not all that rare. :) Anyway, this is on my plate to try again for 1.10.

@josharian josharian self-assigned this Jul 21, 2017
@bradfitz bradfitz added NeedsFix The path to resolution is known, but the work has not been done. Performance labels Nov 28, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 28, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018
@iand
Contributor

iand commented Sep 26, 2018

How about an opt-in approach: a new SkipMemStats method on B. When called it would disable the collection of memory usage. Running the benchmark with -test.benchmem would override SkipMemStats as would a later call to ReportAllocs in the same benchmark if people cared to do that.

@iand
Contributor

iand commented Sep 27, 2018

I implemented the opt-in I suggested above and tried it out on the sort package tests. I found it made very little difference in the actual benchmark. This is on linux/amd64.

The cpu profile of master on my system shows that ReadMemStats has negligible impact. This may just be an issue with the speed of ReadMemStats on some architectures. cc @josharian @aclements

Benchmarks of sort when calling SkipMemStats compared to master:

benchmark                         old ns/op      new ns/op      delta
BenchmarkSearchWrappers-8         88.1           89.3           +1.36%
BenchmarkSortString1K-8           137554         139288         +1.26%
BenchmarkSortString1K_Slice-8     105527         104311         -1.15%
BenchmarkStableString1K-8         193180         195908         +1.41%
BenchmarkSortInt1K-8              82989          84371          +1.67%
BenchmarkStableInt1K-8            98501          99706          +1.22%
BenchmarkStableInt1K_Slice-8      53439          53342          -0.18%
BenchmarkSortInt64K-8             8134449        8232955        +1.21%
BenchmarkSortInt64K_Slice-8       4556906        4604299        +1.04%
BenchmarkStableInt64K-8           8856651        8728228        -1.45%
BenchmarkSort1e2-8                46021          43748          -4.94%
BenchmarkStable1e2-8              90498          89224          -1.41%
BenchmarkSort1e4-8                9528589        9573428        +0.47%
BenchmarkStable1e4-8              28448203       28591226       +0.50%
BenchmarkSort1e6-8                1484072102     1470491590     -0.92%
BenchmarkStable1e6-8              5864716595     5833299379     -0.54%

@ALTree
Member

ALTree commented Sep 27, 2018

@iand you should run each benchmark multiple times (see the -count flag of go test), and then compare the output using benchstat (instead of benchcmp). It'll give you a report that is less likely to be influenced by random external noise during the test.
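That workflow looks roughly like this (package path and bench pattern are illustrative; benchstat is at golang.org/x/perf/cmd/benchstat):

```shell
# Run each benchmark several times on the baseline, then again after
# the change, and let benchstat do the statistical comparison.
go test -run='^$' -bench=. -count=10 ./sort > old.txt
# ... apply the change under test ...
go test -run='^$' -bench=. -count=10 ./sort > new.txt
benchstat old.txt new.txt
```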

@iand
Contributor

iand commented Sep 27, 2018

@ALTree I did that too, but didn't post the results here since the profile showed that ReadMemStats was not affecting the run times on my system.

@josharian
Contributor Author

How about an opt-in approach

The approach I sketched (in the linked CLs) should not require any API. I still think it is worth exploring, and it remains in my queue.


I originally found that:

cpuprofiling of package sort's benchmarks still shows almost half of all execution time in ReadMemStats

@iand, you report:

The cpu profile of master on my system shows that ReadMemStats has negligible impact.

That's an interesting mismatch. I will have to re-measure to see whether I can still reproduce.

@mvdan
Member

mvdan commented Jan 13, 2019

I agree with Josh that calling StartTimer and StopTimer at every iteration isn't rare. For example, I tend to do that often when the setup/cleanup cost per benchmark iteration is comparable to the benchmark work itself.

And this is a big problem when both of these costs are small. For example, right now I was benchmarking a function that took just under a microsecond, and the pprof cpu profile showed that ReadMemStats took over 60% of the CPU, while my function took less than 3%.

Removing Start/StopTimer from my benchmark is a possibility, but then I'm not really benchmarking my function anymore. The numbers could get better or worse by only changing the setup/cleanup code.

The cost of starting and stopping the timer can also throw off -benchtime=duration estimates, if its cost is much larger than the benchmarked func itself. For example, see #27217 (comment), where a go test -benchtime=0.1s ran for over five seconds. I can provide a small standalone example if that would help illustrate the point.

@iand
Contributor

iand commented Jan 13, 2019

@mvdan what OS are you testing on? Are you able to repeat the results across Linux/Win/Mac?

@mvdan
Member

mvdan commented Jan 13, 2019

I am on go version devel +5f699e400a Thu Jan 10 21:21:34 2019 +0000 linux/amd64. I'd like to have Windows/Mac machines around to test things with, but I'd rather not pay for those :)

mvdan added a commit to mvdan/benchinit that referenced this issue Jan 14, 2019
It can completely mess up the benchmark numbers for init functions that
were too small. Moreover, it could make the -benchtime estimates be way
off. For example, 'benchinit cmd/go' was taking over a minute to run the
benchmark, instead of the expected ~1s.

The benchtime estimate being off is likely the upstream issue
golang/go#27217.

The fact that StartTimer and StopTimer are expensive is being tracked in
golang/go#20875.
@andybons andybons removed this from the Go1.13 milestone Jul 8, 2019
@seebs
Contributor

seebs commented Apr 15, 2020

Probably irrelevant to most benchmarks, but: I suspect the impact may be very widely variable because ReadMemStats needs to stop the world. So single-threaded benchmarks might not be affected, but in parallel runs, you might see very weird impact, and a lot of that impact won't be directly attributable to ReadMemStats itself.

@aclements
Member

Stopping the world is typically < 50us even in parallel runs, since it's also critical to GC latency. Of course, if a benchmark is usually measured in nanoseconds, this is still really slow by comparison. :)

I wonder if testing's ReadMemStats is less of a problem on tip now that it no longer blocks during a GC.

Also, it may be that @mknyszek's unstable runtime metrics API could help here. With that API, the testing package could ask for just the couple metrics it actually wants. That would be cheaper to collect and maybe we could even do just those without stopping the world.

@mknyszek
Contributor

Stopping the world is typically < 50us even in parallel runs, since it's also critical to GC latency. Of course, if a benchmark is usually measured in nanoseconds, this is still really slow by comparison. :)

I wonder if testing's ReadMemStats is less of a problem on tip now that it no longer blocks during a GC.

It should be a lot better, especially if the benchmark is allocating a lot, but as you say it's still going to be problematic for very small benchmarks.

(See #19812 for details.)

Also, it may be that @mknyszek's unstable runtime metrics API could help here. With that API, the testing package could ask for just the couple metrics it actually wants. That would be cheaper to collect and maybe we could even do just those without stopping the world.

Unfortunately the testing package measures the number of mallocs, which will require stopping the world for the foreseeable future. This is because we need to flush those stats out of each mcache to get an exact number (we need to count on the fast path). We could expose an "approximate" mallocs statistic which may or may not be monotonic (I'm not sure how difficult it'll be to do that given that we do that whole "assume the span gets filled" trick). That doesn't really help the testing package, though, which probably wants an exact number.

It's still probably worth reducing the number of calls if there's a way.

(See #37112 for details about that API, though.)

@gopherbot
Contributor

Change https://golang.org/cl/257647 mentions this issue: testing: grow benchmark iteration incrementally

@mknyszek
Contributor

mknyszek commented Apr 5, 2022

I recently came across this just by happenstance, and the runtime/metrics package doesn't require a STW in any way. We should replace the call to ReadMemStats in the testing package.

Though, there's currently a monotonicity bug. The fix is up for 1.19.

@mknyszek
Contributor

I just tried replacing ReadMemStats the easy way and there is a problem with using runtime/metrics: because it doesn't stop the world, it can't easily force flush all the allocation stats like ReadMemStats can. In sum, it's not going to be able to pick up on, say, 1 allocation from 1 run, but if you do many runs, then it will pick up an accurate average.

I'm not sure if that's a deal-breaker. It might be for AllocsPerRun, but that might be OK. I don't know.
