microoptimization on rand(::AliasTable) #1831

Merged (2 commits) on Feb 5, 2024

Conversation

@adienes (Contributor) commented Jan 27, 2024

This doesn't necessarily improve all the CI benchmarks, but I think those are somewhat artificial. In all of the actual use cases I've tried, it's faster to use an ifelse and omit the @inbounds.
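For context, a minimal sketch of the two variants. This is a stand-in, not the package source: MyAliasTable, rand_statusquo, and rand_pr are illustrative names, and the field names accept/alias are taken from the discussion below.

```julia
using Random

# Minimal stand-in for the package's alias-table sampler, only to contrast
# the two variants discussed in this PR.
struct MyAliasTable
    accept::Vector{Float64}  # acceptance threshold per bucket
    alias::Vector{Int}       # alias index per bucket
end

# Status quo: ternary branch under @inbounds.
function rand_statusquo(rng::AbstractRNG, s::MyAliasTable)
    i = rand(rng, 1:length(s.alias))
    u = rand(rng)
    @inbounds r = u < s.accept[i] ? i : s.alias[i]
    return r
end

# This PR: branch-free ifelse (both arms are evaluated, but both are cheap),
# with bounds checks left enabled.
function rand_pr(rng::AbstractRNG, s::MyAliasTable)
    i = rand(rng, 1:length(s.alias))
    u = rand(rng)
    return ifelse(u < s.accept[i], i, s.alias[i])
end
```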

status quo:

julia> @benchmark rand($rng, s) setup=(s = AliasTable(normalize(rand(10), 1)))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):   7.132 ns … 33.909 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.679 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.606 ns ±  1.195 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁ ▂▂ ▃▄▆▁▇▆▂▇▇▂▇█▃▆▅▅ ▃▃         
  ▁▁▁▁▁▁▁▁▁▁▂▂▂▁▂▂▂▃▃▃▄▄▃▄▆▆▅▇█▆██▇███████████████████▆██▄▅▅▃ ▄
  7.13 ns         Histogram: frequency by time        13.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark rand($rng, s) setup=(s = AliasTable(normalize(rand(100), 1)))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  11.125 ns … 28.416 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.083 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.056 ns ±  0.507 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                  ▂    ▁ █▁▁▁▁ ▇               
  ▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▅▄▅▅▅▆█▇▇▇██████████▇▇▆▆▆▅▇▄▃▃▃▃▃ ▄
  11.1 ns         Histogram: frequency by time          14 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark rand($rng, s) setup=(s = AliasTable(normalize(rand(100_000), 1)))
BenchmarkTools.Trial: 2899 samples with 1000 evaluations.
 Range (min … max):  12.042 ns … 49.917 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.917 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.013 ns ±  1.244 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        ▂▃ ▅▇ ▇█▇ ▆▆ █▆ ▅▂ ▃                   
  ▂▁▂▁▁▂▁▂▃▁▂▄▁▄▄▁▆▆▁▇█▁██▁██▁███▁██▁██▁██▁██▁▇▆▁▅▄▁▅▃▁▃▃▁▂▂▂ ▄
  12 ns           Histogram: frequency by time        13.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

PR:

julia> @benchmark rand($rng, s) setup=(s = AliasTable(normalize(rand(10), 1)))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.458 ns … 41.708 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.583 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.612 ns ±  0.571 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▂       █▁       ▂                          
  ▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  5.46 ns        Histogram: frequency by time        5.75 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark rand($rng, s) setup=(s = AliasTable(normalize(rand(100), 1)))

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.500 ns … 23.375 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.584 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.621 ns ±  0.487 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     █        ▂                               
  ▂▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂ ▂
  5.5 ns         Histogram: frequency by time        5.75 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark rand($rng, s) setup=(s = AliasTable(normalize(rand(100_000), 1)))
BenchmarkTools.Trial: 2874 samples with 1000 evaluations.
 Range (min … max):  6.000 ns … 42.375 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.125 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.140 ns ±  0.764 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  █       ▇                                   
  ▂▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  6 ns           Histogram: frequency by time        6.29 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@codecov-commenter commented Jan 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c1705a3) 85.94% compared to head (af22151) 85.94%.

❗ Current head af22151 differs from pull request most recent head 7d8d6d9. Consider uploading reports for the commit 7d8d6d9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1831      +/-   ##
==========================================
- Coverage   85.94%   85.94%   -0.01%     
==========================================
  Files         144      144              
  Lines        8658     8656       -2     
==========================================
- Hits         7441     7439       -2     
  Misses       1217     1217              


@devmotion (Member) commented:

From the docs of ifelse:

In some cases, using `ifelse` instead of an `if` statement can eliminate the branch in generated code and provide higher performance in tight loops.

Maybe something along these lines is the reason for the performance improvements? Did you compare the output of @code_llvm?
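As a toy illustration of that point (a sketch only; pick_branch and pick_ifelse are made-up names, not the AliasTable code), one can compare the generated IR of a branching version against an ifelse version in the REPL:

```julia
# For such tiny functions LLVM may emit a select in both cases; the difference
# tends to matter more when the branch guards memory loads, as in the table lookup.
pick_branch(u, x, y) = u < 0.5 ? x : y
pick_ifelse(u, x, y) = ifelse(u < 0.5, x, y)

# In the REPL:
# julia> @code_llvm pick_branch(0.3, 1, 2)
# julia> @code_llvm pick_ifelse(0.3, 1, 2)
```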

@adienes (Contributor, Author) commented Jan 31, 2024

That's certainly plausible. Here is the output of @code_llvm on my machine; left is the status quo, right is the PR.

I do not really know how to interpret it, though :)

[screenshot: side-by-side @code_llvm output, status quo (left) vs. PR (right)]

@Zentrik commented Feb 1, 2024

It looks like the ifelse does remove a branch in the LLVM IR; however, the lack of @inbounds adds some branching (probably cheap, since we never expect the index to be out of bounds). I suspect that combining ifelse and @inbounds would produce the fastest result.
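For reference, a sketch of that combined variant, using the hypothetical MyAliasTable stand-in and definitions from the earlier sketch (not the merged code):

```julia
# Zentrik's suggested combination: branch-free select and no bounds checks.
function rand_ifelse_inbounds(rng::AbstractRNG, s::MyAliasTable)
    i = rand(rng, 1:length(s.alias))
    u = rand(rng)
    @inbounds r = ifelse(u < s.accept[i], i, s.alias[i])
    return r
end
```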

@adienes (Contributor, Author) commented Feb 1, 2024

I did try with @inbounds here, and on my machine it is actually ever-so-slightly slower (but basically equivalent). Given the performance equivalence, I defaulted to the "safer" option.

However, I don't know whether that result is CPU-specific. Is it possible @inbounds would have a bigger impact on a different architecture?

@Zentrik commented Feb 1, 2024

That seems plausible to me.

I imagine the CPU is able to eliminate the cost of bounds checking whilst benchmarking due to branch prediction.

@adienes (Contributor, Author) commented Feb 1, 2024

Makes sense. Happy to add the @inbounds back in.

Anything else needed? I can do more careful benchmarking across a range of inputs, but honestly this seems like a pretty quick win.

@devmotion (Member) commented:

IMO we should only use @inbounds if it is checked that the indices are valid; but as the function is currently written, i is only guaranteed to be a valid index of s.alias, not of s.accept.

@adienes (Contributor, Author) commented Feb 1, 2024

That's true. Personally I don't mind omitting the @inbounds, but just to play devil's advocate: couldn't a new(probs::AbstractVector) inner constructor be added to AliasTable to enforce that only the constructor AliasTable(probs) is valid, so that alias and accept are guaranteed to have the same indices?

@devmotion (Member) commented:

It would definitely be less likely to run into BoundsErrors, but even with an inner constructor you are allowed to push! to s.alias and/or s.accept, so it would not be guaranteed that the indices are valid.
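To make the concern concrete (again with the hypothetical MyAliasTable stand-in from the earlier sketch, not the package type):

```julia
t = MyAliasTable(fill(0.5, 10), collect(1:10))
push!(t.alias, 3)     # now length(t.alias) == 11 but length(t.accept) == 10
i = length(t.alias)   # 11 is a valid index of t.alias ...
t.accept[i]           # ... but throws a BoundsError; under @inbounds this
                      # would be undefined behaviour instead of an error
```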

@adienes (Contributor, Author) commented Feb 5, 2024

Makes sense to me; so this PR improves safety as well as performance!

@devmotion (Member) left a review comment:

Thank you!

@devmotion merged commit 3d71a2b into JuliaStats:master on Feb 5, 2024 (11 of 14 checks passed)
@vancleve commented Mar 4, 2024

Cool, I had thought that @inbounds would improve performance, but apparently not significantly.

Seems I should close JuliaStats/StatsBase.jl#630 then?
