Add optimized mapreduce implementation for SkipMissing #27743
Conversation
base/missing.jl (outdated diff)
i > ilast && return mapreduce_first(f, op, a1)
a2 = ai::eltype(itr)
# Unexpectedly, the following assertion allows SIMD instructions to be emitted
A[i]::eltype(itr)
The fact that this line is required to get SIMD is really intriguing. In the present case, we know A[i] is non-missing, so it should be perfectly fine to keep. And this assertion doesn't seem to force the compiler to incorrectly assume that all entries in A are non-missing, since we do get missing values in the loop below.
Minimal reproducer:
function g(A)
    v = 0
    # Uncomment to get SIMD
    # A[1]::Base.nonmissingtype(eltype(A))
    @simd for i = 1:length(A)
        @inbounds ai = A[i]
        if ai !== missing
            v += ai
        end
    end
    return v
end
@code_llvm g([1,missing])
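A quick way to confirm whether the loop was actually vectorized, instead of reading the whole IR dump, is to search the emitted LLVM for a vectorized loop body. This is just a sketch assuming the g above is defined; "vector.body" is the conventional label used by LLVM's loop vectorizer, not a stable API.

using InteractiveUtils

# Capture the LLVM IR emitted for g on a Vector{Union{Int, Missing}} and
# check whether LLVM created a vectorized loop body.
function simd_emitted(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes)
    return occursin("vector.body", String(take!(io)))
end

simd_emitted(g, Tuple{Vector{Union{Int, Missing}}})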
Maybe open the minimal reproducer in a standalone issue in order to get more attention?
Seems like something might be preventing LICM (loop-invariant code motion): in the non-SIMD version, the data pointer is reloaded on every iteration, and any non-functional code that uses it in the loop header will allow for SIMD.
julia> _false = false

julia> function g(A)
           v = 0
           # Dead use of A[1] in the header allows SIMD
           x = A[1]
           if isa(x, Missing) && _false
               error()
           end
           @simd for i = 1:length(A)
               @inbounds ai = A[i]
               if ai !== missing
                   v += ai
               end
           end
           return v
       end
Yes, I can file a separate issue if that's useful, but @vtjnash can probably write a much more useful description than I can.
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
The Nanosoldier results are really mixed. @ararslan Could something weird be going on with the comparison? I ask because at #27386 there are improvements which I cannot reproduce locally either.
Probable explanation here: #27386 (comment). Let's run it again. @nanosoldier
Restarted the server. @nanosoldier
Unfortunately, I don't think that was the case here. I've not (yet) known Nanosoldier to blatantly lie to us. In this case, it reported comparing 9e7f85d against 0978251. Those are the correct commits: neither the PR nor master had the random change incorporated.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Indeed, the regressions are still there. I'm really stuck. I've tried again locally after rebasing on master, and for example I get this:

julia> master["array"]["(\"skipmissing\", sum, Float64, false)"]
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.462 μs (0.00% GC)
  median time:      1.524 μs (0.00% GC)
  mean time:        1.568 μs (0.00% GC)
  maximum time:     33.038 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> pr["array"]["(\"skipmissing\", sum, Float64, false)"]
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     732.000 ns (0.00% GC)
  median time:      793.000 ns (0.00% GC)
  mean time:        5.053 μs (0.00% GC)
  maximum time:     42.571 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

What can we do about that? Would somebody check locally whether my results are confirmed?
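In case someone wants to sanity-check this without the full BaseBenchmarks setup, here is a rough approximation with BenchmarkTools. The vector length of 1000 and the reading of "false" as "no missing values actually present" are my assumptions, not taken from the benchmark definition.

using BenchmarkTools

# Rough stand-in for the ("skipmissing", sum, Float64, false) benchmark:
# a Vector{Union{Float64, Missing}} that contains no actual missing values.
x = convert(Vector{Union{Float64, Missing}}, rand(1000))

@benchmark sum(skipmissing($x))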
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Based on the AbstractArray methods. Allows the compiler to emit SIMD instructions for SkipMissing{Vector{Int}}, but also improves performance for SkipMissing{Vector{Float64}}.
I've found out why Nanosoldier finds regressions. I'm still not sure why I wasn't able to reproduce the problem before, but at least now I see it when running BaseBenchmarks. It only happens for vectors of ~1000 entries or less (as in the benchmark, but not in my other checks), and only when …. I've fixed it by only using the new method when …. There's still a 20% regression locally for ….
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Good news: most regressions have disappeared and turned into improvements. The only remaining related regressions are for ….
So barring objections I'll merge this tomorrow.
Awesome! But just to keep in mind, another common operation on datasets is to compute the mean per group, where the typical group size may be < 1000.
Yeah, I realized that too. Anyway, the regression isn't too bad, and the goal is really to enable SIMD, which will be faster in all cases (except maybe for very small vectors).
i > ilast && return Some(mapreduce_first(f, op, a1))
a2 = ai::eltype(itr)
# Unexpectedly, the following assertion allows SIMD instructions to be emitted
A[i]::eltype(itr)
Fixed by d803472 (missing align attribute on dereferenceable arguments).
Let's check this: #28035.
Based on the AbstractArray methods. Allows the compiler to emit SIMD instructions for SkipMissing{Vector{Int}}, but also improves performance for SkipMissing{Vector{Float64}}.

The performance improvement is clear for Int (3×-5× less time), but less so for Float64 (-30%). However, since SIMD now works for Int, there's a chance it could eventually be enabled for Float64, and it should be much easier than with the generic mapreduce fallback. Also, that 30% is enough to get as fast as R (maybe even a bit faster), which is an important reference point (see this Discourse thread).

Fixes #27679.