
Add optimized mapreduce implementation for SkipMissing #27743

Merged (3 commits), Jul 6, 2018
Conversation

nalimilan (Member)

Based on the AbstractArray methods. Allows the compiler to emit SIMD instructions, for SkipMissing{Vector{Int}}, but also improves performance for SkipMissing{Vector{Float64}}.

The performance improvement is clear for Int (3×-5× less time), but less so for Float64 (-30%). However, since SIMD now works for Int, there's a chance it could eventually be enabled for Float64 too, and that should be much easier than with the generic mapreduce fallback. Also, this 30% is enough to match R's speed (maybe even beat it slightly), which is an important reference point (see this Discourse thread).

Fixes #27679.
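For context, the core pattern the new implementation relies on can be sketched in a few lines. This is a minimal, self-contained illustration (the name `sum_skipmissing` is illustrative, not Base's actual method): iterate the raw array, branch on `missing`, and let `@simd` vectorize when the compiler cooperates.

```julia
# Illustrative sketch of the skip-missing reduction pattern; NOT the actual
# Base implementation from this PR.
function sum_skipmissing(A::AbstractArray{Union{Missing,T}}) where {T<:Number}
    v = zero(T)
    @simd for i in eachindex(A)
        @inbounds ai = A[i]
        if ai !== missing
            v += ai   # only accumulate non-missing entries
        end
    end
    return v
end

sum_skipmissing(Union{Missing,Int}[1, missing, 2, 3])  # 6
```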

# With Vector{Int}
X1 = rand(Int, 10_000_000);
X2 = Vector{Union{Missing,Int}}(X1);
X3 = ifelse.(rand(length(X2)) .< 0.9, X2, missing);

# With Vector{Float64}
Y1 = rand(Float64, 10_000_000);
Y2 = Vector{Union{Missing,Float64}}(Y1);
Y3 = ifelse.(rand(length(Y2)) .< 0.9, Y2, missing);

using BenchmarkTools

# Before

julia> @btime sum(skipmissing(X1));
 7.316 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(X2));
 18.175 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(X3));
 28.240 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(Y1));
 13.816 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(Y2));
 18.254 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(Y3));
 28.430 ms (2 allocations: 32 bytes)

# After

julia> @btime sum(skipmissing(X1));
  5.918 ms (4 allocations: 64 bytes)

julia> @btime sum(skipmissing(X2));
  6.113 ms (4 allocations: 64 bytes)

julia> @btime sum(skipmissing(X3));
  6.142 ms (4 allocations: 64 bytes)

julia> @btime sum(skipmissing(Y1));
  5.830 ms (4 allocations: 64 bytes)

julia> @btime sum(skipmissing(Y2));
  13.816 ms (4 allocations: 64 bytes)

julia> @btime sum(skipmissing(Y3));
  19.018 ms (4 allocations: 64 bytes)

@nalimilan added the "performance: Must go faster" and "missing data: Base.missing and related functionality" labels on Jun 23, 2018
base/missing.jl (outdated diff):
i > ilast && return mapreduce_first(f, op, a1)
a2 = ai::eltype(itr)
# Unexpectedly, the following assertion allows SIMD instructions to be emitted
A[i]::eltype(itr)
nalimilan (Member, Author):

The fact that this line is required to get SIMD is really intriguing. In the present case, we know A[i] is non-missing, so it should be perfectly fine to keep. And this assertion doesn't seem to force the compiler to incorrectly assume that all entries in A are non-missing, since we do get missing values in the loop below.

Minimal reproducer:

function g(A)
    v = 0
    # Uncomment to get SIMD
    # A[1]::Base.nonmissingtype(eltype(A))
    @simd for i = 1:length(A)
        @inbounds ai = A[i]
        if ai !== missing
            v += ai
        end
    end
    return v
end

@code_llvm g([1,missing])
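One way to check programmatically whether the loop vectorized (an assumption on my part: scanning the printed LLVM IR for vector types such as `<4 x i64>`; the exact width depends on the host CPU, so no fixed output is guaranteed):

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Capture the LLVM IR for g (as defined above) as a String.
code = sprint(code_llvm, g, Tuple{Vector{Union{Missing,Int}}})

# Heuristic: vector types like <4 x i64> in the IR indicate SIMD was emitted.
occursin(r"<\d+ x i64>", code)
```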

Member:

Maybe open the minimal reproducer in a standalone issue in order to get more attention?

Member:

Seems like something might be preventing LICM (loop-invariant code motion): in the non-SIMD version the data pointer is reloaded on every iteration, and any non-functional code that uses it in the loop header allows SIMD.

julia> _false = false
julia> function g(A)
           v = 0
           # Uncomment to get SIMD
           x = A[1]
           if isa(x, Missing) && _false
               error()
           end
           @simd for i = 1:length(A)
               @inbounds ai = A[i]
               if ai !== missing
                   v += ai
               end
           end
           return v
       end

nalimilan (Member, Author):

Yes, I can file a separate issue if that's useful, but @vtjnash can probably write a much more useful description than I can.

@nalimilan (Member, Author)

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan (Member, Author)

Nanosoldier results are quite mixed: sum is about twice as fast when 10% of the input values are missing, but about 10%-30% slower when there are no missing values. The pattern is consistent across types, so it doesn't look like noise. That's weird, since locally I see only one regression (most likely spurious, since it's only for Int8).

@ararslan Could something weird be going on with the comparison? I ask because at #27386 there are improvements which I cannot reproduce locally either.

@nalimilan (Member, Author)

Probable explanation here: #27386 (comment). Let's run it again.

@nanosoldier runbenchmarks("union", vs = ":master")

@ararslan (Member)

Restarted the server. @nanosoldier runbenchmarks("union", vs=":master")

@mbauman (Member) commented Jun 26, 2018

Probable explanation here: #27386 (comment).

Unfortunately, I don't think that was the case here. I've not (yet) known Nanosoldier to blatantly lie to us. In this case, it reported comparing 9e7f85d against 0978251. Those are the correct commits — neither the PR nor master had the random change incorporated.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan (Member, Author)

Indeed, the regressions are still there. I'm really stuck. I've tried again locally after rebasing on master, and for example I get this for Vector{Union{Float64,Missing}} without missing values, while the Nanosoldier report says it's a 32% regression! The increase in allocations is consistent though (and it's negligible).

julia> master["array"]["(\"skipmissing\", sum, Float64, false)"]
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.462 μs (0.00% GC)
  median time:      1.524 μs (0.00% GC)
  mean time:        1.568 μs (0.00% GC)
  maximum time:     33.038 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> pr["array"]["(\"skipmissing\", sum, Float64, false)"]
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     732.000 ns (0.00% GC)
  median time:      793.000 ns (0.00% GC)
  mean time:        5.053 μs (0.00% GC)
  maximum time:     42.571 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

What can we do about that? Would somebody check locally whether my results are confirmed?

@nalimilan (Member, Author)

@nanosoldier runbenchmarks("union", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan (Member, Author) commented Jul 5, 2018

I've found out why Nanosoldier finds regressions. I'm still not sure why I wasn't able to reproduce the problem before, but at least now I see it when running BaseBenchmarks. It only happens for vectors of ~1000 entries or fewer (as in the benchmark, but not in my other checks), and only when !(eltype(A) >: Missing). It turns out the benchmarks hit that case when there are no missing values, because a comprehension narrowed the element type automatically, which I hadn't realized. So I wasn't doing the right comparison.

I've fixed it by only using the new method when eltype(A) >: Missing. The specialized method clearly isn't supposed to be faster when !(eltype(A) >: Missing), since it adds two loops per block to identify the first two non-missing entries.

There's still a 20% regression locally for Array{Float32}, probably because vectorization currently doesn't work for floats, so the new method doesn't add much there (contrary to integers), while its extra code has a fixed cost that is significant for arrays of ~1000 entries. Let's see whether Nanosoldier confirms it.
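Both points above can be demonstrated in a few lines. The first snippet shows how a comprehension narrows the element type when no missing value is actually present (so such a benchmark never exercises the Union code path); the second shows the kind of guard the fix uses (`takes_fast_path` is an illustrative name, not the actual function in the PR):

```julia
# A comprehension over a Union array with no actual missings narrows eltype.
narrowed = [x for x in Union{Missing,Int}[1, 2, 3]]
eltype(narrowed)   # Int, not Union{Missing,Int}

# Illustrative guard: only take the specialized path when the element type
# can actually hold `missing`.
takes_fast_path(A::AbstractArray) = eltype(A) >: Missing

takes_fast_path(narrowed)                        # false
takes_fast_path(Union{Missing,Int}[1, missing])  # true
```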

@nanosoldier runbenchmarks("union", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan (Member, Author)

Good news: most regressions have disappeared or turned into improvements. The only remaining related regressions are for Float64 and Float32. That seems unfortunate given that floats are one of the main use cases for missing, but they only appear for arrays of length 1000 and progressively disappear for larger vectors, turning into improvements. I've filed JuliaCI/BaseBenchmarks.jl#220 to use larger arrays in the benchmark, since that's more representative of work on real data sets. Anyway, this PR is one step toward enabling SIMD instructions, which will make a large difference (as it currently does for integers).

So barring objections I'll merge this tomorrow.

@nalimilan nalimilan merged commit a60119b into master Jul 6, 2018
@nalimilan nalimilan deleted the nl/mapreduce branch July 6, 2018 20:54
@matthieugomez (Contributor) commented Jul 7, 2018

Awesome! But just to keep in mind, another common operation on datasets is computing the mean per group, where the size of each group may typically be < 1000.

@nalimilan (Member, Author)

Yeah, I realized that too. Anyway the regression isn't too bad, and the goal is really to enable SIMD, which will be faster in all cases (except maybe very small vectors).
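The grouped-mean use case mentioned above, where each group may hold fewer than ~1000 elements, can be sketched as follows (a minimal illustration; `group_means` and its signature are my own, not an API from any package):

```julia
# Per-group mean, skipping missing values. Groups are encoded as integer
# labels in 1:ngroups; each group may be small, which is exactly the regime
# where the fixed per-call cost discussed above matters.
function group_means(groups::Vector{Int}, x::Vector{Union{Missing,Float64}},
                     ngroups::Int)
    sums   = zeros(ngroups)
    counts = zeros(Int, ngroups)
    for (g, v) in zip(groups, x)
        if v !== missing
            sums[g] += v
            counts[g] += 1
        end
    end
    return sums ./ counts
end

group_means([1, 1, 2, 2], Union{Missing,Float64}[1.0, missing, 2.0, 4.0], 2)  # [1.0, 3.0]
```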

i > ilast && return Some(mapreduce_first(f, op, a1))
a2 = ai::eltype(itr)
# Unexpectedly, the following assertion allows SIMD instructions to be emitted
A[i]::eltype(itr)
Member:

Fixed by d803472 (missing align attribute on dereferenceable Arguments)

nalimilan (Member, Author):

Let's check this: #28035.
