Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter sometimes slower than going via map #32303

Closed
nickrobinson251 opened this issue Jun 12, 2019 · 7 comments
Closed

filter sometimes slower than going via map #32303

nickrobinson251 opened this issue Jun 12, 2019 · 7 comments
Labels
performance Must go faster

Comments

@nickrobinson251
Copy link
Contributor

Here's are a couple different cases I found in the wild (I've not done extensive benchmarking). I found this result surprising.

Strings:

julia> haystack = repeat(["ab","cd","ef","gh"], 250);

julia> needles = ["ab","cd","gh"];

julia> using BenchmarkTools

julia> @benchmark filter(in($needles), $haystack)
BenchmarkTools.Trial:
  memory estimate:  16.41 KiB
  allocs estimate:  11
  --------------
  minimum time:     16.004 μs (0.00% GC)
  median time:      21.223 μs (0.00% GC)
  mean time:        21.417 μs (2.00% GC)
  maximum time:     1.803 ms (98.86% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $haystack[map(in($needles), $haystack)]
BenchmarkTools.Trial:
  memory estimate:  7.16 KiB
  allocs estimate:  6
  --------------
  minimum time:     11.219 μs (0.00% GC)
  median time:      15.341 μs (0.00% GC)
  mean time:        15.030 μs (0.00% GC)
  maximum time:     490.489 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Numbers:

julia> haystack = rand(100_000);

julia> @benchmark filter(x -> x < 0.5, $haystack)
BenchmarkTools.Trial:
  memory estimate:  1.00 MiB
  allocs estimate:  16
  --------------
  minimum time:     751.703 μs (0.00% GC)
  median time:      1.102 ms (0.00% GC)
  mean time:        1.090 ms (0.00% GC)
  maximum time:     2.082 ms (0.00% GC)
  --------------
  samples:          4329
  evals/sample:     1

julia> @benchmark $haystack[map(x -> x < 0.5, $haystack)]
BenchmarkTools.Trial:
  memory estimate:  489.34 KiB
  allocs estimate:  7
  --------------
  minimum time:     378.179 μs (0.00% GC)
  median time:      544.392 μs (0.00% GC)
  mean time:        532.841 μs (4.77% GC)
  maximum time:     38.854 ms (97.98% GC)
  --------------
  samples:          9347
  evals/sample:     1

Version:

julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc (2019-05-16 04:10 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
@ararslan ararslan added the performance Must go faster label Jun 12, 2019
@dkarrasch
Copy link
Member

Seems like you hit this:

julia/base/abstractset.jl

Lines 337 to 343 in f49cb42

# TODO: delete mapfilter in favor of comprehensions/foldl/filter when competitive
function mapfilter(pred, f, itr, res)
for x in itr
pred(x) && f(res, x)
end
res
end

At first, I thought that this method gets called, which is synonymous to your faster methods:

filter(f, As::AbstractArray) = As[map(f, As)::AbstractArray{Bool}]

But actually, it's this one:

filter(f, a::Vector) = mapfilter(f, push!, a, empty(a))

which leads to the first code quote. In this spirit, consider this:

julia> haystack = reshape(rand(100_000), 1,100000)
1×100000 Array{Float64,2}:
 0.420187  0.60593  0.512718  0.174329    0.657166  0.304862  0.223922
julia> @benchmark filter(x -> x < 0.5, $haystack)
BenchmarkTools.Trial: 
  memory estimate:  489.33 KiB
  allocs estimate:  8
  --------------
  minimum time:     442.609 μs (0.00% GC)
  median time:      461.997 μs (0.00% GC)
  mean time:        513.938 μs (2.02% GC)
  maximum time:     35.242 ms (98.61% GC)
  --------------
  samples:          9680
  evals/sample:     1

julia> @benchmark $haystack[map(x -> x < 0.5, $haystack)]
BenchmarkTools.Trial: 
  memory estimate:  489.28 KiB
  allocs estimate:  7
  --------------
  minimum time:     442.198 μs (0.00% GC)
  median time:      462.534 μs (0.00% GC)
  mean time:        517.566 μs (2.02% GC)
  maximum time:     32.510 ms (98.60% GC)
  --------------
  samples:          9610
  evals/sample:     1

Now you trigger the code that is synonymous to your faster method.

@JeffBezanson
Copy link
Member

Yes, must be due to using repeated push!. That will be slower and use more memory due to growing the array. The other version needs a temporary Bool array, but it's 1/8th the size.

@nickrobinson251
Copy link
Contributor Author

Could filter(f, a::Vector) = mapfilter(f, push!, a, empty(a)) simply be removed, so we instead hit filter(f, As::AbstractArray) = As[map(f, As)::AbstractArray{Bool}]?

@dkarrasch
Copy link
Member

I guess it depends on how often the boolean function returns true.

julia> using BenchmarkTools

julia> haystack = rand(100_000);

julia> @benchmark filter(x -> x < 5e-5, $haystack)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     83.695 μs (0.00% GC)
  median time:      83.765 μs (0.00% GC)
  mean time:        91.617 μs (0.00% GC)
  maximum time:     2.148 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> haystack = reshape(haystack, 1,100000);

julia> @benchmark filter(x -> x < 5e-5, $haystack)
BenchmarkTools.Trial: 
  memory estimate:  97.97 KiB
  allocs estimate:  7
  --------------
  minimum time:     60.317 μs (0.00% GC)
  median time:      101.090 μs (0.00% GC)
  mean time:        124.740 μs (12.38% GC)
  maximum time:     51.757 ms (99.66% GC)
  --------------
  samples:          10000
  evals/sample:     1

So the mapfilter version is a bit faster, but not much.

@dkarrasch
Copy link
Member

I just realized that there is WIP in #31929 regarding this exact issue. So I assume we can close this?

@nickrobinson251
Copy link
Contributor Author

Yes, this can be closed when that PR is merged. Thanks!

@KristofferC
Copy link
Member

Which it now is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

5 participants