`filter` sometimes slower than going via `map` #32303

nickrobinson251 · 2019-06-12T14:17:32Z

Here's are a couple different cases I found in the wild (I've not done extensive benchmarking). I found this result surprising.

Strings:

julia> haystack = repeat(["ab","cd","ef","gh"], 250);

julia> needles = ["ab","cd","gh"];

julia> using BenchmarkTools

julia> @benchmark filter(in($needles), $haystack)
BenchmarkTools.Trial:
  memory estimate:  16.41 KiB
  allocs estimate:  11
  --------------
  minimum time:     16.004 μs (0.00% GC)
  median time:      21.223 μs (0.00% GC)
  mean time:        21.417 μs (2.00% GC)
  maximum time:     1.803 ms (98.86% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $haystack[map(in($needles), $haystack)]
BenchmarkTools.Trial:
  memory estimate:  7.16 KiB
  allocs estimate:  6
  --------------
  minimum time:     11.219 μs (0.00% GC)
  median time:      15.341 μs (0.00% GC)
  mean time:        15.030 μs (0.00% GC)
  maximum time:     490.489 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Numbers:

julia> haystack = rand(100_000);

julia> @benchmark filter(x -> x < 0.5, $haystack)
BenchmarkTools.Trial:
  memory estimate:  1.00 MiB
  allocs estimate:  16
  --------------
  minimum time:     751.703 μs (0.00% GC)
  median time:      1.102 ms (0.00% GC)
  mean time:        1.090 ms (0.00% GC)
  maximum time:     2.082 ms (0.00% GC)
  --------------
  samples:          4329
  evals/sample:     1

julia> @benchmark $haystack[map(x -> x < 0.5, $haystack)]
BenchmarkTools.Trial:
  memory estimate:  489.34 KiB
  allocs estimate:  7
  --------------
  minimum time:     378.179 μs (0.00% GC)
  median time:      544.392 μs (0.00% GC)
  mean time:        532.841 μs (4.77% GC)
  maximum time:     38.854 ms (97.98% GC)
  --------------
  samples:          9347
  evals/sample:     1

Version:

julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc (2019-05-16 04:10 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

The text was updated successfully, but these errors were encountered:

dkarrasch · 2019-06-12T16:58:32Z

Seems like you hit this:

julia/base/abstractset.jl

Lines 337 to 343 in f49cb42

    
           # TODO: delete mapfilter in favor of comprehensions/foldl/filter when competitive 
        
           function mapfilter(pred, f, itr, res) 
        
               for x in itr 
        
                   pred(x) && f(res, x) 
        
               end 
        
               res 
        
           end

At first, I thought that this method gets called, which is synonymous to your faster methods:

julia/base/array.jl

Line 2312 in 55e36cc

filter(f, As::AbstractArray) = As[map(f, As)::AbstractArray{Bool}]

But actually, it's this one:

julia/base/array.jl

Line 2351 in 55e36cc

filter(f, a::Vector) = mapfilter(f, push!, a, empty(a))

which leads to the first code quote. In this spirit, consider this:

julia> haystack = reshape(rand(100_000), 1,100000)
1×100000 Array{Float64,2}:
 0.420187  0.60593  0.512718  0.174329  …  0.657166  0.304862  0.223922
julia> @benchmark filter(x -> x < 0.5, $haystack)
BenchmarkTools.Trial: 
  memory estimate:  489.33 KiB
  allocs estimate:  8
  --------------
  minimum time:     442.609 μs (0.00% GC)
  median time:      461.997 μs (0.00% GC)
  mean time:        513.938 μs (2.02% GC)
  maximum time:     35.242 ms (98.61% GC)
  --------------
  samples:          9680
  evals/sample:     1

julia> @benchmark $haystack[map(x -> x < 0.5, $haystack)]
BenchmarkTools.Trial: 
  memory estimate:  489.28 KiB
  allocs estimate:  7
  --------------
  minimum time:     442.198 μs (0.00% GC)
  median time:      462.534 μs (0.00% GC)
  mean time:        517.566 μs (2.02% GC)
  maximum time:     32.510 ms (98.60% GC)
  --------------
  samples:          9610
  evals/sample:     1

Now you trigger the code that is synonymous to your faster method.

JeffBezanson · 2019-06-12T20:00:44Z

Yes, must be due to using repeated push!. That will be slower and use more memory due to growing the array. The other version needs a temporary Bool array, but it's 1/8th the size.

nickrobinson251 · 2019-06-13T17:46:50Z

Could filter(f, a::Vector) = mapfilter(f, push!, a, empty(a)) simply be removed, so we instead hit filter(f, As::AbstractArray) = As[map(f, As)::AbstractArray{Bool}]?

dkarrasch · 2019-06-13T20:41:20Z

I guess it depends on how often the boolean function returns true.

julia> using BenchmarkTools

julia> haystack = rand(100_000);

julia> @benchmark filter(x -> x < 5e-5, $haystack)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     83.695 μs (0.00% GC)
  median time:      83.765 μs (0.00% GC)
  mean time:        91.617 μs (0.00% GC)
  maximum time:     2.148 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> haystack = reshape(haystack, 1,100000);

julia> @benchmark filter(x -> x < 5e-5, $haystack)
BenchmarkTools.Trial: 
  memory estimate:  97.97 KiB
  allocs estimate:  7
  --------------
  minimum time:     60.317 μs (0.00% GC)
  median time:      101.090 μs (0.00% GC)
  mean time:        124.740 μs (12.38% GC)
  maximum time:     51.757 ms (99.66% GC)
  --------------
  samples:          10000
  evals/sample:     1

So the mapfilter version is a bit faster, but not much.

dkarrasch · 2019-06-14T07:34:12Z

I just realized that there is WIP in #31929 regarding this exact issue. So I assume we can close this?

nickrobinson251 · 2019-06-16T09:28:59Z

Yes, this can be closed when that PR is merged. Thanks!

KristofferC · 2019-07-20T21:46:51Z

Which it now is.

ararslan added the performance Must go faster label Jun 12, 2019

dkarrasch mentioned this issue Jun 14, 2019

speed up filter of vector #31929

Merged

KristofferC closed this as completed Jul 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`filter` sometimes slower than going via `map` #32303

`filter` sometimes slower than going via `map` #32303

nickrobinson251 commented Jun 12, 2019

dkarrasch commented Jun 12, 2019

JeffBezanson commented Jun 12, 2019

nickrobinson251 commented Jun 13, 2019

dkarrasch commented Jun 13, 2019

dkarrasch commented Jun 14, 2019

nickrobinson251 commented Jun 16, 2019

KristofferC commented Jul 20, 2019

filter sometimes slower than going via map #32303

filter sometimes slower than going via map #32303

Comments

nickrobinson251 commented Jun 12, 2019

dkarrasch commented Jun 12, 2019

JeffBezanson commented Jun 12, 2019

nickrobinson251 commented Jun 13, 2019

dkarrasch commented Jun 13, 2019

dkarrasch commented Jun 14, 2019

nickrobinson251 commented Jun 16, 2019

KristofferC commented Jul 20, 2019

`filter` sometimes slower than going via `map` #32303

`filter` sometimes slower than going via `map` #32303