performance of getindex #54

Closed · bkamins opened this issue Feb 3, 2021 · 10 comments · Fixed by #56

@bkamins (Member) commented Feb 3, 2021

A bottleneck in the fast join in DataFrames.jl is:

PooledArray(RefArray(getindex(A.refs, I)), copy(A.invpool))

as copy(A.invpool) is very expensive for large pools (it also usually leaves the pool much larger than needed). I see two options:

  1. do not copy the pool (and share it, as in PooledArray we do not allow removing levels from the pool)
  2. if the new array is short relative to the old one, allocate a fresh pool

This case will hit us in the H2O benchmarks, which heavily use PooledArrays (and probably already hits us significantly, though we did not know about it).

CC @nalimilan @quinnj

@bkamins (Member, Author) commented Feb 4, 2021

So my proposal is to define getindex in PooledArrays.jl for the vector case as:

Base.@propagate_inbounds function Base.getindex(A::PooledArray, I::Union{Real,AbstractVector}...)
    PooledArray(RefArray(getindex(A.refs, I...)), A.invpool, A.pool)
end

without copying invpool and pool.

The reason is that we currently do not provide a way to shrink the pool in a PooledArray, so sharing it is not a problem.

Potential problems, since we are past the 1.0 release:

  • sharing pool and invpool is not thread-safe (but we were not thread-safe anyway)
  • someone might rely on the fact that they are not shared across arrays

Additionally, we could add a shrink_pool method (or similarly named) that would recode pool, invpool, and refs to make the representation compact (this is useful in its own right); see the sketch below.

Note that currently we just copy invpool, so the vector returned by getindex carries the full invpool from the source even if most of it is unused.
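
A minimal sketch of what such a shrink_pool could look like, assuming the three-argument PooledArray(RefArray(...), invpool, pool) constructor from the proposal above; the name and exact semantics are illustrative, not existing API:

using PooledArrays
using PooledArrays: RefArray

# Hypothetical helper: rebuild pool, invpool, and refs so that only the
# levels actually referenced by A remain.
function shrink_pool(A::PooledArray)
    R = eltype(A.refs)
    used = sort!(unique(A.refs))          # old ref codes still in use
    remap = Dict{R,R}(old => R(new) for (new, old) in enumerate(used))
    newpool = A.pool[used]                # keep only the used levels
    newinvpool = Dict(v => R(i) for (i, v) in enumerate(newpool))
    newrefs = R[remap[r] for r in A.refs]
    return PooledArray(RefArray(newrefs), newinvpool, newpool)
end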

@nalimilan (Member)

So after discussing this on Slack, it seems that making getindex share the pool with the original array would be a good idea, except for thread safety. Currently we're not thread-safe, but at least you know that, as long as you're not mutating a PooledArray from other threads, you can safely do anything with it from a given thread. If we changed getindex to share pools across arrays, mutating an array in one thread could give wrong results when another thread writes to, or even reads from, another array.

I would say that it's too dangerous. Unfortunately, it's not yet clear whether it's possible to ensure thread safety without killing the performance of getindex. We could copy the pool when nthreads() > 1. That shouldn't affect user-visible behavior, but it could trigger unexpected bugs...
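
For illustration, a hypothetical sketch of that nthreads()-dependent idea (the function name and structure are assumptions, not a concrete proposal):

using PooledArrays
using PooledArrays: RefArray
using Base.Threads: nthreads

# Illustrative only: share pools in single-threaded sessions and copy them
# defensively when more than one thread could touch the arrays.
function getindex_threaded_sketch(A::PooledArray, I::AbstractVector)
    newrefs = A.refs[I]
    if nthreads() > 1
        return PooledArray(RefArray(newrefs), copy(A.invpool), copy(A.pool))
    else
        return PooledArray(RefArray(newrefs), A.invpool, A.pool)  # shared pools
    end
end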

@bkamins (Member, Author) commented Feb 13, 2021

Assume x is a PooledArray. I have investigated this further and it seems that:

y = similar(x)
copyto!(y, x)

is as fast as x[:] (and it will of course be faster if we take a smaller slice), so maybe we could generalize this? Any opinions?

Also note that:

julia> x = string.(1:10^7);

julia> @time pa = PooledArray(x);
  6.741751 seconds (982.11 k allocations: 630.399 MiB, 5.68% gc time)

julia> @time pa = PooledArray(x);
  6.771693 seconds (100 allocations: 580.985 MiB, 12.36% gc time)

julia> @time pa[:];
  6.193193 seconds (499.18 k allocations: 606.728 MiB)

julia> @time pa[:];
  5.948993 seconds (100 allocations: 580.985 MiB)

julia> @time ca = categorical(x);
 37.319457 seconds (21.70 M allocations: 3.182 GiB, 16.48% gc time)

julia> @time ca = categorical(x);
 41.931994 seconds (20.00 M allocations: 3.099 GiB, 27.95% gc time)

julia> @time ca[:];
  2.378682 seconds (10.13 M allocations: 710.606 MiB, 84.73% gc time)

julia> @time ca[:];
  2.630116 seconds (10.00 M allocations: 703.911 MiB, 89.17% gc time)

so there is a trade-off between construction and slicing. @nalimilan - do you know the reasons?

@nalimilan (Member)

> Assume x is a PooledArray. I have investigated this further and it seems that:
>
> y = similar(x)
> copyto!(y, x)
>
> is as fast as x[:] (and it will of course be faster if we take a smaller slice), so maybe we could generalize this? Any opinions?

Sorry, I don't get this.

> so there is a trade-off between construction and slicing. @nalimilan - do you know the reasons?

I don't think there's a trade-off, really. Maybe it's just a missing optimization in PooledArrays? For example, it doesn't define IndexStyle to be IndexLinear.

@bkamins (Member, Author) commented Feb 14, 2021

> Sorry, I don't get this.

I mean this:

julia> using BenchmarkTools

julia> using PooledArrays

julia> x = PooledArray(string.(1:10^6));

julia> @benchmark $x[:]
BenchmarkTools.Trial: 
  memory estimate:  62.65 MiB
  allocs estimate:  78
  --------------
  minimum time:     402.977 ms (0.00% GC)
  median time:      457.770 ms (7.83% GC)
  mean time:        462.088 ms (10.46% GC)
  maximum time:     570.657 ms (27.85% GC)
  --------------
  samples:          11
  evals/sample:     1

julia> @benchmark copy!(similar($x), $x)
BenchmarkTools.Trial: 
  memory estimate:  62.65 MiB
  allocs estimate:  77
  --------------
  minimum time:     408.912 ms (0.00% GC)
  median time:      451.813 ms (8.15% GC)
  mean time:        462.571 ms (10.51% GC)
  maximum time:     574.612 ms (27.89% GC)
  --------------
  samples:          11
  evals/sample:     1

julia> @benchmark $x[1:1]
BenchmarkTools.Trial: 
  memory estimate:  33.63 MiB
  allocs estimate:  12
  --------------
  minimum time:     15.697 ms (0.00% GC)
  median time:      17.978 ms (0.00% GC)
  mean time:        31.866 ms (37.89% GC)
  maximum time:     163.571 ms (87.31% GC)
  --------------
  samples:          157
  evals/sample:     1

julia> @benchmark copyto!(similar($x, 1), 1, $x, 1, 1)
BenchmarkTools.Trial: 
  memory estimate:  816 bytes
  allocs estimate:  9
  --------------
  minimum time:     3.300 μs (0.00% GC)
  median time:      3.530 μs (0.00% GC)
  mean time:        3.706 μs (0.73% GC)
  maximum time:     275.115 μs (97.90% GC)
  --------------
  samples:          10000
  evals/sample:     8

Indeed:

Base.IndexStyle(::Type{<:PooledArray}) = IndexLinear()

is missing, and we should add it. However, I checked that adding it does not solve this issue.

@nalimilan (Member)

So you mean that when the requested slice is short, it would be faster to re-add the levels that are actually used to an empty pool than to copy the full pool? Yeah, in extreme cases like this it would help. What's hard is to identify appropriate thresholds. I guess we can act as if levels were distributed randomly in slices, so that the only parameter to take into account is the ratio between the length of the slice and the total number of levels?
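
A hypothetical sketch of that heuristic, reusing the shrink_pool sketch from earlier in the thread (the cutoff is purely illustrative; the actual fix landed in #56):

# Illustrative threshold: if the slice is shorter than the pool, rebuild a
# compact pool from the used levels instead of copying the full invpool.
function getindex_threshold_sketch(A::PooledArray, I::AbstractVector)
    newrefs = A.refs[I]
    if length(newrefs) < length(A.pool)   # illustrative cutoff only
        # The intermediate array shares A's pools, but shrink_pool immediately
        # builds fresh compact ones, so the large invpool is never copied.
        return shrink_pool(PooledArray(RefArray(newrefs), A.invpool, A.pool))
    else
        return PooledArray(RefArray(newrefs), copy(A.invpool), copy(A.pool))
    end
end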

BTW, it's a shame that we don't implement any optimized methods for copy! and copyto!.

@bkamins (Member, Author) commented Feb 14, 2021

Yes - this is roughly what I think we should do.

@bkamins (Member, Author) commented Feb 20, 2021

I will propose a PR for this. In the meantime, I have noticed that the current code was already sharing invpool between different arrays in this case:

julia> x = PooledArray(1:3)
3-element PooledArray{Int64,UInt32,1,Array{UInt32,1}}:
 1
 2
 3

julia> y = reverse(x)
3-element PooledArray{Int64,UInt32,1,Array{UInt32,1}}:
 3
 2
 1

julia> y[1] = 4
4

julia> x
3-element PooledArray{Int64,UInt32,1,Array{UInt32,1}}:
 1
 2
 3

julia> x.invpool
Dict{Int64,UInt32} with 4 entries:
  4 => 0x00000004
  2 => 0x00000002
  3 => 0x00000003
  1 => 0x00000001

@nalimilan (Member)

Wow, that's really bad.

@bkamins (Member, Author) commented Feb 20, 2021

Fortunately, it is only bad for threading, and I guess that is currently a very rare use case.
