Add pool sharing with copy on write #56

Merged
merged 39 commits into from
Mar 1, 2021

Conversation

bkamins
Member

@bkamins bkamins commented Feb 20, 2021

Fixes #54

Could you please have a look at the core of the design proposal? If we are OK with it, I will reimplement the rest, add tests, and update the documentation.

@bkamins
Member Author

bkamins commented Feb 20, 2021

Can you have a quick look and tell me whether this now looks like the right direction? (I will then polish the PR.)
It ensures that both getindex and copyto! will be fast in the generic code we have in the joining algorithm in DataFrames.jl.
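
For readers following the design discussion, here is a minimal sketch of the idea behind pool sharing with copy on write. It is not the code from this PR: `SharedPoolVector`, `slice`, `value_at` and `set_value!` are hypothetical names made up for illustration, and the sketch does not attempt a complete treatment of thread safety. It only shows why slicing can be cheap (only the integer refs are copied) and when the pool has to be copied (a new level is written while the pool is shared).

```julia
# Hypothetical illustration only; these names are NOT the PooledArrays.jl API.
mutable struct SharedPoolVector{T}
    refs::Vector{UInt32}            # per-element index into `pool`
    pool::Vector{T}                 # unique values (levels)
    invpool::Dict{T,UInt32}         # value => index into `pool`
    refcount::Threads.Atomic{Int}   # how many vectors share pool/invpool
end

function SharedPoolVector(values::AbstractVector{T}) where {T}
    pool = T[]
    invpool = Dict{T,UInt32}()
    refs = Vector{UInt32}(undef, length(values))
    for (i, v) in enumerate(values)
        # Look the value up, registering it as a new level if needed.
        refs[i] = get!(invpool, v) do
            push!(pool, v)
            UInt32(length(pool))
        end
    end
    return SharedPoolVector{T}(refs, pool, invpool, Threads.Atomic{Int}(1))
end

value_at(v::SharedPoolVector, i::Integer) = v.pool[v.refs[i]]

# Slicing copies only the refs; pool and invpool are shared and counted.
function slice(v::SharedPoolVector{T}, idx::AbstractVector{<:Integer}) where {T}
    Threads.atomic_add!(v.refcount, 1)
    return SharedPoolVector{T}(v.refs[idx], v.pool, v.invpool, v.refcount)
end

# Writing an existing level touches only refs. Writing a new level while
# the pool is shared first detaches this vector by copying pool/invpool.
function set_value!(v::SharedPoolVector{T}, x, i::Integer) where {T}
    val = convert(T, x)
    ref = get(v.invpool, val, UInt32(0))
    if ref == 0
        if v.refcount[] > 1
            Threads.atomic_sub!(v.refcount, 1)
            v.pool = copy(v.pool)
            v.invpool = copy(v.invpool)
            v.refcount = Threads.Atomic{Int}(1)
        end
        push!(v.pool, val)
        ref = UInt32(length(v.pool))
        v.invpool[val] = ref
    end
    v.refs[i] = ref
    return v
end

a = SharedPoolVector(["x", "y", "x"])
b = slice(a, 1:2)        # copies two refs, shares the pool with `a`
set_value!(b, "z", 1)    # new level: `b` detaches and gets a private pool
```

In this sketch `a` is unaffected by the write to `b`, and a write of an already known level would not have copied anything.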

@codecov

codecov bot commented Feb 20, 2021

Codecov Report

Merging #56 (3eb65e7) into main (32f1c2a) will increase coverage by 5.72%.
The diff coverage is 90.72%.

@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   79.41%   85.14%   +5.72%     
==========================================
  Files           1        1              
  Lines         204      249      +45     
==========================================
+ Hits          162      212      +50     
+ Misses         42       37       -5     
Impacted Files Coverage Δ
src/PooledArrays.jl 85.14% <90.72%> (+5.72%) ⬆️

@bkamins
Member Author

bkamins commented Feb 20, 2021

Benchmarks are as follows:

This PR:

julia> x = PooledArray(1:10^7);

julia> @time x[1:10^6];
  0.005101 seconds (7 allocations: 7.630 MiB)

julia> @time x[:];
  0.045468 seconds (6 allocations: 76.294 MiB)

julia> y = similar(x);

julia> @time copy!(y, x);
  0.005369 seconds

julia> y = similar(x);

julia> @time copyto!(y, x);
  0.003873 seconds

julia> @time copyto!(y, x);
  0.003631 seconds

julia> y = similar(x);

julia> @time copyto!(y, 1, x, 1, 10^6);
  0.000406 seconds (1 allocation: 16 bytes)

julia> @time copyto!(y, 1, x, 1, 10^6);
  0.000405 seconds (1 allocation: 16 bytes)

main branch:

julia> x = PooledArray(1:10^7);

julia> @time x[1:10^6];
  0.260175 seconds (14 allocations: 288.109 MiB, 3.09% gc time)

julia> @time x[:];
  1.897181 seconds (100 allocations: 580.985 MiB)

julia> y = similar(x);

julia> @time copy!(y, x);
  1.941643 seconds (90 allocations: 542.838 MiB, 3.85% gc time)

julia> y = similar(x);

julia> @time copyto!(y, x);
  1.965071 seconds (90 allocations: 542.838 MiB, 1.09% gc time)

julia> @time copyto!(y, x);
  1.038685 seconds

julia> y = similar(x);

julia> @time copyto!(y, 1, x, 1, 10^6);
  0.095800 seconds (69 allocations: 58.837 MiB)

julia> @time copyto!(y, 1, x, 1, 10^6);
  0.065425 seconds (1 allocation: 16 bytes)

@bkamins bkamins marked this pull request as ready for review February 20, 2021 18:57
@bkamins
Member Author

bkamins commented Feb 20, 2021

I need to add tests, but other than that this should be good for a review.

@bkamins
Member Author

bkamins commented Feb 20, 2021

@nalimilan + @quinnj : do you think we should add a separate branch when we detect that Julia is run in a single thread (a quite common case I think) and then we can just ignore everything and never copy pool and invpool?

@nalimilan
Member

Yes, looks quite good!

@nalimilan + @quinnj : do you think we should add a separate branch when we detect that Julia is run in a single thread (a quite common case I think) and then we can just ignore everything and never copy pool and invpool?

Good question. That makes the code harder to test so it's risky. Maybe better wait and see whether this would really matter in practice.
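
To make the trade-off concrete, here is a rough sketch of what such a single-thread branch could look like. `needs_private_pool` is a hypothetical helper, not something in PooledArrays.jl; only `Threads.nthreads()` is a real Base function. Passing the thread count as a keyword argument is one way to keep both branches testable from a single-threaded session, which is the testing concern raised above.

```julia
# Hypothetical helper, NOT part of PooledArrays.jl.
# With one thread there is no concurrent writer, so a shared pool would
# not need a private copy; with several threads it would.
needs_private_pool(shared::Bool; nthreads::Int = Threads.nthreads()) =
    shared && nthreads > 1

# Both branches can be exercised regardless of how Julia was started.
single = needs_private_pool(true; nthreads = 1)
multi = needs_private_pool(true; nthreads = 4)
```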

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Feb 21, 2021

That makes the code harder to test so it's risky.

OK. Then I will leave it as is for now.

@bkamins
Member Author

bkamins commented Feb 21, 2021

I will add tests for this functionality later, as I have to think about how to make them accurate, which is not easy.

Which version of PooledArrays.jl should this PR introduce?

Base.convert(::Type{PooledArray{S,R,N}}, pa::PooledArray{T,R,N}) where {S,T,R<:Integer,N} =
    PooledArray(RefArray(copy(pa.refs)), convert(Dict{S,R}, pa.invpool))
function Base.convert(::Type{PooledArray{S,R1,N}}, pa::PooledArray{T,R2,N}) where {S,T,R1<:Integer,R2<:Integer,N}
    if S === T && R1 === R2
Member Author

This is not covered, as it should never be true (we have a separate method for this case). Should I change this to an @assert?

Member

I'd just remove the branch.

Member Author

OK - removed
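
The reason the branch could never be taken is method specificity: when a separate method covers the case where source and target types coincide, dispatch selects it and the general method is only reached when the types differ. A minimal sketch of that pattern, using a made-up `Wrapper` type and `convert_wrapper` function rather than the actual `PooledArray` methods:

```julia
# Hypothetical stand-in types; this is not the PooledArrays.jl code.
struct Wrapper{T}
    values::Vector{T}
end

# Most specific method: source and target types coincide, nothing is copied.
convert_wrapper(::Type{Wrapper{T}}, w::Wrapper{T}) where {T} = w

# General method: dispatch only selects it when S and T differ, so a
# check like `S === T` inside it would be dead code.
convert_wrapper(::Type{Wrapper{S}}, w::Wrapper{T}) where {S,T} =
    Wrapper{S}(convert(Vector{S}, w.values))

w = Wrapper([1, 2])
same = convert_wrapper(Wrapper{Int}, w)          # hits the first method
widened = convert_wrapper(Wrapper{Float64}, w)   # hits the second method
```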

Base.@propagate_inbounds function Base.isassigned(pa::PooledArray, I::Int...)
    !iszero(pa.refs[I...])
if VERSION < v"1.1"
    Base.@propagate_inbounds function Base.getindex(A::SubArray{T,D,P,I,true},
Member

Is this really untested?

Member Author

I think coverage is not run on Julia 1.0 and this is the reason. Without this definition tests were failing on Julia 1.0 though.

Member Author

@bkamins bkamins Feb 28, 2021

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Feb 28, 2021

I will merge it tomorrow if there are no more comments and tag a release.

@test refcount(pat1)[] == 3

copy!(pat1, pav1)
@test pat1 == pav1
Member Author

@nalimilan - this line fails without the special method for Julia 1.0:

julia> pa = PooledArray(1:4)
4-element PooledArray{Int64,UInt32,1,Array{UInt32,1}}:
 1
 2
 3
 4

julia> pav1 = @view pa[2:3]
2-element view(::PooledArray{Int64,UInt32,1,Array{UInt32,1}}, 2:3) with eltype Int64:
 2
 3

julia> pav1[1]
ERROR: MethodError: getindex(::SubArray{Int64,1,PooledArray{Int64,UInt32,1,Array{UInt32,1}},Tuple{UnitRange{Int64}},true}, ::Int64) is ambiguous. Candidates:
  getindex(A::SubArray{#s22,N,#s20,I,L} where L where I where #s20<:PooledArray where #s22, I::Vararg{Int64,N}) where {T, N} in PooledArrays at /home/bkamins/.julia/dev/PooledArrays/src/PooledArrays.jl:476
  getindex(V::SubArray{T,N,P,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where P where N where T, i::Int64) in Base at subarray.jl:234
Possible fix, define
  getindex(::SubArray{T,1,P,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where P<:PooledArray where T, ::Int64)
Stacktrace:
 [1] top-level scope at none:0
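
The same kind of ambiguity can be reproduced in a few lines without `SubArray`. In the sketch below, `Pooled`, `View` and `lookup` are hypothetical names chosen to mirror the shape of the two candidates above: one method is specific in the parent type, the other in the fast-linear-indexing flag, and neither is more specific than the other. Defining a method for the intersection, as the error message suggests, resolves it. This is only an analogy to the Julia 1.0 situation, not the definition added in the PR.

```julia
# Hypothetical stand-ins for PooledArray and SubArray.
struct Pooled{T} end
struct View{P,L} end   # P: parent type, L: fast linear indexing flag

lookup(v::View{<:Pooled}, i::Int) = "pooled"                # specific in P
lookup(v::View{P,true}, i::Int) where {P} = "fast linear"   # specific in L

v = View{Pooled{Int},true}()

# Both methods apply and neither is more specific, so the call errors.
ambiguous = try
    lookup(v, 1)
    false
catch err
    err isa MethodError
end

# A method for the intersection of the two signatures removes the ambiguity.
lookup(v::View{<:Pooled,true}, i::Int) = "pooled"

resolved = lookup(v, 1)
```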

@bkamins bkamins merged commit 040053c into main Mar 1, 2021
@bkamins bkamins deleted the bk/pool_sharing branch March 1, 2021 08:28
@bkamins
Member Author

bkamins commented Mar 1, 2021

Thank you! Today, after tagging a release, I will test it in the new joining code in DataFrames.jl.

@cwiese

cwiese commented Oct 22, 2021

Upgrading PooledArrays for JuliaDB/IndexedTables from v0.5.2 causes a 10x slowdown in groupreduce. I suspect this thread-safe copy. We are trying to upgrade so that we can use the newest CSV.jl.

Is there no way to switch it off?

@bkamins
Member Author

bkamins commented Oct 23, 2021

Can you please open an issue with reproducible code showing the problem?
In general, pool sharing with copy on write should never degrade performance (it should only improve performance in cases where no copying of the pool is needed), so this should be fixed, but we need working code.

@cwiese

cwiese commented Oct 24, 2021

There is so much going on when you do a groupreduce in IndexedTables. I will try to reduce this to a PooledArray operation.

@nalimilan
Member

Also, if you can confirm that the regression appeared between, say, 0.5.1 and 0.5.2, that would help track down the problem. Do make sure that the other packages use the same versions.

@bkamins
Member Author

bkamins commented Oct 24, 2021

@cwiese - you can edit the appropriate methods in PooledArrays.jl that do pool copying so that they print some debugging information, to see when and whether they are invoked.
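
As a rough sketch of that kind of instrumentation: the helper below counts and logs every copy it performs. `traced_copy` and `POOL_COPIES` are hypothetical names, not functions in PooledArrays.jl; the idea would be to temporarily route the `copy` calls in the pool-copying methods through something like this in a local development checkout.

```julia
# Hypothetical debugging helper, NOT part of PooledArrays.jl.
const POOL_COPIES = Threads.Atomic{Int}(0)

function traced_copy(pool)
    # atomic_add! returns the previous value, so add one for the running count.
    n = Threads.atomic_add!(POOL_COPIES, 1) + 1
    println(stderr, "pool copy #", n, " (", length(pool), " levels)")
    return copy(pool)
end

levels = ["a", "b", "c"]
duplicate = traced_copy(levels)
```

If the counter stays at zero during the slow operation, pool copying is unlikely to be the cause of the slowdown.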

@cwiese

cwiese commented Oct 24, 2021

Okay, there is something else happening here. I will open an issue if I can narrow this down to PooledArrays. Thanks!

Successfully merging this pull request may close these issues.

PooledArray is potentially not thread safe performance of getindex
4 participants