Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always keep missing in pool #65

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Always keep missing in pool #65

wants to merge 3 commits into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Apr 30, 2021

This is WIP to show the changes so that pools can be shared between PooledArrays allowing and disallowing missing values. It seems we can have them without much complications.

Please comment if we would be OK with this design. Then I will finalize if it is OK. In general this will have some small performance penalty (the usual cost of allowing Missing in eltype).

This is WIP to show the changes so that pools can be shared between `PooledArray`s allowing and disallowing missing values. It seems we can have them without much complications.
@@ -30,13 +30,13 @@ end

mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N}
refs::RA
pool::Vector{T}
invpool::Dict{T,R}
pool::Vector{Union{Missing,T}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of think we should just keep this T; and let Union{Missing, T} be provided as the T. That way if you don't have missing values, you're not penalized. It seems like to so simillar_missing, we could provide a function that just re-wraps the refs, pool, invpool, etc in a new PooledArray struct w/ Union{T, Missing}, right?

But maybe you're right that it would be too expensive to have to change pool::Vector{T} and invpool::Dict{T, R} to support missing as well in the similar_missing operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to avoid having to pay the following cost:

julia> y = Dict{Int, Int}(1:10^8 .=> 1:10^8);

julia> @time convert(Dict{Union{Int, Missing}, Int}, y);
  3.684057 seconds (253.19 k allocations: 4.514 GiB, 6.35% gc time, 2.88% compilation time)

(it would also reduce the memory burden of making this operation)

With this PR it would be avoided.
My original preference was different, i.e. to never allow Missing in the refpool and invrefpool but have 0 to be a ref that is interpreted as missing without making a lookup, but @nalimilan thought it would be overly complex. That is why I have made a draft here to discuss the design.

This would have a consequence similar to what we have in CategoricalArrays.jl:

julia> x = categorical(["a"])
1-element CategoricalArray{String,1,UInt32}:
 "a"

julia> resize!(x, 2)
2-element CategoricalArray{String,1,UInt32}:
    "a"
 #undef

julia> convert(CategoricalArray{Union{String,Missing},1,UInt32}, x)
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "a"
 missing

(so magically #undef becomes missing which is probably undesirable)

There is no rush to make a decision here. Let us think and decide what is best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah neither solution is simple. Maybe it would be possible to make convert faster for dicts when the target eltype is a supertype of the source eltype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this has been an open issue in Julia Base for a long time. Who do you think is best to ask?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. I guess we (you :-p) would have to come up with at least a draft PR to get people's attention. But this may be quite easy to do. convert calls this constructor when the origin type is the same as the destination:
https://github.com/JuliaLang/julia/blob/be6887fa9ffa3d08315c93f1551c41b44dd3d720/base/dict.jl#L92-L95
I wonder whether what's needed is just to call convert on the keys and value when the types don't match. What do you think?

Assuming that works, do you think it would be an acceptable replacement for this PR or would it still not be sufficient?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we loose COW benefits. However, since we are not fully convinced what is best I would park this PR for now, so that @quinnj can finish CSV.jl update without this functionality (which is more pressing). We can go back to this PR and decide what to do later (especially that if we change the internals probably 2.0 release would be needed).

And maybe in the mean time what @nalimilan proposes to do in Julia Base would be implemented (I cannot do it as I do not know how to do it unfortunately as I know too little about internal design related to JuliaLang/julia#26681)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid the best person to implement no-copy convert for Array{Union{T, Missing}} is also @quinnj! :-D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinnj - if you do not have time for this (which I assume is the case) - can you please point me to the part of the sources that would have to be changed? (I assume it is somewhere in the C codes)
Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the relevant code is in https://github.com/JuliaLang/julia/blob/master/src/array.c. In the jl_array_typetagdata function, it returns the "selector bytes" for an isbits-union array. This is where we lengthen the requested array size to account for selectory bytes. We also have to ensure the selector bytes are always initialized to zero, since otherwise you could get selector bytes pointing to something not just garbage, but totally wrong.

So the operation we'd need is convert(::Type{Vector{Union{T, S}}}, x::Vector{T}) where {T, S}, right? And we're ok if the two arrays potentially share data? So it seems we would add a check on the julia-side for if T is isbitstype and S is isbitstype, and if so, call a new array.c routine.

This is where it gets a little hairy though, because we have array_resize_buffer, but that takes an existing array to resize it. We'd need to use some of the same logic in that routine and make a new jl_convert_to_isbitsunion method that:

  • asserts the input array eltype is isbitstype
  • asserts new desired Union eltype is also isbitstype
  • checks that the existing array data isn't a->flags.isshared (means we can't resize it; like an mmap buffer)
  • then probably called the same gc method as this line to resize existing array data to new size (which is just old array size + length(old_array))
  • then I think we could just call jl_ptr_to_array_1d with our realloc'd pointer? that will take care of making the new array and using the realloc'd data pointer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - I was afraid it would not be that simple. What we need is only convert(::Type{Vector{Union{T, Missing}}}, x::Vector{T}) where {T} but T may be any type (also non-bits, e.g. String). So I guess it would require separate paths for T being bits type and being non-bits type.

However, this in general shows the following problem. Assume that we have x and y sharing the same data. Where eltype of x is T and of y is Union{Missing, T}. What will happen to x if we do y[1] = missing? My intuition that this operation is super problematic - right? (i.e. would lead to an undefined state of x[1])

So the conclusion is that these arrays may not share data. Therefore we have an easier task of adding an internal constructor in PooledArrays.jl for Dict type that instead of:

function Dict{K,V}(kv) where V where K
    h = Dict{K,V}()
    haslength(kv) && sizehint!(h, Int(length(kv))::Int)
    for (k,v) in kv
        h[k] = v
    end
    return h
end

would copy an internal structure of source Dict and only do promotion of eltype of keys field in the source dict.

So we should have a function like:

function widen_dict_missing(d::Dict{K,V}) where {K, V}
    @assert !(Missing <: K) # this is handled otherwise in PooledArrays.jl
    K′ = Union{K, Missing}
    return Dict{K′,V}(copy(d.slots), Vector{K′}(d.keys), copy(d.vals),
                      d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)
end

and the performance would be:

julia> d = Dict(1:10^7 .=> 1:10^7);

julia> @time Dict{Union{Int,Missing}, Int}(d);
  0.282199 seconds (7 allocations: 288.001 MiB)

julia> @time Dict{Union{Int,Missing}, Int}(d);
  0.301356 seconds (7 allocations: 288.001 MiB)

julia> @time Dict{Union{Int,Missing}, Int}(d);
  0.313754 seconds (7 allocations: 288.001 MiB)

julia> @time widen_dict_missing(d);
  0.080447 seconds (7 allocations: 288.000 MiB)

julia> @time widen_dict_missing(d);
  0.078461 seconds (7 allocations: 288.000 MiB)

julia> @time widen_dict_missing(d);
  0.085039 seconds (7 allocations: 288.000 MiB)

the benefit would be ~3x speedup.

@@ -448,7 +468,8 @@ Base.convert(::Type{Array}, pa::PooledArray{T, R, N}) where {T, R, N} = convert(

Base.@propagate_inbounds function Base.getindex(A::PooledArray, I::Int)
idx = DataAPI.refarray(A)[I]
iszero(idx) && throw(UndefRefError())
# if eltype(A) does not allow Missing then also 1 is an error
idx <= !(eltype(A) <: Missing) && throw(UndefRefError())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think something like the following should be faster (as it avoids the core subtyping algorithm):

@assert Base.nonmissingtype(T) === T || !isone(idx)

I think I like doing an @assert here instead since if idx == 1, then it means something is corrupt, not #undef.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this also. I wanted to squeeze it into one check (to reduce the cost of getindex).

However, first we should decide if we go for the design I proposed here or the design similar to what CategoricalArrays.jl does (as the other approach does not have this issue).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other approach is having 0 ref be missing, rigth?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The issue is the following:

julia> x = categorical(["a", "b"])
2-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"

julia> y = categorical(["a", missing])
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "a"
 missing

julia> xs = similar(x)
2-element CategoricalArray{String,1,UInt32}:
 #undef
 #undef

julia> ys = similar(y)
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
 missing
 missing

so this would be breaking (i.e. it is impossible to have #undef in CategoricalArray if its eltype allows Missing).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then you have:

julia> DataAPI.refpool(ys)
1-element CategoricalArrays.CategoricalRefPool{Union{Missing, CategoricalValue{String, UInt32}}, CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}} with indices 0:0:
 missing

which means one needs a custom type to handle non 1-based indexing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, I don't think that's awful behavior, but indeed a bit inconsistent. I mean, not letting users get #undef isn't the worst, IMO, but it does feel like cheating a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we'd have a similar problem with DataAPI.refpool on PooledArray w/ no missing values, right? Because pool[1] would be missing, even though the PooledArray doesn't allow missings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because pool[1] would be missing, even though the PooledArray doesn't allow missings?

The option is to have a custom refpool type (like CategoricalArrays.jl)

@quinnj
Copy link
Member

quinnj commented May 13, 2021

Hmmm, this approach may complicate construction a bit; for example:

function _invert(d::Dict{K,V}) where {K,V}
    d1 = Vector{K}(undef, length(d))
    for (k, v) in d
        d1[v] = k
    end
    return d1
end

it doesn't seem like we can rely on length(d) here reliably. i.e. they may pass in d that doesn't have missing in it. Or their ref codes might start at 1 even though that's reserved.

@bkamins
Copy link
Member Author

bkamins commented May 13, 2021

You mean when one would use PooledArray{T,R,N,RA}(rs::RefArray{RA}, invpool::Dict{T, R}) constructor?

I always thought it was kind of reserved for internal use only

@quinnj
Copy link
Member

quinnj commented May 13, 2021

Well, internal use and CSV.read 😛 . I think there may be other cases, like data ingestion, where this should be considered: i.e. the invpool is manually accumulated with ref codes, then at the end PooledArray is officially constructed. But that may still be unique to CSV.jl where performance is so critical; i.e. it would be too costly to rely on PooledArray setindex! in hot parsing loop.

I can ensure CSV.read does the right thing, so maybe not that big of a concern.

@quinnj
Copy link
Member

quinnj commented May 20, 2021

Ok, I'm finishing up the CSV refactor and I think the approach in this PR would work well for how things look on the CSV.jl side. That is, it's nice to have missing reserved as ref 1 as it simplifies a lot of column promotion stuff. It's also nice to just allocate the refpool as Dict{Union{String, Missing}, UInt32} and not have to try and convert it to non-missing at the end if no missings were parsed.

@bkamins
Copy link
Member Author

bkamins commented May 20, 2021

But is this PR and release of PooledArrays.jl then required for your changes, or it is OK to do it later? (I am asking, because @nalimilan was not convinced which approach would be best)

@bkamins
Copy link
Member Author

bkamins commented May 22, 2021

@quinnj - given you work with this branch I will finalize this PR so that it is not DRAFT. However, I still do not assume we would merge it, as we need to make sure this is a sound design (and I understand the reservations that @nalimilan has - in particular I will have to check if we do not have significant performance regressions anywhere).

@bkamins
Copy link
Member Author

bkamins commented May 22, 2021

OK. There are two issues with the approach proposed by me in this PR:

  1. map on PooledArray is problematic (this is fixable probably)
  2. in DataFrames.jl if PooledArray does not allow Missing (but we "technically" allow it) then groupby will be ~3 times slower, as we have to recompute the groupings (we need to drop the unused level that is occupied by missing that cannot be used)

@quinnj
Copy link
Member

quinnj commented May 23, 2021

Ok, and the alternative approach would be to have ref of 0 be treated as missing, like CategoricalArrays. Or perhaps some other creative approach like having a "global" pool that all PooledArrays could share or something like that.

@bkamins
Copy link
Member Author

bkamins commented May 23, 2021

The alternatives are (and I comment on the consequences):

  1. Use a fast copy to allow Missing - this is what @nalimilan preferred: it will then introduce some cost of copying
  2. Use 0 to signal missing like in CategoricalArray: this has two consequences: a) more complex code (but we know how to do it - it is already done in CategoricalArrays.jl), b) the #undef inconsistency between bits and non bits type
  3. Use global pool: actually this approach has the same problems that we have here, i.e.:
    a. it will be hard to have map implemented correctly
    b. groupby will be slow (here we have a tension between join and groupby speed - @nalimilan maybe we can think of a better way to drop unused levels in groupby?)

@nalimilan
Copy link
Member

@nalimilan maybe we can think of a better way to drop unused levels in groupby?)

If missing is in the first position and it's the only unused level, then I guess just subtracting 1 to group indices would be enough. But that requires a special path in DataFrames, and overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.

@bkamins
Copy link
Member Author

bkamins commented May 23, 2021

then I guess just subtracting 1 to group indices would be enough.

Yes - I have thought about it but it would require that in DataFrames.jl we:

  1. check the version of PooledArrays.jl used
  2. have a two special paths that are PooledArrays.jl specific

That is why I would be hesitant to do it

overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.

Unfortunately this is my general conclusion also (and that we should just use a faster way to create a new refpool/invrefpool). @quinnj - would you be OK in CSV.jl with not having this change?

@matthieugomez
Copy link

matthieugomez commented Sep 1, 2021

Not related to this particular user case, but I'd love it to see missing values encoded by a reference of zero. It makes it easier to obtain efficient code since pool elements can then share the same concrete type (I guess that's also why SentinelArrays was developed no?). And it is consistent with CategoricalArrays and the ref array used for grouping operations in DataFrames.

@nilshg
Copy link

nilshg commented Nov 17, 2023

This is quite a long discussion, apologies for not having read all but Bogumil pointed me here from a Slack thread where a user reported a performance hit in outerjoin from pooling which is probably related:

https://discourse.julialang.org/t/performance-report-effect-of-reading-csv-file-on-mergeing-two-dataframes/106304/9?u=nilshg

Here on a relatively small data set (~60k rows) with 3 pooled columns (String1, String7, String15) we find a 3x regression when outerjoining.

@bkamins
Copy link
Member Author

bkamins commented Nov 17, 2023

Yes, there are two issues as I explain in https://bkamins.github.io/julialang/2023/11/17/pooling.html:

  • pooling a value that already has a small memory footprint (like String1) is inefficient;
  • doing outerjoin changes pool and invpool as it requires allowing missing in them, which requires their re-creation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants