Always keep missing in pool #65

bkamins · 2021-04-30T07:24:05Z

This is WIP to show the changes so that pools can be shared between PooledArrays allowing and disallowing missing values. It seems we can have them without much complications.

Please comment if we would be OK with this design. Then I will finalize if it is OK. In general this will have some small performance penalty (the usual cost of allowing Missing in eltype).

This is WIP to show the changes so that pools can be shared between `PooledArray`s allowing and disallowing missing values. It seems we can have them without much complications.

quinnj · 2021-05-02T04:50:33Z

src/PooledArrays.jl

@@ -30,13 +30,13 @@ end

 mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N}
    refs::RA
-    pool::Vector{T}
-    invpool::Dict{T,R}
+    pool::Vector{Union{Missing,T}}


I kind of think we should just keep this T; and let Union{Missing, T} be provided as the T. That way if you don't have missing values, you're not penalized. It seems like to so simillar_missing, we could provide a function that just re-wraps the refs, pool, invpool, etc in a new PooledArray struct w/ Union{T, Missing}, right?

But maybe you're right that it would be too expensive to have to change pool::Vector{T} and invpool::Dict{T, R} to support missing as well in the similar_missing operation.

I am trying to avoid having to pay the following cost:

julia> y = Dict{Int, Int}(1:10^8 .=> 1:10^8); julia> @time convert(Dict{Union{Int, Missing}, Int}, y); 3.684057 seconds (253.19 k allocations: 4.514 GiB, 6.35% gc time, 2.88% compilation time)

(it would also reduce the memory burden of making this operation)

With this PR it would be avoided.
My original preference was different, i.e. to never allow Missing in the refpool and invrefpool but have 0 to be a ref that is interpreted as missing without making a lookup, but @nalimilan thought it would be overly complex. That is why I have made a draft here to discuss the design.

This would have a consequence similar to what we have in CategoricalArrays.jl:

julia> x = categorical(["a"]) 1-element CategoricalArray{String,1,UInt32}: "a" julia> resize!(x, 2) 2-element CategoricalArray{String,1,UInt32}: "a" #undef julia> convert(CategoricalArray{Union{String,Missing},1,UInt32}, x) 2-element CategoricalArray{Union{Missing, String},1,UInt32}: "a" missing

(so magically #undef becomes missing which is probably undesirable)

There is no rush to make a decision here. Let us think and decide what is best.

Yeah neither solution is simple. Maybe it would be possible to make convert faster for dicts when the target eltype is a supertype of the source eltype?

AFAICT this has been an open issue in Julia Base for a long time. Who do you think is best to ask?

No idea. I guess we (you :-p) would have to come up with at least a draft PR to get people's attention. But this may be quite easy to do. convert calls this constructor when the origin type is the same as the destination:
https://github.com/JuliaLang/julia/blob/be6887fa9ffa3d08315c93f1551c41b44dd3d720/base/dict.jl#L92-L95
I wonder whether what's needed is just to call convert on the keys and value when the types don't match. What do you think?

Assuming that works, do you think it would be an acceptable replacement for this PR or would it still not be sufficient?

But then we loose COW benefits. However, since we are not fully convinced what is best I would park this PR for now, so that @quinnj can finish CSV.jl update without this functionality (which is more pressing). We can go back to this PR and decide what to do later (especially that if we change the internals probably 2.0 release would be needed).

And maybe in the mean time what @nalimilan proposes to do in Julia Base would be implemented (I cannot do it as I do not know how to do it unfortunately as I know too little about internal design related to JuliaLang/julia#26681)

I'm afraid the best person to implement no-copy convert for Array{Union{T, Missing}} is also @quinnj! :-D

@quinnj - if you do not have time for this (which I assume is the case) - can you please point me to the part of the sources that would have to be changed? (I assume it is somewhere in the C codes)
Thank you!

All the relevant code is in https://github.com/JuliaLang/julia/blob/master/src/array.c. In the jl_array_typetagdata function, it returns the "selector bytes" for an isbits-union array. This is where we lengthen the requested array size to account for selectory bytes. We also have to ensure the selector bytes are always initialized to zero, since otherwise you could get selector bytes pointing to something not just garbage, but totally wrong.

So the operation we'd need is convert(::Type{Vector{Union{T, S}}}, x::Vector{T}) where {T, S}, right? And we're ok if the two arrays potentially share data? So it seems we would add a check on the julia-side for if T is isbitstype and S is isbitstype, and if so, call a new array.c routine.

This is where it gets a little hairy though, because we have array_resize_buffer, but that takes an existing array to resize it. We'd need to use some of the same logic in that routine and make a new jl_convert_to_isbitsunion method that:

asserts the input array eltype is isbitstype

asserts new desired Union eltype is also isbitstype

checks that the existing array data isn't a->flags.isshared (means we can't resize it; like an mmap buffer)

then probably called the same gc method as this line to resize existing array data to new size (which is just old array size + length(old_array))

then I think we could just call jl_ptr_to_array_1d with our realloc'd pointer? that will take care of making the new array and using the realloc'd data pointer

Yeah - I was afraid it would not be that simple. What we need is only convert(::Type{Vector{Union{T, Missing}}}, x::Vector{T}) where {T} but T may be any type (also non-bits, e.g. String). So I guess it would require separate paths for T being bits type and being non-bits type.

However, this in general shows the following problem. Assume that we have x and y sharing the same data. Where eltype of x is T and of y is Union{Missing, T}. What will happen to x if we do y[1] = missing? My intuition that this operation is super problematic - right? (i.e. would lead to an undefined state of x[1])

So the conclusion is that these arrays may not share data. Therefore we have an easier task of adding an internal constructor in PooledArrays.jl for Dict type that instead of:

function Dict{K,V}(kv) where V where K h = Dict{K,V}() haslength(kv) && sizehint!(h, Int(length(kv))::Int) for (k,v) in kv h[k] = v end return h end

would copy an internal structure of source Dict and only do promotion of eltype of keys field in the source dict.

So we should have a function like:

function widen_dict_missing(d::Dict{K,V}) where {K, V} @assert !(Missing <: K) # this is handled otherwise in PooledArrays.jl K′ = Union{K, Missing} return Dict{K′,V}(copy(d.slots), Vector{K′}(d.keys), copy(d.vals), d.ndel, d.count, d.age, d.idxfloor, d.maxprobe) end

and the performance would be:

julia> d = Dict(1:10^7 .=> 1:10^7); julia> @time Dict{Union{Int,Missing}, Int}(d); 0.282199 seconds (7 allocations: 288.001 MiB) julia> @time Dict{Union{Int,Missing}, Int}(d); 0.301356 seconds (7 allocations: 288.001 MiB) julia> @time Dict{Union{Int,Missing}, Int}(d); 0.313754 seconds (7 allocations: 288.001 MiB) julia> @time widen_dict_missing(d); 0.080447 seconds (7 allocations: 288.000 MiB) julia> @time widen_dict_missing(d); 0.078461 seconds (7 allocations: 288.000 MiB) julia> @time widen_dict_missing(d); 0.085039 seconds (7 allocations: 288.000 MiB)

the benefit would be ~3x speedup.

quinnj · 2021-05-12T21:52:33Z

src/PooledArrays.jl

@@ -448,7 +468,8 @@ Base.convert(::Type{Array}, pa::PooledArray{T, R, N}) where {T, R, N} = convert(

 Base.@propagate_inbounds function Base.getindex(A::PooledArray, I::Int)
    idx = DataAPI.refarray(A)[I]
-    iszero(idx) && throw(UndefRefError())
+    # if eltype(A) does not allow Missing then also 1 is an error
+    idx <= !(eltype(A) <: Missing) && throw(UndefRefError())


Hmmm, I think something like the following should be faster (as it avoids the core subtyping algorithm):

@assert Base.nonmissingtype(T) === T || !isone(idx)

I think I like doing an @assert here instead since if idx == 1, then it means something is corrupt, not #undef.

We could do this also. I wanted to squeeze it into one check (to reduce the cost of getindex).

However, first we should decide if we go for the design I proposed here or the design similar to what CategoricalArrays.jl does (as the other approach does not have this issue).

The other approach is having 0 ref be missing, rigth?

Right. The issue is the following:

julia> x = categorical(["a", "b"]) 2-element CategoricalArray{String,1,UInt32}: "a" "b" julia> y = categorical(["a", missing]) 2-element CategoricalArray{Union{Missing, String},1,UInt32}: "a" missing julia> xs = similar(x) 2-element CategoricalArray{String,1,UInt32}: #undef #undef julia> ys = similar(y) 2-element CategoricalArray{Union{Missing, String},1,UInt32}: missing missing

so this would be breaking (i.e. it is impossible to have #undef in CategoricalArray if its eltype allows Missing).

and then you have:

julia> DataAPI.refpool(ys) 1-element CategoricalArrays.CategoricalRefPool{Union{Missing, CategoricalValue{String, UInt32}}, CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}} with indices 0:0: missing

which means one needs a custom type to handle non 1-based indexing.

Hmmmm, I don't think that's awful behavior, but indeed a bit inconsistent. I mean, not letting users get #undef isn't the worst, IMO, but it does feel like cheating a bit.

I feel like we'd have a similar problem with DataAPI.refpool on PooledArray w/ no missing values, right? Because pool[1] would be missing, even though the PooledArray doesn't allow missings?

Because pool[1] would be missing, even though the PooledArray doesn't allow missings?

The option is to have a custom refpool type (like CategoricalArrays.jl)

quinnj · 2021-05-13T18:40:57Z

Hmmm, this approach may complicate construction a bit; for example:

function _invert(d::Dict{K,V}) where {K,V}
    d1 = Vector{K}(undef, length(d))
    for (k, v) in d
        d1[v] = k
    end
    return d1
end

it doesn't seem like we can rely on length(d) here reliably. i.e. they may pass in d that doesn't have missing in it. Or their ref codes might start at 1 even though that's reserved.

bkamins · 2021-05-13T18:46:54Z

You mean when one would use PooledArray{T,R,N,RA}(rs::RefArray{RA}, invpool::Dict{T, R}) constructor?

I always thought it was kind of reserved for internal use only

quinnj · 2021-05-13T18:51:14Z

Well, internal use and CSV.read 😛 . I think there may be other cases, like data ingestion, where this should be considered: i.e. the invpool is manually accumulated with ref codes, then at the end PooledArray is officially constructed. But that may still be unique to CSV.jl where performance is so critical; i.e. it would be too costly to rely on PooledArray setindex! in hot parsing loop.

I can ensure CSV.read does the right thing, so maybe not that big of a concern.

quinnj · 2021-05-20T20:43:52Z

Ok, I'm finishing up the CSV refactor and I think the approach in this PR would work well for how things look on the CSV.jl side. That is, it's nice to have missing reserved as ref 1 as it simplifies a lot of column promotion stuff. It's also nice to just allocate the refpool as Dict{Union{String, Missing}, UInt32} and not have to try and convert it to non-missing at the end if no missings were parsed.

bkamins · 2021-05-20T21:03:23Z

But is this PR and release of PooledArrays.jl then required for your changes, or it is OK to do it later? (I am asking, because @nalimilan was not convinced which approach would be best)

bkamins · 2021-05-22T21:23:37Z

@quinnj - given you work with this branch I will finalize this PR so that it is not DRAFT. However, I still do not assume we would merge it, as we need to make sure this is a sound design (and I understand the reservations that @nalimilan has - in particular I will have to check if we do not have significant performance regressions anywhere).

bkamins · 2021-05-22T22:12:06Z

OK. There are two issues with the approach proposed by me in this PR:

map on PooledArray is problematic (this is fixable probably)
in DataFrames.jl if PooledArray does not allow Missing (but we "technically" allow it) then groupby will be ~3 times slower, as we have to recompute the groupings (we need to drop the unused level that is occupied by missing that cannot be used)

quinnj · 2021-05-23T03:50:02Z

Ok, and the alternative approach would be to have ref of 0 be treated as missing, like CategoricalArrays. Or perhaps some other creative approach like having a "global" pool that all PooledArrays could share or something like that.

bkamins · 2021-05-23T07:09:15Z

The alternatives are (and I comment on the consequences):

Use a fast copy to allow Missing - this is what @nalimilan preferred: it will then introduce some cost of copying
Use 0 to signal missing like in CategoricalArray: this has two consequences: a) more complex code (but we know how to do it - it is already done in CategoricalArrays.jl), b) the #undef inconsistency between bits and non bits type
Use global pool: actually this approach has the same problems that we have here, i.e.:
a. it will be hard to have map implemented correctly
b. groupby will be slow (here we have a tension between join and groupby speed - @nalimilan maybe we can think of a better way to drop unused levels in groupby?)

nalimilan · 2021-05-23T20:55:04Z

@nalimilan maybe we can think of a better way to drop unused levels in groupby?)

If missing is in the first position and it's the only unused level, then I guess just subtracting 1 to group indices would be enough. But that requires a special path in DataFrames, and overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.

bkamins · 2021-05-23T21:48:11Z

then I guess just subtracting 1 to group indices would be enough.

Yes - I have thought about it but it would require that in DataFrames.jl we:

check the version of PooledArrays.jl used
have a two special paths that are PooledArrays.jl specific

That is why I would be hesitant to do it

overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.

Unfortunately this is my general conclusion also (and that we should just use a faster way to create a new refpool/invrefpool). @quinnj - would you be OK in CSV.jl with not having this change?

matthieugomez · 2021-09-01T14:21:27Z

Not related to this particular user case, but I'd love it to see missing values encoded by a reference of zero. It makes it easier to obtain efficient code since pool elements can then share the same concrete type (I guess that's also why SentinelArrays was developed no?). And it is consistent with CategoricalArrays and the ref array used for grouping operations in DataFrames.

nilshg · 2023-11-17T08:35:11Z

This is quite a long discussion, apologies for not having read all but Bogumil pointed me here from a Slack thread where a user reported a performance hit in outerjoin from pooling which is probably related:

https://discourse.julialang.org/t/performance-report-effect-of-reading-csv-file-on-mergeing-two-dataframes/106304/9?u=nilshg

Here on a relatively small data set (~60k rows) with 3 pooled columns (String1, String7, String15) we find a 3x regression when outerjoining.

bkamins · 2023-11-17T09:08:23Z

Yes, there are two issues as I explain in https://bkamins.github.io/julialang/2023/11/17/pooling.html:

pooling a value that already has a small memory footprint (like String1) is inefficient;
doing outerjoin changes pool and invpool as it requires allowing missing in them, which requires their re-creation.

Always keep missing in pool

858efc0

This is WIP to show the changes so that pools can be shared between `PooledArray`s allowing and disallowing missing values. It seems we can have them without much complications.

bkamins mentioned this pull request Apr 30, 2021

Redesing of PooledArray internals #64

Open

quinnj reviewed May 2, 2021

View reviewed changes

quinnj reviewed May 12, 2021

View reviewed changes

A few fixes

5ecc0bd

partial fix of the PR

e979ea6

bkamins mentioned this pull request Sep 6, 2021

implement reference offset #72

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always keep missing in pool #65

Always keep missing in pool #65

bkamins commented Apr 30, 2021

quinnj May 2, 2021

bkamins May 2, 2021

nalimilan May 2, 2021

bkamins May 2, 2021

nalimilan May 14, 2021

bkamins May 14, 2021

nalimilan May 17, 2021

bkamins May 17, 2021

quinnj May 17, 2021

bkamins May 17, 2021

quinnj May 12, 2021

bkamins May 13, 2021

quinnj May 13, 2021

bkamins May 13, 2021

bkamins May 13, 2021

quinnj May 13, 2021

quinnj May 13, 2021

bkamins May 13, 2021

quinnj commented May 13, 2021

bkamins commented May 13, 2021

quinnj commented May 13, 2021

quinnj commented May 20, 2021

bkamins commented May 20, 2021

bkamins commented May 22, 2021

bkamins commented May 22, 2021

quinnj commented May 23, 2021

bkamins commented May 23, 2021

nalimilan commented May 23, 2021

bkamins commented May 23, 2021

matthieugomez commented Sep 1, 2021 •

edited

Loading

nilshg commented Nov 17, 2023

bkamins commented Nov 17, 2023

Always keep missing in pool #65

Are you sure you want to change the base?

Always keep missing in pool #65

Conversation

bkamins commented Apr 30, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

quinnj commented May 13, 2021

bkamins commented May 13, 2021

quinnj commented May 13, 2021

quinnj commented May 20, 2021

bkamins commented May 20, 2021

bkamins commented May 22, 2021

bkamins commented May 22, 2021

quinnj commented May 23, 2021

bkamins commented May 23, 2021

nalimilan commented May 23, 2021

bkamins commented May 23, 2021

matthieugomez commented Sep 1, 2021 • edited Loading

nilshg commented Nov 17, 2023

bkamins commented Nov 17, 2023

matthieugomez commented Sep 1, 2021 •

edited

Loading