You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
(This issue as been discussed at length at JuliaData/Missings.jl#6, but since then changes have been applied to Base regarding the handling of Union{T, Missing}. I think it's the last essential change needed to handle missing values efficiently, and to be able to port Query.jl to the Missing type. See also this Discourse thread. Cc: @davidanthoff)
Since #25828, map and broadcast are relatively fast with Union{T, Missing} element types. However they still widen the element type progressively as new types are encountered, which implies copying the already processed values to an array with a wider element type. For collections of Union{T, Missing} elements, this means at most one copy will be made. But for collections of NamedTuple with Union{T, Missing} fields, a copy will have to be made every time a missing value is encountered for a given field, implying up to 2^N-1 (EDIT: actually, N-1) copies (with N the number of fields). Clearly that's not going to fly.
A solution to this problem would be to use inference to determine the element type in advance, and fall back to the current approach when the inferred type isn't useful (Any, or other abstract type which is considered as not precise enough). While this would work already for collections with Union{T, Missing} elements, it needs more changes to work with more complex cases, notably NamedTuple with Union{T, Missing} fields (but also any parametric type).
Let's take a concrete case which is relatively simple: a generator over a vector of NamedTuple elements which returns NamedTuple elements (possibly with some transformations in the middle, but better start without). We would like return_type for first to give a narrower upper bound for the element type than just Any: e.g. NamedTuple{(:a, :b), Tuple{<:Union{Int,Missing}, <:Union{Float64,Missing}}}. The code below shows that we're not very far from this, since the types of the fields are correctly inferred. But when combining the tuple type for the fields with the names into a NamedTuple type, the information is discarded and we get Core.SSAValue(48)::Type{NamedTuple{(:a, :b),_1}} where _1.
I'd appreciate any pointers regarding changes that are necessary to turn the where _1 part into where _1<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}.
julia> x = NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}[(a=1, b=2.0), (a=missing, b=missing), (a=missing, b=3.0), (a=2, b=missing)];
julia> Core.Compiler.return_type(first, Tuple{typeof(x)})
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}
julia> itr1 = ((i.a, i.b) for i in x);
# OK: precise, thanks to tuples covariance
julia> Core.Compiler.return_type(first, Tuple{typeof(itr1)})
Tuple{Union{Missing, Int64},Union{Missing, Float64}}
julia> itr2 = ((a=i.a, b=i.b) for i in x);
# No OK
julia> Core.Compiler.return_type(first, Tuple{typeof(itr2)})
Any
julia>@code_warntypefirst(itr2)
[...]
Core.SSAValue(45) = (Base.getfield)(Core.SSAValue(29), :a)::Union{Missing, Int64}# meta: pop location# meta: location sysimg.jl getproperty 8
Core.SSAValue(46) = (Base.getfield)(Core.SSAValue(29), :b)::Union{Missing, Float64}# meta: pop location
Core.SSAValue(43) = (Core.tuple)(Core.SSAValue(45), Core.SSAValue(46))::Tuple{Union{Missing, Int64},Union{Missing, Float64}}# meta: location boot.jl Type 498
Core.SSAValue(47) = (Core.typeof)(Core.SSAValue(43))::Type{#s54}where#s54<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}
Core.SSAValue(48) = (Core.apply_type)(Core.NamedTuple, (:a, :b), Core.SSAValue(47))::Type{NamedTuple{(:a, :b),_1}}where _1
Core.SSAValue(49) = (Core.SSAValue(48))(Core.SSAValue(43))::Any# meta: pop locations (3)return Core.SSAValue(49)
end::Any
The text was updated successfully, but these errors were encountered:
every time a missing value is encountered for a given field, implying up to 2^N-1 copies (with N the number of fields).
The copy is only made once for each column, so it should be approximately O(k) copies, where k is the number of columns. The value O(2^k) (only the highest order term is significant, so we drop the -1) represents instead the amount of work that needs to be done to compile this ahead-of-time and address the JIT pause.
Thanks for the pointers, but what is this supposed to do in practice? I've tried uncommenting these lines, but this doesn't appear to make any difference to the NamedTuple example I posted above.
(This issue as been discussed at length at JuliaData/Missings.jl#6, but since then changes have been applied to Base regarding the handling of
Union{T, Missing}
. I think it's the last essential change needed to handle missing values efficiently, and to be able to port Query.jl to theMissing
type. See also this Discourse thread. Cc: @davidanthoff)Since #25828,
map
andbroadcast
are relatively fast withUnion{T, Missing}
element types. However they still widen the element type progressively as new types are encountered, which implies copying the already processed values to an array with a wider element type. For collections ofUnion{T, Missing}
elements, this means at most one copy will be made. But for collections ofNamedTuple
withUnion{T, Missing}
fields, a copy will have to be made every time a missing value is encountered for a given field, implying up to2^N-1(EDIT: actually, N-1) copies (with N the number of fields). Clearly that's not going to fly.A solution to this problem would be to use inference to determine the element type in advance, and fall back to the current approach when the inferred type isn't useful (
Any
, or other abstract type which is considered as not precise enough). While this would work already for collections withUnion{T, Missing}
elements, it needs more changes to work with more complex cases, notablyNamedTuple
withUnion{T, Missing}
fields (but also any parametric type).Let's take a concrete case which is relatively simple: a generator over a vector of
NamedTuple
elements which returnsNamedTuple
elements (possibly with some transformations in the middle, but better start without). We would likereturn_type
forfirst
to give a narrower upper bound for the element type than justAny
: e.g.NamedTuple{(:a, :b), Tuple{<:Union{Int,Missing}, <:Union{Float64,Missing}}}
. The code below shows that we're not very far from this, since the types of the fields are correctly inferred. But when combining the tuple type for the fields with the names into aNamedTuple
type, the information is discarded and we getCore.SSAValue(48)::Type{NamedTuple{(:a, :b),_1}} where _1
.I'd appreciate any pointers regarding changes that are necessary to turn the
where _1
part intowhere _1<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}
.The text was updated successfully, but these errors were encountered: