Improving inference for NamedTuple with Union{T, Missing} fields #25925

nalimilan · 2018-02-07T13:26:58Z

(This issue has been discussed at length at JuliaData/Missings.jl#6, but since then changes have been applied to Base regarding the handling of Union{T, Missing}. I think it's the last essential change needed to handle missing values efficiently, and to be able to port Query.jl to the Missing type. See also this Discourse thread. Cc: @davidanthoff)

Since #25828, map and broadcast are relatively fast with Union{T, Missing} element types. However they still widen the element type progressively as new types are encountered, which implies copying the already processed values to an array with a wider element type. For collections of Union{T, Missing} elements, this means at most one copy will be made. But for collections of NamedTuple with Union{T, Missing} fields, a copy will have to be made every time a missing value is encountered for a given field, implying up to ~~2^N-1~~ (EDIT: actually, N-1) copies (with N the number of fields). Clearly that's not going to fly.

A solution to this problem would be to use inference to determine the element type in advance, and fall back to the current approach when the inferred type isn't useful (Any, or other abstract type which is considered as not precise enough). While this would work already for collections with Union{T, Missing} elements, it needs more changes to work with more complex cases, notably NamedTuple with Union{T, Missing} fields (but also any parametric type).
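To make the idea concrete, here is a minimal sketch of "infer first, fall back otherwise". The names `is_useful` and `map_prealloc` are hypothetical, the precision test is only one possible choice, and `Base.uniontypes` is an unexported Base function; this is not how `map` is implemented in Base.

```julia
# Hypothetical precision test: accept concrete types and small unions of
# concrete types (e.g. Union{Int, Missing}); reject Any and other abstract types.
is_useful(T) = isconcretetype(T) ||
    (T isa Union && all(isconcretetype, Base.uniontypes(T)))

# Hypothetical helper: allocate the destination once when inference gives a
# usable element type, otherwise fall back to the widening-based `map`.
function map_prealloc(f, xs::Vector)
    T = Core.Compiler.return_type(f, Tuple{eltype(xs)})
    if is_useful(T)
        dest = Vector{T}(undef, length(xs))
        for i in eachindex(xs)
            dest[i] = f(xs[i])   # never needs a wider array
        end
        return dest
    else
        return map(f, xs)        # current behaviour: widen as values arrive
    end
end

xs = Union{Int, Missing}[1, missing, 3]
ys = map_prealloc(x -> x + 1, xs)
```

With a concrete NamedTuple type whose fields are Union{T, Missing}, `is_useful` also returns true, so the same path would apply once inference reports that type instead of Any.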

Let's take a concrete case which is relatively simple: a generator over a vector of NamedTuple elements which returns NamedTuple elements (possibly with some transformations in the middle, but better start without). We would like return_type for first to give a narrower upper bound for the element type than just Any: e.g. NamedTuple{(:a, :b), Tuple{<:Union{Int,Missing}, <:Union{Float64,Missing}}}. The code below shows that we're not very far from this, since the types of the fields are correctly inferred. But when combining the tuple type for the fields with the names into a NamedTuple type, the information is discarded and we get Core.SSAValue(48)::Type{NamedTuple{(:a, :b),_1}} where _1.

I'd appreciate any pointers regarding changes that are necessary to turn the where _1 part into where _1<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}.

julia> x = NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}[(a=1, b=2.0), (a=missing, b=missing), (a=missing, b=3.0), (a=2, b=missing)];

julia> Core.Compiler.return_type(first, Tuple{typeof(x)})
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}

julia> itr1 = ((i.a, i.b) for i in x);

# OK: precise, thanks to tuple covariance
julia> Core.Compiler.return_type(first, Tuple{typeof(itr1)})
Tuple{Union{Missing, Int64},Union{Missing, Float64}}

julia> itr2 = ((a=i.a, b=i.b) for i in x);

# Not OK: inferred as Any
julia> Core.Compiler.return_type(first, Tuple{typeof(itr2)})
Any

julia> @code_warntype first(itr2)
[...]
      Core.SSAValue(45) = (Base.getfield)(Core.SSAValue(29), :a)::Union{Missing, Int64}
      # meta: pop location
      # meta: location sysimg.jl getproperty 8
      Core.SSAValue(46) = (Base.getfield)(Core.SSAValue(29), :b)::Union{Missing, Float64}
      # meta: pop location
      Core.SSAValue(43) = (Core.tuple)(Core.SSAValue(45), Core.SSAValue(46))::Tuple{Union{Missing, Int64},Union{Missing, Float64}}
      # meta: location boot.jl Type 498
      Core.SSAValue(47) = (Core.typeof)(Core.SSAValue(43))::Type{#s54} where #s54<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}
      Core.SSAValue(48) = (Core.apply_type)(Core.NamedTuple, (:a, :b), Core.SSAValue(47))::Type{NamedTuple{(:a, :b),_1}} where _1
      Core.SSAValue(49) = (Core.SSAValue(48))(Core.SSAValue(43))::Any
      # meta: pop locations (3)
      return Core.SSAValue(49)
  end::Any
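As a stopgap, and only as a sketch, the element type can be spelled out by hand so that the result no longer depends on what inference derives from `(a=..., b=...)`. The alias `Row` and the generator `itr3` are hypothetical names introduced here, and I make no claim about what `return_type` reports for this variant.

```julia
# `Row` is a hypothetical alias for the fully specified element type.
const Row = NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}

x = Row[(a=1, b=2.0), (a=missing, b=missing), (a=missing, b=3.0), (a=2, b=missing)]

# Construct each element with the explicit type instead of letting the
# field types be computed from the values.
itr3 = (Row((i.a, i.b)) for i in x)

# Every element already has type Row, so collecting never has to widen.
y = collect(itr3)
```

This obviously does not scale to user code where the field types are not known in advance, which is why the inference fix is still needed.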


vtjnash · 2018-02-20T14:58:41Z

> every time a missing value is encountered for a given field, implying up to 2^N-1 copies (with N the number of fields).

The copy is only made once for each column, so it should be approximately O(k) copies, where k is the number of columns. The value O(2^k) (only the highest order term is significant, so we drop the -1) represents instead the amount of work that needs to be done to compile this ahead-of-time and address the JIT pause.
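The per-column copy count can be illustrated with a hand-written widening loop. This is a sketch under stated assumptions: `widen_row` and `collect_counting_copies` are hypothetical helpers that widen field by field, and they are not the code path `map` or `collect` use in Base.

```julia
# Hypothetical helper: widen a NamedTuple type field by field so that it can
# also hold values of a second NamedTuple type with the same field names.
function widen_row(::Type{NamedTuple{names,T}},
                   ::Type{NamedTuple{names,S}}) where {names,T<:Tuple,S<:Tuple}
    fields = ntuple(i -> Union{fieldtype(T, i), fieldtype(S, i)}, length(names))
    return NamedTuple{names,Tuple{fields...}}
end

# Hypothetical helper: collect rows, widening on demand, and count the copies.
function collect_counting_copies(rows)
    dest = [first(rows)]   # element type is the concrete type of the first row
    ncopies = 0
    for row in Iterators.drop(rows, 1)
        W = widen_row(eltype(dest), typeof(row))
        if W != eltype(dest)
            # The row does not fit: copy everything into a wider array.
            dest = copyto!(Vector{W}(undef, length(dest)), dest)
            ncopies += 1
        end
        push!(dest, row)
    end
    return dest, ncopies
end

# `missing` first shows up in column a, then later in column b.
rows = ((a=1, b=2.0), (a=missing, b=3.0), (a=2, b=missing))
dest, ncopies = collect_counting_copies(rows)
```

Here each column triggers one copy the first time it sees a `missing`, so the two columns give two copies. Once a column has been widened, later missing values in it cost nothing extra.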

vtjnash · 2018-02-20T15:06:29Z

> changes that are necessary to turn

In base/compiler/tfuncs.jl, for the apply_type_tfunc, it's marked with:

       # These blocks improve type info but make compilation a bit slower.

nalimilan · 2018-02-20T15:42:18Z

Thanks for the pointers, but what is this supposed to do in practice? I've tried uncommenting these lines, but this doesn't appear to make any difference to the NamedTuple example I posted above.

nalimilan added the compiler:inference (Type inference) and missing data (Base.missing and related functionality) labels Feb 7, 2018

nalimilan mentioned this issue Feb 7, 2018

Use promote_typejoin for Tuple and NamedTuple promotion #25924

Closed

nalimilan mentioned this issue Feb 19, 2018

ensure promotion rules don't alter eltype values #26109

Merged

piever mentioned this issue Feb 24, 2018

Don't make join output DataValueArray unless there are NAs JuliaData/IndexedTables.jl#121

Merged

simonbyrne mentioned this issue Feb 15, 2019

type promotion of missing inside tuples #31077

Open
