Support mapreduce over dimensions with SkipMissing #28027

nalimilan · 2018-07-10T10:33:08Z

The first commit fixes things like mapreduce(cos, +, Union{Int,Missing}[1], dims=1). See #26709 and #27457. A deeper revamp is needed, but given that the current code doesn't make sense, that sounds like an improvement. EDIT: first commit moved to #28089.

The second commit adds an implementation of mapreduce over dimensions for SkipMissing{<:AbstractArray}, closely based on the existing AbstractArray methods. This allows things which currently don't work, such as mapreduce(identity, +, skipmissing([1 2; missing 4]), dims=1) to compute column sums while skipping missing values.
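For reference, the column sums mentioned above can already be obtained by reducing each slice separately. This is only a sketch of the intended result, not the implementation proposed in this PR; it assumes Julia 1.1 or later for eachcol, and it returns a plain Vector rather than the 1×2 array that a dims=1 reduction would produce.

```julia
A = [1 2; missing 4]

# Reduce each column on its own, skipping missing entries within the slice.
colsums = map(col -> sum(skipmissing(col)), eachcol(A))
```

The per-slice approach sidesteps the dims machinery entirely, which is why it works today but does not generalize to arbitrary dims arguments.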

Unfortunately, the simple idea I had of just wrapping the user-provided reduction operation in a function which skips missing values does not work, because _mapreducedim! calls mapreduce_impl for efficiency when possible, and it fails when a slice contains only missing values (cf. #27743). Also performance would probably be worse, at least for now.
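The wrapping idea can be sketched as follows. The helper name skipping is hypothetical and not part of Base, and the sketch uses foldl on a single slice, so it does not reproduce the mapreduce_impl failure described above; it only shows what the wrapper does and why an all-missing slice is awkward for it.

```julia
# Hypothetical helper (not in Base): wrap a binary operation so that a
# missing operand is ignored instead of propagating.
skipping(op) = (x, y) -> ismissing(x) ? y : ismissing(y) ? x : op(x, y)

plus = skipping(+)

# A slice with at least one non-missing value reduces as expected.
mixed = foldl(plus, [1, missing, 3])

# A slice with only missing values has nothing to fall back on:
# the wrapper just hands back missing.
allmissing = foldl(plus, [missing, missing])
```

With this wrapper the caller cannot distinguish "the slice was empty after skipping" from an ordinary result, which is one reason the PR handles missing values inside the reduction functions instead.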

I'm not really happy about the amount of duplication this requires, but that's the only solution I could find. Apart from the boilerplate dispatching methods, we could avoid duplicating _mapreducedim! and reducedim_initarray0 but that would introduce several not-so-pretty A isa SkipMissing checks. So maybe maintaining two very similar sets of functions in parallel isn't so bad. It should be quite easy to apply changes to both, as long as one doesn't forget that two methods exist.

Replaces #27818.

andreasnoack · 2018-07-10T10:40:38Z

#27845 might be touching some of the same code.

JeffBezanson · 2018-07-12T18:58:11Z

base/reducedim.jl

@@ -116,7 +116,7 @@ function _reducedim_init(f, op, fv, fop, A, region)
    if T !== Any && applicable(zero, T)
        x = f(zero(T))
        z = op(fv(x), fv(x))
-        Tr = typeof(z) == typeof(x) && !isbitstype(T) ? T : typeof(z)
+        Tr = typeof(z) <: T ? T : typeof(z)


JeffBezanson · 2018-07-12T18:58:25Z

Separate PRs, please.

nalimilan · 2018-07-12T19:36:37Z

Moved the first commit to #28089. I'll keep it here until it's merged since tests won't pass without it.

stillyslalom · 2018-07-13T08:02:19Z

This is a much better implementation & interface than #27818. Cheers!

Allows calling mapreduce and specialized functions with the dims argument on SkipMissing objects. The implementation is modeled on the generic methods, but missing values need to be handled directly inside the functions, both for efficiency and because mapreduce_impl returns a Some object which needs special handling.

nalimilan · 2018-07-18T16:08:43Z

I've rebased and used the same approach as in #27845. Unfortunately, special code is still needed when the first slice contains missing values. Good to go?

Just to check the optimized plain mapreduce is still used:
@nanosoldier runbenchmarks("union", vs=":master")

nanosoldier · 2018-07-18T17:33:55Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

StefanKarpinski · 2018-07-19T18:29:54Z

Why does this need triage? Isn't it just an optimization? Nobody on the triage call really understands what's going on here...

nalimilan · 2018-07-19T18:52:33Z

This one is not an optimization, it adds support for something which isn't currently possible.

StefanKarpinski · 2018-07-19T19:33:38Z

Ok, that's a feature not a breaking change, so also doesn't need triage.

nalimilan · 2018-07-21T00:56:32Z

But somebody needs to decide whether it's OK to merge before 0.7. I think it's important given that missing values are a major feature of that release.

ararslan · 2018-07-21T02:04:05Z

If it passes tests and someone who knows things approves, I'd say merge whenever. ¯\_(ツ)_/¯

JeffBezanson · 2018-07-26T18:24:43Z

Since this seems to be non-breaking, it doesn't really require triage. If people generally approve of it it can be merged any time.

JeffBezanson · 2018-07-26T18:27:01Z

Though I'm a bit concerned about treating skipmissing(A::AbstractArray) as an n-d array, since it can't really act like one.

nalimilan · 2018-07-30T22:20:03Z

Yes but we don't really treat it like an AbstractArray, we just support reduction operations. The only alternative I can see is to add a keyword argument to all reduction functions, which requires duplicating a lot of code and is inconsistent with sum(skipmissing(x)). Do you see another solution for this essential feature?

nalimilan · 2018-09-12T07:44:19Z

Bump. Anybody willing to review this?

nalimilan · 2019-01-05T16:47:13Z

See mini Julep at #30606.

tkf · 2020-04-02T22:05:05Z

base/missing.jl

+            end
+        end
+    else
+        filled = fill!(similar(R, Bool), false)


If you use an external boolean array to keep the "stage" of reduction, using _InitialValue + BottomRF as now done in foldl might be simpler. However, this would need to use Array{Union{T,Missing,_InitialValue}} as the destination/state. It would be nice if there is an API to get Array{Union{T,Missing}} from Array{Union{T,Missing,_InitialValue}} by only re-writing the type tag part of the array. Is there such an API?

nalimilan · 2019-02-11T08:46:46Z

base/missing.jl

+end
+
+# Iterate until we've encountered at least one non-missing value in each slice,
+# and return the min/max non-missing value of all clices


Suggested change

-# and return the min/max non-missing value of all clices
+# and return the min/max non-missing value of all slices

nalimilan · 2020-03-31T09:52:04Z

base/missing.jl

+            # non-missing value in each slice
+            if v0 === missing
+                v0 = nonmissingval(f, $f2, itr, R)
+                R = similar(A1, typeof(v0))


nalimilan · 2020-04-03T08:24:51Z

base/missing.jl

+            end
+        end
+    else
+        filled = fill!(similar(R, Bool), false)


Actually we could simply use missing here to indicate a slice for which we haven't found a non-missing value yet. This is because in the end the initialized array should contain missing only if the slice contains only missing values, in which case an error is thrown.

(Note that the current code just returns a single value, which is the smallest/largest value in the whole array. This relies on the assumption that all entries in the array can be compared with <, which isn't necessarily the case. So returning the smallest/largest value for each slice would make sense, in which case the strategy I describe above should work.)

Regarding your question, I don't think such a conversion method exists, but there's a related issue: #26681.
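A minimal sketch of the strategy described above, written as a standalone function rather than against the PR's internals. The name colmin_skipmissing is hypothetical, the function only covers the column-wise minimum of a matrix, and it assumes Julia 1.3 or later for nonmissingtype.

```julia
# Hypothetical helper (not in Base): column-wise minimum skipping missing
# values, using missing itself to mark slices with no value found yet.
function colmin_skipmissing(A::AbstractMatrix)
    T = nonmissingtype(eltype(A))
    # Every slice starts as missing, meaning "no non-missing value seen yet".
    R = Vector{Union{T,Missing}}(missing, size(A, 2))
    for j in axes(A, 2), i in axes(A, 1)
        x = A[i, j]
        ismissing(x) && continue
        r = R[j]
        # The first non-missing value initializes the slice; later ones are compared.
        R[j] = ismissing(r) ? x : min(r, x)
    end
    # A marker that survives means the slice held only missing values.
    any(ismissing, R) && throw(ArgumentError("a slice contains only missing values"))
    return convert(Vector{T}, R)
end

result = colmin_skipmissing([1 2; missing 4; 0 missing])
```

Because each slice is initialized from its own first non-missing value, values are only ever compared within a slice, and no separate Bool array is needed to track which slices have been filled.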

pdeffebach · 2020-11-04T17:30:36Z

Bumping this, as someone mentioned it on slack. I would assume this is too big a change to get in by the 1.6 feature freeze?

briochemc · 2023-01-20T02:04:35Z

Bump as I stumbled on this just now :)

Was this given up on?

nalimilan added the domain:arrays and domain:missing data (Base.missing and related functionality) labels Jul 10, 2018

This was referenced Jul 10, 2018

Adjust initialization in maximum and minimum #27845

Merged

Test dimensional reduce with non-bitstype #27457

Merged

JeffBezanson reviewed Jul 12, 2018

nalimilan force-pushed the nl/mapreducedim branch from f17f4c0 to 5efb69b Compare July 18, 2018 16:05

Fix ambiguity

548e891

nalimilan added the status:triage This should be discussed on a triage call label Jul 19, 2018

StefanKarpinski removed the status:triage This should be discussed on a triage call label Jul 26, 2018

This was referenced Jan 5, 2019

Replacement for reducedim(X, dim, skipnull=true) JuliaData/Missings.jl#43

Open

Mini Julep: skipmissing indexing #30606

Open

nalimilan mentioned this pull request Feb 11, 2019

Faster mapreduce for Broadcasted #31020

Merged

nalimilan mentioned this pull request Jan 5, 2020

Transducer as an optimization: map, filter and flatten #33526

Merged

nalimilan mentioned this pull request Mar 31, 2020

Fix minimum/maximum over dimensions with missing values #35323

Merged

tkf reviewed Apr 2, 2020

nalimilan commented Apr 3, 2020

nalimilan mentioned this pull request Aug 6, 2020

Mean of an array with missing values does not work if the dims argument is provided JuliaStats/Statistics.jl#7

Open

oxinabox mentioned this pull request Mar 17, 2021

sum and mean of skipmissings don't accept the dims kwarg #40081

Open
