Fail to ignore NaN when calculating mean of an array #4552

zhujinxuan · 2013-10-17T08:40:55Z

I try to evalute mean() on an array containing some NaN, I cannot find a way to ignore the NaN.
And I fail to find any option like na.rm in R or any function like nanmean in matlab

julia> x=[1 2 NaN 4]
1x4 Array{Float64,2}:
 1.0  2.0  NaN  4.0

julia> mean(x)
NaN

timholy · 2013-10-17T10:56:37Z

mean(x[!isnan(x)]).

But a nanmean function could be written to be more efficient, because you could avoid allocating the bitarray. Remember that looping is fast in Julia, so this function has a nearly trivial implementation.

kmsquire · 2013-10-17T17:05:29Z

You could also use a DataArray, which are part of DataFrames.jl, but which soon should be in their own package.

jiahao · 2013-10-17T21:57:07Z

The mean(x) method that gets dispatched is simply sum(x)/length(x). Implementing something like nanmean would be a small part of a larger issue, which is whether we want all the statistical functions like sum, mean, var, etc. to have NaN-ignoring variants. In a way it's fine if we don't, since the current behavior is IEEE-compliant. But that is probably being pedantic in a way that is counterproductive, given the common convention to use NaNs in place of missing data.

I would personally prefer having NaN-ignoring variants dispatched without new names. Something like mean(x, :ignorenan) would be much cleaner than introducing a slew of NaN-ignoring functions that otherwise closely parallel the functionality of existing functions.

zhujinxuan · 2013-10-18T07:45:42Z

@timholy
Thank you. But I have another problem.
What should I do if I want mean(x,3,na.rm=true) ?

@kmsquire
Thank you, but I cannot find statistical functions in DataFrames.jl, so my problem will be:
How can I do some statistics with DataArray?

timholy · 2013-10-18T11:11:56Z

It's definitely less trivial if you're wanting to compute means along particular dimension(s); clearly we should provide library functions for this type of thing.

The standard tool is reducedim, which you can read about. You could define

notnanplus(a, b) = isnan(b) ? a : a+b
notnancount(a, b) = a + !isnan(b)
function nanmean(A, region)
    s = reducedim(notnanplus, A, region, zero(eltype(A)))
    n = reducedim(notnancount, A, region, 0)
    s ./ n
end

This won't be ideally efficient because it takes two passes over the array. (You could write a better version using Cartesian.jl.)

But as one nice advantage over matlab, you can compute the mean over multiple dimensions at once; this gives you the correct answer, which is not true of the matlab version nanmean(nanmean(A, 1), 2).

@jiahao I illustrated by calling this nanmean but I am also fine with using keywords. However, you may know that there is a little bit of overhead to keywords; for very low-level functions there may be some merit in just using different function names. I'm fine either way, at least until someone runs into trouble with one approach vs the other.

@kmsquire, glad to hear that DataArray is going to be split out! I have been wanting a way to indicate bad pixels in images where the pixel value is encoded by integers (and hence NaN is not available). I may have some contributions (e.g., different encodings of NA pixels, such as using an outer-product).

johnmyleswhite · 2013-10-18T13:44:04Z

Chiming in briefly: DataArray is already fully split out and feature complete, it's actually DataFrames that's holding us back since the dev branch called dataarray that treats DataArray as an external dependency has a few remaining bugs. If anyone wants to use DataArray and work on this, please go ahead. I believe @simonster or @nfoti have already done some of this.

timholy · 2013-10-18T14:16:03Z

Thanks, @johnmyleswhite.

zhujinxuan · 2013-10-19T01:11:10Z

Thanks, @johnmyleswhite , @timholy , @kmsquire

I am also interested about whether the feature of nan-ignoring will be available in the next version of julia or the next version of DataFrames.

As I often encounter observation data with NA, I think this sort of function is very valueable for the scientists which deal with observations much.

jiahao · 2013-10-20T19:20:23Z

I still feel that this is the kind of use case that motivated DataFrames.jl in the first place. This thread in julia-dev lays it out very clearly - there is even a discussion about the suitability of NaNs to mark missing data (in DataFrames, this is handled by a special NA value).

Given the back history, I think the right thing to do is to leave the core statistical functions as they are, and relegate NaN-ignoring applications to DataFrames.

jiahao · 2013-10-20T19:20:52Z

Filing under 0.3 milestone if we decide to make this change.

zhujinxuan · 2013-10-21T02:12:44Z

Thank you, @jiahao .
I understand.

BobPortmann · 2013-10-21T04:10:23Z

Given the back history, I think the right thing to do is to leave the core statistical functions as they are, and relegate NaN-ignoring applications to DataFrames.

Do you mean DataFrames or DataArrays? It would be unfortunate to have to use DataFrames to get standard functions to ignore NaN or NA's. DataFrames are a bad fit for higher dimensional data.

I would be in favor of both a DIM (or region) input and a NaN keyword for all the standard statistical functions and then the DataArray versions could add a NA keyword as well. The NaN kw is really most useful when used in combination with a DIM (or region).

johnmyleswhite · 2013-10-21T15:26:03Z

Yes, everything statistical should happen at the level of DataArrays, not DataFrames. That's part of the reason we split DataArrays into a separate package that can be used without DataFrames.

jiahao · 2013-10-28T19:29:57Z

I think the right thing to do is to leave the Base implementations unchanged, and have applications which require missing data handling use the DataArrays package, which was designed to handle missing data.

Recommend users who need to handle missing data to use the DataArrays package. [#4552] TODO: Revisit documentation of functions currently not overloaded by DataArrays. [JuliaStats/DataArrays.jl#3]

deszoeke · 2016-11-16T18:29:26Z

The inability to ignore missing values easily with functions (e.g. nanmean or a mean variant) remains a thorn in the side of Julia 0.5.

I suggest another method or two for mean:
mean(x [,region] ; mask=msk)
where msk could be
-a Boolean array selecting valid (nonmissing) elements of x
-a function so that msk(x) generates such a Boolean array, e.g. mask=isfinite or (x -> x.!=-9999.0)
and elements of x are not averaged where mask evaluates to false, regardless of whether they take on special IEEE values, NaN, Inf, etc. This would be a flexible solution for a variety of missing data implementations.

The solution needs to be very simple to win over matlab users, for example. The user that knows the risks should be able to use IEEE NaN as missing (or any value) when they want to. This practice will continue until standard methods for dealing with missing data are widely adopted in many languages and data set formats. I have never had a problem with doing it.

The DataArrays package is an elegant solution, but requires converting (probably deep copying) any existing data sets that might have NaNs or designated missing or out-of-range values (e.g. -9999) into a new data type that is not widely recognized throughout Base by other languages, data set file formats, or useful packages like PyPlot (matplotlib). Scientific data sets are not yet available in a convention that reads cleanly into DataArray types, leaving the naive scientific user to write custom (inefficient!, buggy!) nanmean functions and kludges and/or fiddle with converting types (inefficient!, buggy!).

mbauman · 2016-11-16T18:43:19Z

@deszoeke There's now also the NaNMath.jl package:

julia> using NaNMath
       NaNMath.mean([1., 2., NaN])
1.5

deszoeke · 2016-11-16T19:52:26Z

NaNMath arithemtic functions don't support specifying a region of dimensions.

This feature doesn't need to be in Base (I wish it were), but it should work like the functions in Base.

New function methods are required because mere wrappers on Base.mean functions can't efficiently change the accumulators to work as they do in NaNMath.

nalimilan · 2016-11-16T21:51:43Z

With generators you can now do this efficiently and with a relatively short syntax:

julia> X = [1.0, NaN]
2-element Array{Float64,1}:
   1.0
 NaN  

julia> mean(x for x in X if !isnan(x))
1.0

Though I think having an argument in Base to skip NaN would make sense. NullableArrays (the replacement for DataArrays) use skipnull=true for that; unfortunately, that terminology doesn't really apply to NaN.

StefanKarpinski · 2016-11-16T23:30:10Z

If we merge ! for function negation, then mean(filter(!isnan, X)) would work, with the only issue being that it's not lazy and creates a temporary array.

deszoeke · 2016-11-17T00:03:14Z

Some tests

julia> X=ones(3,4,5)
julia> X[1:8:end]=NaN
julia> @time mean(x for x in X if !isnan(x))
  0.177737 seconds (108.66 k allocations: 4.557 MB)
1.0
julia> @time mean(filter(x -> !isnan(x),X)) # have to lambda
  0.050281 seconds (13.16 k allocations: 598.502 KB)
1.0

But I want to be able to average over particular dimensions. The simple competitive statement from matlab is Y=nanmean(X,dim), though as @timholy notes, nesting that to average over several dimensions doesn't do the same thing as averaging all values equally weighted. Combining comprehensions to select dimensions,

@time [ mean(x for x in X[:,:,k] if !isnan(x)) for k=1:size(X,3) ]
  0.248298 seconds (139.26 k allocations: 5.910 MB)
5-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
@time [ mean(filter(x->!isnan(x), X[:,:,k])) for k=1:size(X,3) ]
  0.227202 seconds (84.39 k allocations: 3.815 MB)

It's a bit of a mouthful and at >0.2 s it seems quite slow. Maybe because comprehensions in the REPL have to compile or aren't inlined.

I like both approaches for their generality, but I suspect that test for missing values is faster to do at the level of the accumulator within mean.

timholy · 2016-11-20T21:48:32Z

A general tip is that when you want something for multidmensional arrays, when in doubt look at the Images.jl package:

julia> using Images

julia> A = rand(3,5)
3×5 Array{Float64,2}:
 0.621814  0.36342   0.308406  0.249511  0.866264
 0.844332  0.491472  0.960657  0.284304  0.291467
 0.315552  0.640316  0.404262  0.810443  0.910141

julia> A[1,2] = NaN
NaN

julia> meanfinite(A, 2)
3×1 Array{Float64,2}:
 0.511499
 0.574446
 0.616143

julia> meanfinite(A, 1)
1×5 Array{Float64,2}:
 0.593899  0.565894  0.557775  0.448086  0.68929

julia> A = rand(1000,1000); A[2:100:900, 5:80:200] = NaN;

julia> @time meanfinite(A, 1);
  0.002133 seconds (155 allocations: 32.188 KB)

julia> @time meanfinite(A, 2);
  0.000850 seconds (29 allocations: 24.625 KB)

davidzentlermunro · 2018-02-12T12:35:25Z

To confirm, is the quickest (ready made) way of ignoring Nans in a mean to use meanfinite in images? What about for calculating median while ignoring NaNs?

nalimilan · 2018-02-12T12:52:09Z

You can check it yourself whether it's as fast as meanfinite, but AFAICT mean(x for x in X if !isnan(x)) or mean(Iterators.filter(!isnan, x)) should be reasonably efficient (it doesn't work over dimensions though). For the median, median!(filter(!isnan, x)) shouldn't be too bad either.

deszoeke · 2018-03-05T19:35:13Z

+1 for meanfinite in the Images.jl package.

It's unfortunate for Julia that really useful statistical operators are hidden away in unexpected places like the Images.jl package. I'm not doing image processing, so why would I look there?

Inspired by meanfinite I tried out functional programming to build functions from the separate functions of the inner operation (e.g. a product for covariance), the filtering (e.g. isfinite), and the accumulation (e.g. sum). I'm not very good with making packages, but might be able to pull a draft together if there is interest.

deszoeke · 2018-05-17T15:51:51Z

See https://github.com/deszoeke/ConditionalMean.jl for a reworking of the flexibly-dimensioned summations in Images.meanfinite with an arbitrary condition (including ignoring sentinel values, i.e. NaN, -999). It also can average a callable function of the array.

nanmean() and nanstd() methods are provided.

Tests and pull requests are welcome.

deszoeke · 2018-05-17T16:02:14Z

I would like compute the covariance matrix cov, ignoring missing values (which for me are NaNs). ~~For now, I have to go back to MATLAB and use nancov.~~ I have written MATLAB code to do this.

The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this include NA, missing, Null, Unions thereof, and sentinel values for numeric types (e.g. NaN).

This diversity and changing architecture leads to severe usability problems. Searching the discussions, I finds different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it's impossible to tell what works, what's deprecated, or broken.

StefanKarpinski · 2018-05-18T17:09:37Z

@deszoeke, this closed issue is not the place for discussing this. Please post questions, frustrations or package announcements to the Julia discourse discussion forum.

nalimilan · 2018-05-18T17:45:47Z

See https://discourse.julialang.org/t/dataframes-0-11-released/7296/ and JuliaLang/www_old.julialang.org#770.

jiahao closed this as completed Oct 28, 2013

jiahao mentioned this issue Oct 28, 2013

Support other statistics functions in Base JuliaStats/DataArrays.jl#3

Open

11 tasks

jiahao mentioned this issue Dec 21, 2013

maximum function not NaN aware #5207

Closed

jiahao mentioned this issue Apr 10, 2014

Inconsistent handling of NaNs by median and mean functions #6486

Closed

nalimilan mentioned this issue May 8, 2019

ignoring elements with 0 weight JuliaStats/StatsBase.jl#492

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fail to ignore NaN when calculating mean of an array #4552

Fail to ignore NaN when calculating mean of an array #4552

zhujinxuan commented Oct 17, 2013

timholy commented Oct 17, 2013

kmsquire commented Oct 17, 2013

jiahao commented Oct 17, 2013

zhujinxuan commented Oct 18, 2013

timholy commented Oct 18, 2013

johnmyleswhite commented Oct 18, 2013

timholy commented Oct 18, 2013

zhujinxuan commented Oct 19, 2013

jiahao commented Oct 20, 2013

jiahao commented Oct 20, 2013

zhujinxuan commented Oct 21, 2013

BobPortmann commented Oct 21, 2013

johnmyleswhite commented Oct 21, 2013

jiahao commented Oct 28, 2013

deszoeke commented Nov 16, 2016

mbauman commented Nov 16, 2016

deszoeke commented Nov 16, 2016

nalimilan commented Nov 16, 2016

StefanKarpinski commented Nov 16, 2016

deszoeke commented Nov 17, 2016 •

edited

Loading

timholy commented Nov 20, 2016

davidzentlermunro commented Feb 12, 2018

nalimilan commented Feb 12, 2018 •

edited

Loading

deszoeke commented Mar 5, 2018 •

edited

Loading

deszoeke commented May 17, 2018 •

edited

Loading

deszoeke commented May 17, 2018 •

edited

Loading

StefanKarpinski commented May 18, 2018 •

edited

Loading

nalimilan commented May 18, 2018

Fail to ignore NaN when calculating mean of an array #4552

Fail to ignore NaN when calculating mean of an array #4552

Comments

zhujinxuan commented Oct 17, 2013

timholy commented Oct 17, 2013

kmsquire commented Oct 17, 2013

jiahao commented Oct 17, 2013

zhujinxuan commented Oct 18, 2013

timholy commented Oct 18, 2013

johnmyleswhite commented Oct 18, 2013

timholy commented Oct 18, 2013

zhujinxuan commented Oct 19, 2013

jiahao commented Oct 20, 2013

jiahao commented Oct 20, 2013

zhujinxuan commented Oct 21, 2013

BobPortmann commented Oct 21, 2013

johnmyleswhite commented Oct 21, 2013

jiahao commented Oct 28, 2013

deszoeke commented Nov 16, 2016

mbauman commented Nov 16, 2016

deszoeke commented Nov 16, 2016

nalimilan commented Nov 16, 2016

StefanKarpinski commented Nov 16, 2016

deszoeke commented Nov 17, 2016 • edited Loading

timholy commented Nov 20, 2016

davidzentlermunro commented Feb 12, 2018

nalimilan commented Feb 12, 2018 • edited Loading

deszoeke commented Mar 5, 2018 • edited Loading

deszoeke commented May 17, 2018 • edited Loading

deszoeke commented May 17, 2018 • edited Loading

StefanKarpinski commented May 18, 2018 • edited Loading

nalimilan commented May 18, 2018

deszoeke commented Nov 17, 2016 •

edited

Loading

nalimilan commented Feb 12, 2018 •

edited

Loading

deszoeke commented Mar 5, 2018 •

edited

Loading

deszoeke commented May 17, 2018 •

edited

Loading

deszoeke commented May 17, 2018 •

edited

Loading

StefanKarpinski commented May 18, 2018 •

edited

Loading