Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail to ignore NaN when calculating mean of an array #4552

Closed
zhujinxuan opened this issue Oct 17, 2013 · 28 comments
Closed

Fail to ignore NaN when calculating mean of an array #4552

zhujinxuan opened this issue Oct 17, 2013 · 28 comments
Labels
needs decision A decision on this change is needed
Milestone

Comments

@zhujinxuan
Copy link

I try to evalute mean() on an array containing some NaN, I cannot find a way to ignore the NaN.
And I fail to find any option like na.rm in R or any function like nanmean in matlab

julia> x=[1 2 NaN 4]
1x4 Array{Float64,2}:
 1.0  2.0  NaN  4.0

julia> mean(x)
NaN
@timholy
Copy link
Member

timholy commented Oct 17, 2013

mean(x[!isnan(x)]).

But a nanmean function could be written to be more efficient, because you could avoid allocating the bitarray. Remember that looping is fast in Julia, so this function has a nearly trivial implementation.

@kmsquire
Copy link
Member

You could also use a DataArray, which are part of DataFrames.jl, but which soon should be in their own package.

@jiahao
Copy link
Member

jiahao commented Oct 17, 2013

The mean(x) method that gets dispatched is simply sum(x)/length(x). Implementing something like nanmean would be a small part of a larger issue, which is whether we want all the statistical functions like sum, mean, var, etc. to have NaN-ignoring variants. In a way it's fine if we don't, since the current behavior is IEEE-compliant. But that is probably being pedantic in a way that is counterproductive, given the common convention to use NaNs in place of missing data.

I would personally prefer having NaN-ignoring variants dispatched without new names. Something like mean(x, :ignorenan) would be much cleaner than introducing a slew of NaN-ignoring functions that otherwise closely parallel the functionality of existing functions.

@zhujinxuan
Copy link
Author

@timholy
Thank you. But I have another problem.
What should I do if I want mean(x,3,na.rm=true) ?

@kmsquire
Thank you, but I cannot find statistical functions in DataFrames.jl, so my problem will be:
How can I do some statistics with DataArray?

@timholy
Copy link
Member

timholy commented Oct 18, 2013

It's definitely less trivial if you're wanting to compute means along particular dimension(s); clearly we should provide library functions for this type of thing.

The standard tool is reducedim, which you can read about. You could define

notnanplus(a, b) = isnan(b) ? a : a+b
notnancount(a, b) = a + !isnan(b)
function nanmean(A, region)
    s = reducedim(notnanplus, A, region, zero(eltype(A)))
    n = reducedim(notnancount, A, region, 0)
    s ./ n
end

This won't be ideally efficient because it takes two passes over the array. (You could write a better version using Cartesian.jl.)

But as one nice advantage over matlab, you can compute the mean over multiple dimensions at once; this gives you the correct answer, which is not true of the matlab version nanmean(nanmean(A, 1), 2).

@jiahao I illustrated by calling this nanmean but I am also fine with using keywords. However, you may know that there is a little bit of overhead to keywords; for very low-level functions there may be some merit in just using different function names. I'm fine either way, at least until someone runs into trouble with one approach vs the other.

@kmsquire, glad to hear that DataArray is going to be split out! I have been wanting a way to indicate bad pixels in images where the pixel value is encoded by integers (and hence NaN is not available). I may have some contributions (e.g., different encodings of NA pixels, such as using an outer-product).

@johnmyleswhite
Copy link
Member

Chiming in briefly: DataArray is already fully split out and feature complete, it's actually DataFrames that's holding us back since the dev branch called dataarray that treats DataArray as an external dependency has a few remaining bugs. If anyone wants to use DataArray and work on this, please go ahead. I believe @simonster or @nfoti have already done some of this.

@timholy
Copy link
Member

timholy commented Oct 18, 2013

Thanks, @johnmyleswhite.

@zhujinxuan
Copy link
Author

Thanks, @johnmyleswhite , @timholy , @kmsquire

I am also interested about whether the feature of nan-ignoring will be available in the next version of julia or the next version of DataFrames.

As I often encounter observation data with NA, I think this sort of function is very valueable for the scientists which deal with observations much.

@jiahao
Copy link
Member

jiahao commented Oct 20, 2013

I still feel that this is the kind of use case that motivated DataFrames.jl in the first place. This thread in julia-dev lays it out very clearly - there is even a discussion about the suitability of NaNs to mark missing data (in DataFrames, this is handled by a special NA value).

Given the back history, I think the right thing to do is to leave the core statistical functions as they are, and relegate NaN-ignoring applications to DataFrames.

@jiahao
Copy link
Member

jiahao commented Oct 20, 2013

Filing under 0.3 milestone if we decide to make this change.

@zhujinxuan
Copy link
Author

Thank you, @jiahao .
I understand.

@BobPortmann
Copy link
Contributor

Given the back history, I think the right thing to do is to leave the core statistical functions as they are, and relegate NaN-ignoring applications to DataFrames.

Do you mean DataFrames or DataArrays? It would be unfortunate to have to use DataFrames to get standard functions to ignore NaN or NA's. DataFrames are a bad fit for higher dimensional data.

I would be in favor of both a DIM (or region) input and a NaN keyword for all the standard statistical functions and then the DataArray versions could add a NA keyword as well. The NaN kw is really most useful when used in combination with a DIM (or region).

@johnmyleswhite
Copy link
Member

Yes, everything statistical should happen at the level of DataArrays, not DataFrames. That's part of the reason we split DataArrays into a separate package that can be used without DataFrames.

@jiahao
Copy link
Member

jiahao commented Oct 28, 2013

I think the right thing to do is to leave the Base implementations unchanged, and have applications which require missing data handling use the DataArrays package, which was designed to handle missing data.

@jiahao jiahao closed this as completed Oct 28, 2013
jiahao added a commit that referenced this issue Oct 28, 2013
Recommend users who need to handle missing data to use the DataArrays
package. [#4552]

TODO: Revisit documentation of functions currently not overloaded by
      DataArrays. [JuliaStats/DataArrays.jl#3]
@deszoeke
Copy link

The inability to ignore missing values easily with functions (e.g. nanmean or a mean variant) remains a thorn in the side of Julia 0.5.

I suggest another method or two for mean:
mean(x [,region] ; mask=msk)
where msk could be
-a Boolean array selecting valid (nonmissing) elements of x
-a function so that msk(x) generates such a Boolean array, e.g. mask=isfinite or (x -> x.!=-9999.0)
and elements of x are not averaged where mask evaluates to false, regardless of whether they take on special IEEE values, NaN, Inf, etc. This would be a flexible solution for a variety of missing data implementations.

The solution needs to be very simple to win over matlab users, for example. The user that knows the risks should be able to use IEEE NaN as missing (or any value) when they want to. This practice will continue until standard methods for dealing with missing data are widely adopted in many languages and data set formats. I have never had a problem with doing it.

The DataArrays package is an elegant solution, but requires converting (probably deep copying) any existing data sets that might have NaNs or designated missing or out-of-range values (e.g. -9999) into a new data type that is not widely recognized throughout Base by other languages, data set file formats, or useful packages like PyPlot (matplotlib). Scientific data sets are not yet available in a convention that reads cleanly into DataArray types, leaving the naive scientific user to write custom (inefficient!, buggy!) nanmean functions and kludges and/or fiddle with converting types (inefficient!, buggy!).

@mbauman
Copy link
Member

mbauman commented Nov 16, 2016

@deszoeke There's now also the NaNMath.jl package:

julia> using NaNMath
       NaNMath.mean([1., 2., NaN])
1.5

@deszoeke
Copy link

NaNMath arithemtic functions don't support specifying a region of dimensions.

This feature doesn't need to be in Base (I wish it were), but it should work like the functions in Base.

New function methods are required because mere wrappers on Base.mean functions can't efficiently change the accumulators to work as they do in NaNMath.

@nalimilan
Copy link
Member

With generators you can now do this efficiently and with a relatively short syntax:

julia> X = [1.0, NaN]
2-element Array{Float64,1}:
   1.0
 NaN  

julia> mean(x for x in X if !isnan(x))
1.0

Though I think having an argument in Base to skip NaN would make sense. NullableArrays (the replacement for DataArrays) use skipnull=true for that; unfortunately, that terminology doesn't really apply to NaN.

@StefanKarpinski
Copy link
Member

If we merge ! for function negation, then mean(filter(!isnan, X)) would work, with the only issue being that it's not lazy and creates a temporary array.

@deszoeke
Copy link

deszoeke commented Nov 17, 2016

Some tests

julia> X=ones(3,4,5)
julia> X[1:8:end]=NaN
julia> @time mean(x for x in X if !isnan(x))
  0.177737 seconds (108.66 k allocations: 4.557 MB)
1.0
julia> @time mean(filter(x -> !isnan(x),X)) # have to lambda
  0.050281 seconds (13.16 k allocations: 598.502 KB)
1.0

But I want to be able to average over particular dimensions. The simple competitive statement from matlab is Y=nanmean(X,dim), though as @timholy notes, nesting that to average over several dimensions doesn't do the same thing as averaging all values equally weighted. Combining comprehensions to select dimensions,

@time [ mean(x for x in X[:,:,k] if !isnan(x)) for k=1:size(X,3) ]
  0.248298 seconds (139.26 k allocations: 5.910 MB)
5-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
@time [ mean(filter(x->!isnan(x), X[:,:,k])) for k=1:size(X,3) ]
  0.227202 seconds (84.39 k allocations: 3.815 MB)

It's a bit of a mouthful and at >0.2 s it seems quite slow. Maybe because comprehensions in the REPL have to compile or aren't inlined.

I like both approaches for their generality, but I suspect that test for missing values is faster to do at the level of the accumulator within mean.

@timholy
Copy link
Member

timholy commented Nov 20, 2016

A general tip is that when you want something for multidmensional arrays, when in doubt look at the Images.jl package:

julia> using Images

julia> A = rand(3,5)
3×5 Array{Float64,2}:
 0.621814  0.36342   0.308406  0.249511  0.866264
 0.844332  0.491472  0.960657  0.284304  0.291467
 0.315552  0.640316  0.404262  0.810443  0.910141

julia> A[1,2] = NaN
NaN

julia> meanfinite(A, 2)
3×1 Array{Float64,2}:
 0.511499
 0.574446
 0.616143

julia> meanfinite(A, 1)
1×5 Array{Float64,2}:
 0.593899  0.565894  0.557775  0.448086  0.68929

julia> A = rand(1000,1000); A[2:100:900, 5:80:200] = NaN;

julia> @time meanfinite(A, 1);
  0.002133 seconds (155 allocations: 32.188 KB)

julia> @time meanfinite(A, 2);
  0.000850 seconds (29 allocations: 24.625 KB)

@davidzentlermunro
Copy link

To confirm, is the quickest (ready made) way of ignoring Nans in a mean to use meanfinite in images? What about for calculating median while ignoring NaNs?

@nalimilan
Copy link
Member

nalimilan commented Feb 12, 2018

You can check it yourself whether it's as fast as meanfinite, but AFAICT mean(x for x in X if !isnan(x)) or mean(Iterators.filter(!isnan, x)) should be reasonably efficient (it doesn't work over dimensions though). For the median, median!(filter(!isnan, x)) shouldn't be too bad either.

@deszoeke
Copy link

deszoeke commented Mar 5, 2018

+1 for meanfinite in the Images.jl package.

It's unfortunate for Julia that really useful statistical operators are hidden away in unexpected places like the Images.jl package. I'm not doing image processing, so why would I look there?

Inspired by meanfinite I tried out functional programming to build functions from the separate functions of the inner operation (e.g. a product for covariance), the filtering (e.g. isfinite), and the accumulation (e.g. sum). I'm not very good with making packages, but might be able to pull a draft together if there is interest.

@deszoeke
Copy link

deszoeke commented May 17, 2018

See https://github.com/deszoeke/ConditionalMean.jl for a reworking of the flexibly-dimensioned summations in Images.meanfinite with an arbitrary condition (including ignoring sentinel values, i.e. NaN, -999). It also can average a callable function of the array.

nanmean() and nanstd() methods are provided.

Tests and pull requests are welcome.

@deszoeke
Copy link

deszoeke commented May 17, 2018

I would like compute the covariance matrix cov, ignoring missing values (which for me are NaNs). For now, I have to go back to MATLAB and use nancov. I have written MATLAB code to do this.

The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this include NA, missing, Null, Unions thereof, and sentinel values for numeric types (e.g. NaN).

This diversity and changing architecture leads to severe usability problems. Searching the discussions, I finds different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it's impossible to tell what works, what's deprecated, or broken.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented May 18, 2018

@deszoeke, this closed issue is not the place for discussing this. Please post questions, frustrations or package announcements to the Julia discourse discussion forum.

@nalimilan
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed
Projects
None yet
Development

No branches or pull requests