Fail to ignore NaN when calculating mean of an array #4552
Comments
But a |
You could also use a DataArray, which is part of DataFrames.jl but should soon be in its own package. |
The I would personally prefer having |
It's definitely less trivial if you're wanting to compute means along particular dimension(s); clearly we should provide library functions for this type of thing. The standard tool is
This won't be ideally efficient because it takes two passes over the array. (You could write a better version using Cartesian.jl.) But as one nice advantage over matlab, you can compute the mean over multiple dimensions at once; this gives you the correct answer, which is not true of the matlab version
@jiahao I illustrated by calling this
@kmsquire, glad to hear that DataArray is going to be split out! I have been wanting a way to indicate bad pixels in images where the pixel value is encoded by integers (and hence NaN is not available). I may have some contributions (e.g., different encodings of NA pixels, such as using an outer-product). |
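The two-pass idea described above (sum the non-NaN values, then divide by how many there were) might look roughly like this in Julia 1.x. This is only a sketch: `nanmean_dims` is a hypothetical name, not a function in Base or in any package mentioned in this thread, and it assumes the `dims` keyword form of `sum`.

```julia
# Hypothetical helper, not part of Base: NaN-ignoring mean over `dims`.
function nanmean_dims(A::AbstractArray{<:AbstractFloat}; dims)
    # Pass 1: sum the values, treating NaN as zero.
    total = sum(x -> isnan(x) ? zero(x) : x, A; dims=dims)
    # Pass 2: count the entries that are not NaN.
    n = sum(x -> !isnan(x), A; dims=dims)
    # Slices that contain only NaN come out as 0/0 = NaN.
    return total ./ n
end

A = [1.0 NaN; 3.0 4.0]
nanmean_dims(A; dims=1)       # 1×2 matrix: 2.0  4.0
nanmean_dims(A; dims=(1, 2))  # 1×1 matrix holding 8/3
```

For this `A`, averaging the two column means gives (2.0 + 4.0) / 2 = 3.0, whereas reducing over both dimensions at once gives 8/3. That difference is the point made above about nesting per-dimension means.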
Chiming in briefly: DataArray is already fully split out and feature complete, it's actually DataFrames that's holding us back since the dev branch called |
Thanks, @johnmyleswhite. |
Thanks, @johnmyleswhite, @timholy, @kmsquire. I am also interested in whether NaN-ignoring will be available in the next version of Julia or the next version of DataFrames. As I often encounter observational data with NA, I think this sort of function is very valuable for scientists who deal with observations a lot. |
I still feel that this is the kind of use case that motivated DataFrames.jl in the first place. This thread in julia-dev lays it out very clearly - there is even a discussion about the suitability of Given the back history, I think the right thing to do is to leave the core statistical functions as they are, and relegate |
Filing under |
Thank you, @jiahao . |
Do you mean DataFrames or DataArrays? It would be unfortunate to have to use DataFrames to get standard functions to ignore NaN or NA's. DataFrames are a bad fit for higher dimensional data. I would be in favor of both a DIM (or region) input and a NaN keyword for all the standard statistical functions and then the DataArray versions could add a NA keyword as well. The NaN kw is really most useful when used in combination with a DIM (or region). |
Yes, everything statistical should happen at the level of DataArrays, not DataFrames. That's part of the reason we split DataArrays into a separate package that can be used without DataFrames. |
I think the right thing to do is to leave the |
Recommend users who need to handle missing data to use the DataArrays package. [#4552] TODO: Revisit documentation of functions currently not overloaded by DataArrays. [JuliaStats/DataArrays.jl#3]
The inability to ignore missing values easily with functions (e.g. nanmean or a I suggest another method or two for mean: The solution needs to be very simple to win over matlab users, for example. The user that knows the risks should be able to use IEEE NaN as missing (or any value) when they want to. This practice will continue until standard methods for dealing with missing data are widely adopted in many languages and data set formats. I have never had a problem with doing it. The |
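One way to read the suggestion of letting the user declare NaN "or any value" as missing is a mean that takes a predicate and skips whatever it flags. The following is a minimal single-pass sketch under that assumption; `mean_skipping` and the `-999` sentinel are hypothetical illustrations, not methods proposed verbatim in this thread.

```julia
# Hypothetical helper: mean of the elements for which `ismissingvalue`
# returns false, accumulated in a single pass.
function mean_skipping(ismissingvalue, X)
    total = 0.0
    n = 0
    for x in X
        if !ismissingvalue(x)
            total += x
            n += 1
        end
    end
    return total / n  # 0.0 / 0 gives NaN when every element is skipped
end

mean_skipping(isnan, [1.0, NaN, 3.0])        # 2.0
mean_skipping(x -> x == -999, [1, -999, 5])  # 3.0
```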
@deszoeke There's now also the NaNMath.jl package:
julia> using NaNMath
julia> NaNMath.mean([1., 2., NaN])
1.5 |
This feature doesn't need to be in New function methods are required because mere wrappers on |
With generators you can now do this efficiently and with a relatively short syntax:
julia> X = [1.0, NaN]
2-element Array{Float64,1}:
1.0
NaN
julia> mean(x for x in X if !isnan(x))
1.0
Though I think having an argument in Base to skip |
If we merge |
Some tests
But I want to be able to average over particular dimensions. The simple competitive statement from matlab is Y=nanmean(X,dim), though as @timholy notes, nesting that to average over several dimensions doesn't do the same thing as averaging all values equally weighted. Combining comprehensions to select dimensions,
It's a bit of a mouthful, and at >0.2 s it seems quite slow, maybe because comprehensions in the REPL have to compile or aren't inlined. I like both approaches for their generality, but I suspect that the test for missing values is faster to do at the level of the accumulator within mean. |
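For reference, one way to combine a comprehension with the generator form to average along a chosen dimension is sketched below. It is not necessarily the expression used above; it assumes Julia 1.1 or later for `eachrow` and `eachcol`, and `mean` from the `Statistics` standard library.

```julia
using Statistics  # `mean` lives here in Julia 0.7 and later

A = [1.0 NaN 3.0;
     4.0 5.0 NaN]

# Mean of each row (averaging over dimension 2), skipping NaNs.
row_means = [mean(x for x in row if !isnan(x)) for row in eachrow(A)]
# row_means == [2.0, 4.5]

# Mean of each column (averaging over dimension 1), skipping NaNs.
col_means = [mean(x for x in col if !isnan(x)) for col in eachcol(A)]
# col_means == [2.5, 5.0, 3.0]
```

When timing something like this, the first call at the REPL includes compilation, so wrapping the expression in a function and timing a second call usually gives a fairer number.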
A general tip: when you want something for multidimensional arrays and are in doubt, look at the Images.jl package:
julia> using Images
julia> A = rand(3,5)
3×5 Array{Float64,2}:
0.621814 0.36342 0.308406 0.249511 0.866264
0.844332 0.491472 0.960657 0.284304 0.291467
0.315552 0.640316 0.404262 0.810443 0.910141
julia> A[1,2] = NaN
NaN
julia> meanfinite(A, 2)
3×1 Array{Float64,2}:
0.511499
0.574446
0.616143
julia> meanfinite(A, 1)
1×5 Array{Float64,2}:
0.593899 0.565894 0.557775 0.448086 0.68929
julia> A = rand(1000,1000); A[2:100:900, 5:80:200] = NaN;
julia> @time meanfinite(A, 1);
0.002133 seconds (155 allocations: 32.188 KB)
julia> @time meanfinite(A, 2);
0.000850 seconds (29 allocations: 24.625 KB) |
To confirm, is the quickest (ready-made) way of ignoring NaNs in a mean to use meanfinite in Images? What about calculating a median while ignoring NaNs? |
You can check it yourself whether it's as fast as |
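On the median question, a minimal sketch that needs nothing beyond the `Statistics` standard library is to drop the NaNs before calling `median`. `nanmedian` here is a hypothetical one-liner, not a function from Images.jl, and it has not been benchmarked against the alternatives discussed in this thread.

```julia
using Statistics

# Hypothetical helper: `filter` copies the non-NaN values, then `median` runs on the copy.
nanmedian(v) = median(filter(!isnan, v))

nanmedian([1.0, NaN, 3.0, 10.0])  # 3.0

# Along a dimension: apply the helper to each row with mapslices.
A = [1.0 NaN 3.0;
     4.0 5.0 NaN]
mapslices(nanmedian, A; dims=2)   # 2×1 matrix: 2.0, 4.5
```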
+1 for It's unfortunate for Julia that really useful statistical operators are hidden away in unexpected places like the Images.jl package. I'm not doing image processing, so why would I look there? Inspired by |
See https://github.com/deszoeke/ConditionalMean.jl for a reworking of the flexibly-dimensioned summations in
Tests and pull requests are welcome. |
I would like to compute the covariance matrix The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this includes This diversity and changing architecture lead to severe usability problems. Searching the discussions, I find different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it's impossible to tell what works, what's deprecated, and what's broken. |
@deszoeke, this closed issue is not the place for discussing this. Please post questions, frustrations or package announcements to the Julia discourse discussion forum. |
I am trying to evaluate mean() on an array containing some NaNs, but I cannot find a way to ignore them.
I also cannot find any option like na.rm in R or any function like nanmean in matlab.