NA/missing values #470
Comments
For the record, my thinking is more along these lines:
Ah, ok. I'll play around more with these options soon. We'll also need boolean and string NAs. (And factor NAs, when we have factors or similar...)
Yikes. With all of these different types of NAs, maybe a parametric type is better. Something like this:
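(The code from this comment is not preserved in this extract; the block below is a hypothetical sketch of a parametric NA wrapper in current Julia syntax. The names `Data`, `na`, and `isna` are illustrative assumptions, not the original proposal.)

```julia
# Hypothetical sketch: a parametric wrapper that carries a "missing" flag
# alongside the value. Names (Data, na, isna) are illustrative only.
struct Data{T}
    value::T
    isna::Bool
end

Data(x) = Data(x, false)                        # a present value
na(::Type{T}) where {T} = Data(zero(T), true)   # an NA of element type T

isna(x::Data) = x.isna
```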
That's going to be far less efficient than what I was proposing above, but it would allow us to express the behavior of NA types once instead of five separate times. You could write generic operations like this:
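(Again a hypothetical sketch, continuing the `Data{T}` wrapper above: the NA-propagation rule is written once, generically, rather than once per element type.)

```julia
# Continuing the sketch above: NA propagation is written once, generically,
# instead of once per element type.
Base.:+(a::Data{T}, b::Data{T}) where {T} =
    (a.isna || b.isna) ? na(T) : Data(a.value + b.value)

Data(1) + Data(2)      # Data{Int64}(3, false)
Data(1) + na(Int)      # Data{Int64}(0, true), i.e. NA propagates
```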
It sounds like what you really want is a parametrized Maybe type, like in Haskell. Would issue #414 make something like that less painful? Annotation would allow the compiler to easily infer whether a function has the capability to return a None or if it is always going to give Justs. Then you can convert to the different payload at the last minute in some cases, instead of dragging an extra byte around all over the place.
Jacob, I don't think function annotation for purity is related to this question. The + operator needs to be able to deal with NAs, but it's impossible to know at compile time whether external data has NAs or not. But I might be misunderstanding... Scala also has an Option type: http://www.codecommit.com/blog/scala/the-option-pattern But you have to wrap everything in Some() all of the time, which feels clunky to me.

Stefan, hm, I dunno, I think performance is important here. If we had immutable arrays, you could do a one-time check at initialization for any NAs in the object, then do a simple check at access time to determine which method to use. But with mutable arrays, that might require some bookkeeping...
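(A rough sketch of the "check once at construction, branch cheaply at access time" idea for immutable arrays; all names here are hypothetical, and `isna` is given a stand-in definition just so the example runs.)

```julia
# Rough sketch of "check once at construction, branch cheaply later".
# isna here is a stand-in definition so the example runs; in practice it
# would depend on how NA is represented for the element type.
isna(x::Float64) = isnan(x)

struct ImmutableNAArray{T}
    data::Vector{T}
    hasna::Bool        # computed once, valid as long as data is never mutated
end
ImmutableNAArray(data::Vector) = ImmutableNAArray(data, any(isna, data))

# Operations pick the fast path when the flag says there are no NAs:
function nasum(v::ImmutableNAArray{T}) where {T}
    v.hasna || return sum(v.data)                                # machine-speed path
    return sum((x for x in v.data if !isna(x)); init = zero(T))  # NA-skipping path
end

nasum(ImmutableNAArray([1.0, NaN, 3.0]))   # 4.0
```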
I suspect that the biggest performance hit here would actually come from the indirect storage that would currently be forced by having arrays of NA objects. If we implement inline storage for arrays of immutable objects, that would go away. The extra boolean operations to track NA values seem kind of unavoidable to me and probably wouldn't be any worse than any of the other approaches. The main issue is that machines don't do things like integer arithmetic with NA semantics for you. NaN behaves basically the way you'd want it to, so for floats only, you could potentially get normal arithmetic speed while supporting NA by making NA a special NaN value.
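(A minimal sketch of the NaN-payload idea for `Float64`; the particular bit pattern below is an arbitrary choice for this sketch, not anything the thread specifies.)

```julia
# Minimal sketch: reserve one NaN bit pattern as "NA" for Float64.
# The particular payload below is an arbitrary choice for this sketch.
const NA_BITS = 0x7ff80000000007a2            # a quiet NaN with a nonzero payload
const NA_F64  = reinterpret(Float64, NA_BITS)

isna(x::Float64) = reinterpret(UInt64, x) == NA_BITS

x = NA_F64
isnan(x)          # true: ordinary float hardware treats it as just another NaN
y = x + 1.0       # arithmetic runs at normal speed and still yields a NaN;
                  # note IEEE 754 does not guarantee the payload survives,
                  # so isna(y) may or may not hold after arithmetic
```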
More info on how other languages/packages deal with NA: http://pandas.pydata.org/pandas-docs/stable/missing_data.html Pandas for Python doesn't really support NA, as NumPy doesn't yet implement it.
There have been some very long discussions on handling NA in NumPy on the numpy-discussion mailing list, and there is a "NEP" here: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
Congrats on Julia. Looks like Julia gets many core things right that other systems such as {matlab, SAS, R} got wrong, such as {pricing, language, performance}. I share Ross Ihaka's view that R gets so many things wrong with respect to performance that it is justified to start from scratch with a new language, and it would be exciting if with Julia we would quickly get a good start from scratch, ideally joining efforts with Ross and other people who desire a restart. Having said this, I hope it is not too late to fix those things that Julia doesn't get right so far. The most obvious thing our little girl Julia needs to learn about is "missing value handling". I dare predict that without proper missing value handling Julia will not be able to replace R, because R gets this quite right (with minor exceptions, see below).

Here is a short story about what happens if one gets missing values (NA) wrong: in SAS, NA<0 -> TRUE. As a consequence, in their PROC SQL, NULL<0 -> TRUE, which breaks the SQL standard. So SAS has different SQL semantics than databases following the SQL standard (like Oracle). Worse than that, in SAS's Access to Oracle interface, SAS feels free to decide whether to pull the data and evaluate a SQL statement in SAS or to push the SQL statement into Oracle for evaluation (because in-database processing can be much faster). SAS as of today pushes SQL conditions without modification, i.e. it does not enforce its deviating semantics when pushing to Oracle. As a consequence, SAS SQL semantics not only deviate from the standard, they are unpredictable in certain contexts. That's quite a mess, so let's save little Julia from such a destiny.

Today Julia doesn't have NA in string, integer and boolean types; only in floating point types does it have NaN:

```
x = 0/0   # create a NaN
```

In logical and comparison operations we can either propagate NAs (and return NA) or short-circuit over them in certain contexts and return TRUE or FALSE:

```
x == y   # FALSE: instead of propagating NA, not confirming equality would be ok in certain contexts (defining a bi-boolean filter on equal values)
x < y    # FALSE: instead of propagating NA, simply not stating that "x < y" might seem OK
```

If we accept that we should propagate NaN or NA in numerical computations (following the IEEE 754 standard), and if we follow E.T. Jaynes in understanding that logical reasoning is a special case of (numerical) probability calculation (http://bayes.wustl.edu/), then we also need to propagate NA in logical reasoning. While it is OK for functions such as isless() and isequal() to not propagate NAs and return bi-booleans, general comparison operators need to return a tri-boolean; R gets this right.

As I see it, it is not a question of whether Julia needs consistent NA handling; the question is how to get there without sacrificing simplicity and performance. Let's start with the question of how to represent NAs and defer until later the question of how to handle NAs. Some people (e.g. in NumPy) suggest representing NAs in a masking vector residing in a separate memory location; this is neither simple nor does it help performance. I think R's solution of sacrificing just one value of a type's domain is the way to go. Using a parametric type (vector of unions) instead would require at least one dedicated bit, which is more expensive RAM-wise and opens all kinds of problems with alignment, or wastes even more RAM.

I truly like Stefan Karpinski's suggestion to mark NAs in the data vector and only store NA reasons in a separate metadata vector (if those reasons are ever needed). Julia should have NAs in all data types, with very few exceptions: there are good reasons to have a true (bi-)"boolean" datatype without NAs (requiring only a single bit), and a tri-boolean "logical" datatype with NAs like in R, but occupying only 2 instead of 32 bits. Unsigned integers could also get away without NAs. In 'ff' (a package enhancing R with on-disk data types) we chose to have NAs for signed but not for unsigned integers. That gives us, for example, an unsigned 2-bit integer that can represent 4-valued factors (covering ATGC for bioinformatics) and a signed 2-bit integer that can represent {NA,-1,0,1}. Not having NAs in unsigned integers does not introduce inconsistencies if we never promote unsigned to signed integers:

```
julia> -2 + convert(Uint8, 1)
julia> typeof(ans)
```

Ouch!

Using the smallest negative integer as the representation of NA has the mathematical beauty of creating a symmetric value range and the practical advantage of being compatible with R (and C code written for R). Note that representing NA by the smallest integer and defining NA to be ordered above the largest integer (in isless()) is inconsistent and has negative performance implications. This is a point R has not solved optimally: R's 'order()' has default 'na.last=TRUE', so by default it sorts {-1,0,1,NA} instead of {NA,-1,0,1}. Sorting in C gives us 'na.first' for free; if we want to implement 'na.last', the comparison function in our sort (called O(n*log(n)) times) changes from a single "x<y" to a much more expensive "x<y ? (ISNA(x) ? FALSE : TRUE) : ((ISNA(y) && !ISNA(x)) ? TRUE : FALSE)". Is there any specific advantage to defining NA as the last value of the domain?

For doubles, R does distinguish between IEEE NaN and a special NA (a NaN with a special payload). I tend to believe that this overcomplicates matters and, following Stefan, that reasons for NAs should be kept separate.

So far for today. Let me know if you would like more thoughts on Julia's NA handling.

Cheers
Jens Oehlschlägel
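(To make the sentinel representation and the na.last sorting cost concrete, here is a rough Julia sketch; the names are illustrative and this is not code from the thread.)

```julia
# Rough sketch (not from the thread): R-style sentinel NA for Int32,
# using the smallest representable value, plus the two sort orders.
const NA_INT32 = typemin(Int32)        # the reserved "NA" bit pattern
isna(x::Int32) = x == NA_INT32

# "NA first" falls out of the plain machine comparison, because the
# sentinel is the smallest value in the domain:
isless_nafirst(x::Int32, y::Int32) = x < y

# "NA last" needs the extra branching described above:
function isless_nalast(x::Int32, y::Int32)
    isna(x) && return false   # NA is never less than anything
    isna(y) && return true    # everything else is less than NA
    return x < y
end

sort(Int32[1, NA_INT32, -1, 0], lt = isless_nafirst)  # NA, -1, 0, 1
sort(Int32[1, NA_INT32, -1, 0], lt = isless_nalast)   # -1, 0, 1, NA
```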
@joehl, thanks for your interest in Julia! These R-ish things aren't my area, but are certainly important to a large part of the technical computing community. There've been multiple discussions on this topic in -dev, and I think Harlan has some working prototypes. I encourage you to check those out.
I am closing this issue as this discussion is on the mailing list and is being addressed in JuliaData. |
As discussed in this thread, Julia needs to support data with missing values. Current thinking seems to be to create a parallel system of union types (e.g., IntData), promotion, and methods, rather than implementing anything at the bit level (which could be done, at least for floating-point numbers). Note that Matlab suggests overloading NaN for missing data, which is not a good idea, and R uses a NaN payload for floating-point NAs.
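(A hypothetical sketch of what such a parallel union-type setup could look like in current Julia syntax; `IntData` is the name mentioned above, but these definitions and methods are assumptions, not an actual design.)

```julia
# Hypothetical sketch of a parallel union-type setup; IntData is the name
# mentioned above, but these definitions are assumptions, not an actual API.
struct NAtype end
const NA = NAtype()

const IntData = Union{Int, NAtype}     # integer-or-NA element type

# Each operation needs NA-aware methods alongside the ordinary ones:
Base.:+(::NAtype, ::NAtype) = NA
Base.:+(::NAtype, ::Int)    = NA
Base.:+(::Int,    ::NAtype) = NA

x = IntData[1, NA, 3]
x .+ 1          # elements: 2, NA, 4
```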
References:
http://www.pauldickman.com/teaching/sas/missing.php
http://cran.r-project.org/doc/manuals/R-lang.pdf (section 3.3.4)