Improvements

JuliaLang · Apr 2, 2018 · 2aa59cf
1 parent 0405dae
Showing 1 changed file with 89 additions and 84 deletions.
diff --git a/blog/_posts/2018-03-29-missing.md b/blog/_posts/2018-03-29-missing.md
@@ -24,22 +24,19 @@ and do not provide an efficient representation of arrays with missing values[^nu
 (with `Nothing`) are two partial exceptions to this rule, since they provide
 *lifted* operators which operate on `Nullable` arguments and return `Nullable`s.
 
-Drawing from the experience of existing languages, the design of missing values
-in Julia 0.7 closely follows that of SQL's `NULL` and R's `NA`, which can be considered
-as the most consistent implementations with regard to the support of missing values.
-Incidentally, this makes it easy to generate SQL requests from Julia code or to have
-R and Julia interoperate.
-
-## Safety and propagation by default
-
 Julia 0.7 will introduce a new `missing` object used to represent statistical
 missing values. Resulting from intense design discussions, experimentations and language
 improvements developed over several years, it is the heir of the `NA` value
 implemented in the [DataArrays](https://github.com/JuliaStats/DataArrays.jl)
 package, which used to be the standard way of representing missing data in Julia.
 `missing` is actually very similar to its predecessor `NA`, but it benefits from many
 improvements in the Julia compiler and language which make it fast, making it possible
-to allow drop the `DataArray` type and using the standard `Array` type instead[^PDA].
+to drop the `DataArray` type and use the standard `Array` type instead[^PDA].
+Drawing from the experience of existing languages, the design of `missing` closely
+follows that of SQL's `NULL` and R's `NA`, which can be considered
+as the most consistent implementations with regard to the support of missing values.
+Incidentally, this makes it easy to generate SQL requests from Julia code or to have
+R and Julia interoperate.
 
 [^PDA]: The `PooledDataArray` type shipped in the same package can be replaced with
 either [`CategoricalArray`](https://github.com/JuliaData/CategoricalArrays.jl) or
@@ -49,11 +46,69 @@ the data is really categorical or simply contains a small number of distinct val
 This framework is used by
 [version 0.11](https://discourse.julialang.org/t/dataframes-0-11-released/7296/)
 of the [DataFrames](https://github.com/JuliaStats/DataFrames.jl/) package,
-which already works on Julia 0.6, even if performance improvements
-will only become available with Julia 0.7.
+which already works on Julia 0.6 via the [Missings](https://github.com/JuliaData/Missings.jl)
+package, even if performance improvements will only become available with Julia 0.7.
+
+This post illustrates the expression "first class support" by presenting three
+properties of the Julia 0.7 implementation of statistical missing values:
+
+1. Missing values are safe by default: when passed to most functions, they either
+   propagate or throw an error.
+
+2. The `missing` object can be used in combination with any type, be it defined in
+   Base, in a package or in user code.
+
+3. Standard Julia code working with missing values is efficient, without special tricks.
+
+The post first presents the behavior of the new `missing` object, and then details its
+implementation, in particular performance considerations. Finally, current limitations
+and future improvements are discussed.
+
+## The `missing` object: safe and generic missing values
+
+One of Julia's strengths is that user-defined types are as powerful and fast as built-in
+types. To fully take advantage of this, missing values had to support not only standard
+types like `Int`, `Float64` and `String`, but also any custom type. For this reason,
+Julia cannot use the so-called *sentinel* approach like R and Pandas to represent
+missingness, that is reserving special values within a type's domain. For example,
+R represents missing values in integer and boolean vectors using the smallest
+representable 32-bit integer (`-2,147,483,648`), and missing values in floating point
+vectors using a specific `NaN` payload (`1954`, which rumour says refers to Ross Ihaka's
+year of birth). Pandas only supports missing values in floating point vectors,
+and conflates them with `NaN` values.
+
+In order to provide a consistent representation of missing values which can be combined
+with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
+instance of the `Missing` singleton type. This is a normal Julia type for which a series
+of useful methods are implemented. Values which can be either of type `T` or missing
+can simply be declared as `Union{Missing,T}`. For example, a vector holding either integers
+or missing values is of type `Array{Union{Missing,Int},1}`:
 
-In addition to being generic and efficient, the new missing values support in
-Julia 0.7 aims to provide safety, in the sense that missing values should never
+    julia> [1, missing]
+    2-element Array{Union{Missing, Int64},1}:
+    1
+    missing
+
+An interesting property of this approach is that `Array{Union{Missing,T}}` behaves just
+like a normal `Array{T}` as soon as missing values have been replaced or skipped
+(see below).
+
+As can be seen in the example above, promotion rules are defined so that concatenating
+values of type `T` and missing values gives an array with element type `Union{Missing,T}`
+rather than `Any`[^typejoin]:
+
+    julia> promote_type(Int, Missing)
+    Union{Missing, Int64}
+
+[^typejoin]: In addition to these `promote_rule` methods, the `Missing` and `Nothing` types
+implement the internal `promote_typejoin` function, which ensures that functions such
+as `map` and `collect` return arrays with element types `Union{Missing,T}` or
+`Union{Nothing,T}` instead of `Any`.
+
+These promotion rules are essential for performance, as we will see below.
+
+In addition to being generic and efficient, the main design goal of the new `missing`
+framework is to ensure safety, in the sense that missing values should never
 be silently ignored nor replaced with non-missing values. Missing values are a
 delicate issue in statistical work, and a frequent source of bugs or invalid results.
 Ignoring missing values amounts to performing data imputation, which should never
@@ -177,14 +232,14 @@ Short-circuiting operators `&&` and `||`, just like `if` conditions, throw an er
 if they need to evaluate a missing value.
 
 See the [manual](https://docs.julialang.org/en/latest/manual/missing/) for more details
-and illustrations about these rules. Let us note that they are generally consistent with
+and illustrations about these rules. As noted above, they are generally consistent with
 those implemented by SQL's `NULL` and R's `NA`.
 
-## From `NA` and `Nullable` to `missing`
+## From `Nullable` to `missing` and `nothing`
 
-The new `Missing` type also replaces the `Nullable` type introduced in Julia 0.4,
-which turned out not to be the best choice to represent missing values[^jmw]. `Nullable`
-suffered from several issues:
+While it is similar to the previous `NA` value, the new `missing` object also replaces
+the `Nullable` type introduced in Julia 0.4, which turned out not to be the best choice
+to represent missing values[^jmw]. `Nullable` suffered from several issues:
 
 [^jmw]: In [a 2014 blog post](http://www.johnmyleswhite.com/notebook/2014/11/29/whats-wrong-with-statistics-in-julia/),
 John Myles White advocated the use of `Nullable` due to its much higher performance
@@ -227,64 +282,13 @@ in Julia 0.7. Several replacements are provided, depending on the use case:
   if `nothing` is a possible value (i.e. `Nothing <: T`), `Union{Nothing,Some{T}}`
   should be used instead. This pattern is used by e.g. `findfirst` and `tryparse`.
 
-This blog post covers the first case, and should hopefully make it clear why
-it is useful to distinguish `missing` and `nothing`. To give a first insight, let
-us note that the main difference between these two objects is that `missing`
-generally propagates when passed to standard mathematical operators and functions,
-while `nothing` does not implement any specific method and therefore generally
-gives a `MethodError`.
-
-The rest of the post illustrates the expression "first class support" used
-in the title by presenting three properties of the Julia 0.7 implementation of
-statistical missing values:
-
-1. The `missing` object can be used in combination with any type, be it defined in
-   Base, in a package or in user code.
-
-2. Missing values are safe by default: when passed to most functions, they either
-   propagate or throw an error.
-
-3. Standard Julia code working with missing values is efficient, without special tricks.
-
-Finally, current limitations and future improvements are discussed.
-
-## A generic representation
-
-One of Julia's strengths is that user-defined types are as powerful and fast as built-in
-types. To fully take advantage of this, missing values had to support not only standard
-types like `Int`, `Float64` and `String`, but also any custom type. For this reason,
-Julia cannot use the so-called *sentinel* approach like R and Pandas to represent
-missingness, that is reserving special values within a type's domain. For example,
-R represents missing values in integer and boolean vectors using the smallest
-representable 32-bit integer (`-2,147,483,648`), and missing values in floating point
-vectors using a specific `NaN` payload (`1954`, which rumour says refers to Ross Ihaka's
-year of birth). Pandas only supports missing values in floating point vectors,
-and conflates them with `NaN` values.
-
-In order to provide a consistent representation of missing values which can be combined
-with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
-instance of the the `Missing` singleton type. This is a normal Julia type with a few
-peculiarities which are detailed below. Values which can be either of type `T` or missing
-can simply be declared as `Union{Missing,T}`. For example, a vector holding either integers
-or missing values is of type `Array{Union{Missing,Int},1}`:
-
-    julia> [1, missing]
-    2-element Array{Union{Missing, Int64},1}:
-    1
-    missing
-
-An interesting property of this approach is that `Array{Union{Missing,T}}` behaves just
-like a normal `Array{T}` as soon as missing values have been replaced or skipped
-(see below).
-
-As can be seen in the example above, promotion rules are defined so that concatenating
-values of type `T` and missing values gives an array with element type `Union{Missing,T}`
-rather than `Any`:
-
-    julia> promote_type(Int, Missing)
-    Union{Missing, Int64}
-
-These promotion rules are essential for performance, as we will now see.
+This blog post is centered on the first case, and hopefully the description of the behavior
+of `missing` above makes it clear why it is useful to distinguish it from `nothing`.
+Indeed, while `missing` generally propagates when passed to standard mathematical operators
+and functions, `nothing` does not implement any specific method and therefore generally
+gives a `MethodError`, forcing the caller to handle it explicitly. However, considerations
+regarding performance developed below apply equally to `missing` and `nothing` (as well as
+to other custom types in equivalent situations).
 
 ## An efficient representation
 
@@ -314,7 +318,8 @@ constant, which is currently set to 4.
 The second one consists in using a compact memory layout for arrays with `Union`s
 of bits types. The standard `Array` type now uses an optimized memory layout for
 element types which are `Union` of bits types, i.e. immutable types which contain
-no references (see `isbits`). This includes `Missing` and basic types such as
+no references (see the [`isbits`](https://docs.julialang.org/en/latest/base/base/#Base.isbits)
+function). This includes `Missing` and basic types such as
 `Int`, `Float64`, `Complex{Float64}` and `Date`. When `T` is a bits type,
 `Array{Union{Missing,T}}` objects are internally represented as a pair of arrays
 of the same size: an `Array{T}` holding non-missing values and uninitialized memory
@@ -368,13 +373,13 @@ of these types.
 
 Convenience functions would also be useful to propagate missing values with functions
 which have not been written to do it automatically. Constructs like `lift(f, x)`,
-`lift(f)(x)` and `f?(x)` have been
-[discussed](https://discourse.julialang.org/t/operations-on-missing-values/9785)
+`lift(f)(x)` and `f?(x)` have been [discussed](https://github.com/JuliaLang/julia/pull/26661)
 to provide a shorter equivalent of `ismissing(x) ? missing : f(x)`.
 
 Other temporary limitations concern compiler optimizations which are not yet implemented.
-First, code involving missing values is [not yet](https://github.com/JuliaLang/julia/issues/23338)
-as efficient as it could be. Second, conversion between `Array{T}` and
+First, function calls involving missing values [are currently](https://github.com/JuliaLang/julia/issues/23338)
+never inlined, which incurs a significant penalty for fast operations like `getindex`.
+Second, conversion between `Array{T}` and
 `Array{Union{Missing,T}}` currently involves a copy. In theory it should be possible
 to use the same memory for bits types, since only the type tag array differs.
 Third, the Julia compiler is [currently unable](https://github.com/JuliaLang/julia/issues/23336)
@@ -406,9 +411,9 @@ of the most complete even among specialized statistical languages.
 French National Institute for Demographic Studies (Ined), Paris.
 
 **Acknowledgements**: This framework is the result of collective efforts over several
-years. John Myles White lead the reflection around missing values support in Julia
+years. John Myles White led the reflection around missing values support in Julia
 until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
-implemented the efficient memory layout for arrays. Alex Arslan, Jeff Bezanson,
-Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central
+implemented the efficient memory layout for arrays. David Anthoff, Alex Arslan,
+Jeff Bezanson, Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central
 participants in long and complex design discussions which have involved many other
 developers.
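
For readers of this diff, the behavior the revised post describes can be sketched as follows. This is a minimal illustration assuming Julia 0.7 or later, not code from the post. The `lift` helper is hypothetical: it mirrors the `lift(f)(x)` construct the post says has been discussed, and is not a Base function.

```julia
# Mixing Int values and `missing` yields a Union element type rather than Any.
v = [1, missing, 3]
elt = eltype(v)                      # Union{Missing, Int}
promoted = promote_type(Int, Missing)

# Propagation: standard operators return `missing` when given `missing`.
s = 1 + missing                      # missing
c = (missing == 1)                   # missing, not false
total = sum(v)                       # missing, since one element is missing

# Skipping or replacing missing values has to be requested explicitly.
total_skipped = sum(skipmissing(v))  # sum of the non-missing elements
replaced = coalesce.(v, 0)           # replace each missing with 0

# `nothing` implements no arithmetic methods, so it errors instead of propagating.
nothing_errors = try
    1 + nothing
    false
catch err
    err isa MethodError
end

# Hypothetical helper (assumption, not part of Base): a shorter form of
# `ismissing(x) ? missing : f(x)` for functions that do not propagate missing.
lift(f) = x -> ismissing(x) ? missing : f(x)
lifted_missing = lift(uppercase)(missing)
lifted_value = lift(uppercase)("a")
```

The contrast between `total` and `total_skipped` is the safety property the post emphasizes: a missing value is never dropped unless the caller asks for it.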