This repository has been archived by the owner on Jan 12, 2020. It is now read-only.

Commit

Improvements
nalimilan committed Apr 2, 2018
1 parent 0405dae commit 2aa59cf
Showing 1 changed file with 89 additions and 84 deletions.
173 changes: 89 additions & 84 deletions blog/_posts/2018-03-29-missing.md
@@ -24,22 +24,19 @@ and do not provide an efficient representation of arrays with missing values[^nu
(with `Nothing`) are two partial exceptions to this rule, since they provide
*lifted* operators which operate on `Nullable` arguments and return `Nullable`s.

Drawing from the experience of existing languages, the design of missing values
in Julia 0.7 closely follows that of SQL's `NULL` and R's `NA`, which can be considered
as the most consistent implementations with regard to the support of missing values.
Incidentally, this makes it easy to generate SQL requests from Julia code or to have
R and Julia interoperate.

## Safety and propagation by default

Julia 0.7 will introduce a new `missing` object used to represent statistical
missing values. The result of intense design discussions, experimentation and language
improvements developed over several years, it is the heir of the `NA` value
implemented in the [DataArrays](https://github.com/JuliaStats/DataArrays.jl)
package, which used to be the standard way of representing missing data in Julia.
`missing` is actually very similar to its predecessor `NA`, but it benefits from many
improvements in the Julia compiler and language which make it fast, making it possible
to allow drop the `DataArray` type and using the standard `Array` type instead[^PDA].
to drop the `DataArray` type and use the standard `Array` type instead[^PDA].
Drawing from the experience of existing languages, the design of `missing` closely
follows that of SQL's `NULL` and R's `NA`, which can be considered
the most consistent implementations with regard to the support of missing values.
Incidentally, this makes it easy to generate SQL queries from Julia code or to have
R and Julia interoperate.

[^PDA]: The `PooledDataArray` type shipped in the same package can be replaced with
either [`CategoricalArray`](https://github.com/JuliaData/CategoricalArrays.jl) or
@@ -49,11 +46,69 @@ the data is really categorical or simply contains a small number of distinct val
This framework is used by
[version 0.11](https://discourse.julialang.org/t/dataframes-0-11-released/7296/)
of the [DataFrames](https://github.com/JuliaStats/DataFrames.jl/) package,
which already works on Julia 0.6, even if performance improvements
will only become available with Julia 0.7.
which already works on Julia 0.6 via the [Missings](https://github.com/JuliaData/Missings.jl)
package, even if performance improvements will only become available with Julia 0.7.

This post illustrates the expression "first class support" by presenting three
properties of the Julia 0.7 implementation of statistical missing values:

1. Missing values are safe by default: when passed to most functions, they either
propagate or throw an error.

2. The `missing` object can be used in combination with any type, be it defined in
Base, in a package or in user code.

3. Standard Julia code working with missing values is efficient, without special tricks.

The post first presents the behavior of the new `missing` object, and then details its
implementation, in particular performance considerations. Finally, current limitations
and future improvements are discussed.

## The `missing` object: safe and generic missing values

One of Julia's strengths is that user-defined types are as powerful and fast as built-in
types. To fully take advantage of this, missing values had to support not only standard
types like `Int`, `Float64` and `String`, but also any custom type. For this reason,
Julia cannot use the so-called *sentinel* approach like R and Pandas to represent
missingness, that is reserving special values within a type's domain. For example,
R represents missing values in integer and boolean vectors using the smallest
representable 32-bit integer (`-2,147,483,648`), and missing values in floating point
vectors using a specific `NaN` payload (`1954`, which rumour says refers to Ross Ihaka's
year of birth). Pandas only supports missing values in floating point vectors,
and conflates them with `NaN` values.

In order to provide a consistent representation of missing values which can be combined
with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
instance of the `Missing` singleton type. This is a normal Julia type for which a series
of useful methods are implemented. Values which can be either of type `T` or missing
can simply be declared as `Union{Missing,T}`. For example, a vector holding either integers
or missing values is of type `Array{Union{Missing,Int},1}`:

In addition to being generic and efficient, the new missing values support in
Julia 0.7 aims to provide safety, in the sense that missing values should never
    julia> [1, missing]
    2-element Array{Union{Missing, Int64},1}:
     1
     missing

An interesting property of this approach is that `Array{Union{Missing,T}}` behaves just
like a normal `Array{T}` as soon as missing values have been replaced or skipped
(see below).

As can be seen in the example above, promotion rules are defined so that concatenating
values of type `T` and missing values gives an array with element type `Union{Missing,T}`
rather than `Any`[^typejoin]:

    julia> promote_type(Int, Missing)
    Union{Missing, Int64}

[^typejoin]: In addition to these `promote_rule` methods, the `Missing` and `Nothing` types
implement the internal `promote_typejoin` function, which ensures that functions such
as `map` and `collect` return arrays with element types `Union{Missing,T}` or
`Union{Nothing,T}` instead of `Any`.

These promotion rules are essential for performance, as we will see below.
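
The element types described above can be checked directly. The following is a minimal sketch, assuming Julia 0.7 or later; the variable names are illustrative only. It shows that `map` and `collect` keep the narrow `Union{Missing,Int}` element type, and that skipping or replacing missing values gives back a plain `Array{Int}`.

```julia
# A vector mixing integers and missing values gets a small Union element type.
x = [1, missing, 3]
x_eltype = eltype(x)                      # Union{Missing, Int}

# Without dedicated rules, the common supertype of Int and Missing would be Any.
joined = typejoin(Int, Missing)           # Any

# map and collect keep the narrow Union element type instead of widening to Any.
doubled = map(v -> 2v, x)                 # 2 * missing is missing
collected = collect(v for v in Any[1, missing])

# Skipping or replacing missing values yields an ordinary Array{Int}.
skipped = collect(skipmissing(x))         # [1, 3]
replaced = coalesce.(x, 0)                # [1, 0, 3]
```

Here `skipmissing` and `coalesce` are the standard functions for skipping and replacing missing values, respectively.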

In addition to being generic and efficient, the main design goal of the new `missing`
framework is to ensure safety, in the sense that missing values should never
be silently ignored nor replaced with non-missing values. Missing values are a
delicate issue in statistical work, and a frequent source of bugs or invalid results.
Ignoring missing values amounts to performing data imputation, which should never
@@ -177,14 +232,14 @@ Short-circuiting operators `&&` and `||`, just like `if` conditions, throw an er
if they need to evaluate a missing value.

See the [manual](https://docs.julialang.org/en/latest/manual/missing/) for more details
and illustrations about these rules. Let us note that they are generally consistent with
and illustrations about these rules. As noted above, they are generally consistent with
those implemented by SQL's `NULL` and R's `NA`.
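
A short sketch of the "propagate or throw" behavior, assuming Julia 0.7 or later (the helper `condition_error` is a hypothetical name introduced only for this illustration):

```julia
# Arithmetic and comparison operators propagate missing values.
sum_result = missing + 1          # missing
cmp_result = missing == 1         # missing, not false

# isequal always returns a Bool, so it can be used to test for equality safely.
same = isequal(missing, missing)  # true

# The right operand is never evaluated here, so no error is thrown.
short = false && missing          # false

# Using a missing value as a condition throws an error instead of guessing.
function condition_error()
    try
        missing || true
    catch err
        return err
    end
    return nothing
end

caught = condition_error()        # a TypeError
```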

## From `NA` and `Nullable` to `missing`
## From `Nullable` to `missing` and `nothing`

The new `Missing` type also replaces the `Nullable` type introduced in Julia 0.4,
which turned out not to be the best choice to represent missing values[^jmw]. `Nullable`
suffered from several issues:
While it is similar to the previous `NA` value, the new `missing` object also replaces
the `Nullable` type introduced in Julia 0.4, which turned out not to be the best choice
to represent missing values[^jmw]. `Nullable` suffered from several issues:

[^jmw]: In [a 2014 blog post](http://www.johnmyleswhite.com/notebook/2014/11/29/whats-wrong-with-statistics-in-julia/),
John Myles White advocated the use of `Nullable` due to its much higher performance
@@ -227,64 +282,13 @@ in Julia 0.7. Several replacements are provided, depending on the use case:
if `nothing` is a possible value (i.e. `Nothing <: T`), `Union{Nothing,Some{T}}`
should be used instead. This pattern is used by e.g. `findfirst` and `tryparse`.

This blog post covers the first case, and should hopefully make it clear why
it is useful to distinguish `missing` and `nothing`. To give a first insight, let
us note that the main difference between these two objects is that `missing`
generally propagates when passed to standard mathematical operators and functions,
while `nothing` does not implement any specific method and therefore generally
gives a `MethodError`.

The rest of the post illustrates the expression "first class support" used
in the title by presenting three properties of the Julia 0.7 implementation of
statistical missing values:

1. The `missing` object can be used in combination with any type, be it defined in
Base, in a package or in user code.

2. Missing values are safe by default: when passed to most functions, they either
propagate or throw an error.

3. Standard Julia code working with missing values is efficient, without special tricks.

Finally, current limitations and future improvements are discussed.

## A generic representation

One of Julia's strengths is that user-defined types are as powerful and fast as built-in
types. To fully take advantage of this, missing values had to support not only standard
types like `Int`, `Float64` and `String`, but also any custom type. For this reason,
Julia cannot use the so-called *sentinel* approach like R and Pandas to represent
missingness, that is reserving special values within a type's domain. For example,
R represents missing values in integer and boolean vectors using the smallest
representable 32-bit integer (`-2,147,483,648`), and missing values in floating point
vectors using a specific `NaN` payload (`1954`, which rumour says refers to Ross Ihaka's
year of birth). Pandas only supports missing values in floating point vectors,
and conflates them with `NaN` values.

In order to provide a consistent representation of missing values which can be combined
with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
instance of the the `Missing` singleton type. This is a normal Julia type with a few
peculiarities which are detailed below. Values which can be either of type `T` or missing
can simply be declared as `Union{Missing,T}`. For example, a vector holding either integers
or missing values is of type `Array{Union{Missing,Int},1}`:

    julia> [1, missing]
    2-element Array{Union{Missing, Int64},1}:
     1
     missing

An interesting property of this approach is that `Array{Union{Missing,T}}` behaves just
like a normal `Array{T}` as soon as missing values have been replaced or skipped
(see below).

As can be seen in the example above, promotion rules are defined so that concatenating
values of type `T` and missing values gives an array with element type `Union{Missing,T}`
rather than `Any`:

    julia> promote_type(Int, Missing)
    Union{Missing, Int64}

These promotion rules are essential for performance, as we will now see.
This blog post is centered on the first case, and hopefully the description of the behavior
of `missing` above makes it clear why it is useful to distinguish it from `nothing`.
Indeed, while `missing` generally propagates when passed to standard mathematical operators
and functions, `nothing` does not implement any specific method and therefore generally
gives a `MethodError`, forcing the caller to handle it explicitly. However, considerations
regarding performance developed below apply equally to `missing` and `nothing` (as well as
to other custom types in equivalent situations).
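
The contrast between the two objects can be sketched as follows, assuming Julia 0.7 or later. The helper `add_one_to_nothing` and the fallback value `0` are hypothetical choices made for this illustration.

```julia
# missing propagates through arithmetic.
propagated = missing + 1                      # missing

# nothing implements no arithmetic methods, so the same call fails.
function add_one_to_nothing()
    try
        nothing + 1
    catch err
        return err
    end
    return nothing
end

failure = add_one_to_nothing()                # a MethodError

# Functions such as tryparse and findfirst return nothing on failure,
# and the caller has to handle that case explicitly.
parsed = tryparse(Int, "abc")                 # nothing
position = findfirst(isequal(3), [1, 2])      # nothing

# One way to handle it: fall back to a default with something.
value = something(tryparse(Int, "42"), 0)     # 42
fallback = something(parsed, 0)               # 0
```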

## An efficient representation

@@ -314,7 +318,8 @@ constant, which is currently set to 4.
The second one consists in using a compact memory layout for arrays with `Union`s
of bits types. The standard `Array` type now uses an optimized memory layout for
element types which are `Union`s of bits types, i.e. immutable types which contain
no references (see `isbits`). This includes `Missing` and basic types such as
no references (see the [`isbits`](https://docs.julialang.org/en/latest/base/base/#Base.isbits)
function). This includes `Missing` and basic types such as
`Int`, `Float64`, `Complex{Float64}` and `Date`. When `T` is a bits type,
`Array{Union{Missing,T}}` objects are internally represented as a pair of arrays
of the same size: an `Array{T}` holding non-missing values and uninitialized memory
@@ -368,13 +373,13 @@ of these types.

Convenience functions would also be useful to propagate missing values with functions
which have not been written to do it automatically. Constructs like `lift(f, x)`,
`lift(f)(x)` and `f?(x)` have been
[discussed](https://discourse.julialang.org/t/operations-on-missing-values/9785)
`lift(f)(x)` and `f?(x)` have been [discussed](https://github.com/JuliaLang/julia/pull/26661)
to provide a shorter equivalent of `ismissing(x) ? missing : f(x)`.
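
To make the idea concrete, here is a minimal sketch of what such a helper could look like, assuming Julia 0.7 or later. The `lift` function below is hypothetical: it is not part of Base, and the constructs under discussion may end up with a different name or semantics.

```julia
# Hypothetical helper (not part of Base): wrap f so that it returns missing
# when its argument is missing, and calls f otherwise.
lift(f) = x -> ismissing(x) ? missing : f(x)

# Convenience form applying the lifted function immediately.
lift(f, x) = lift(f)(x)

# uppercase has no method for missing, so it needs lifting.
shouted = lift(uppercase, "julia")   # "JULIA"
skipped = lift(uppercase, missing)   # missing

# The curried form composes with map over a vector containing missing values.
lengths = map(lift(length), ["a", missing, "abc"])
```

The single-argument form only handles functions of one argument; a complete design would also need to decide how to treat several arguments.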

Other temporary limitations concern compiler optimizations which are not yet implemented.
First, code involving missing values is [not yet](https://github.com/JuliaLang/julia/issues/23338)
as efficient as it could be. Second, conversion between `Array{T}` and
First, function calls involving missing values [are currently](https://github.com/JuliaLang/julia/issues/23338)
never inlined, which incurs a significant penalty for fast operations like `getindex`.
Second, conversion between `Array{T}` and
`Array{Union{Missing,T}}` currently involves a copy. In theory it should be possible
to use the same memory for bits types, since only the type tag array differs.
Third, the Julia compiler is [currently unable](https://github.com/JuliaLang/julia/issues/23336)
@@ -406,9 +411,9 @@ of the most complete even among specialized statistical languages.
French National Institute for Demographic Studies (Ined), Paris.

**Acknowledgements**: This framework is the result of collective efforts over several
years. John Myles White lead the reflection around missing values support in Julia
years. John Myles White led the reflection around missing values support in Julia
until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
implemented the efficient memory layout for arrays. Alex Arslan, Jeff Bezanson,
Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central
implemented the efficient memory layout for arrays. David Anthoff, Alex Arslan,
Jeff Bezanson, Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central
participants in long and complex design discussions which have involved many other
developers.
