This repository has been archived by the owner on Jan 12, 2020. It is now read-only.

Add post on missing values #770

Merged 9 commits on Jun 20, 2018

nalimilan (Member):
As promised a long time ago.

Comments welcome on contents, technical details, phrasing, presentation, etc.

@ararslan (Member) left a comment:
💯

[version 0.11](https://discourse.julialang.org/t/dataframes-0-11-released/7296/)
of the [DataFrames](https://github.com/JuliaStats/DataFrames.jl/) package,
which already works on Julia 0.6, even if performance improvements
will only become available with Julia 0.7

Member:
Missing a period at the end of this sentence.

[`NullableArray`](https://github.com/JuliaStats/NullableArrays.jl) had to be used
(similar to `DataArray`).

For all these reasons, `Nullable{T}` will

Member:
Will? It already has.

Member Author:
I think I used the future in the rest of the post to imply that these features are going to be available once the release is out. To avoid the ambiguity here, I've changed the phrasing to "no longer exists".

or [Apache Arrow](https://arrow.apache.org/docs/memory_layout.html#null-bitmaps)
use bitmaps equivalent to `BitArray`.

## Safety and propagation by default

Maybe call this "A data scientist's null" (maybe with subtitle like the current title), to aid skimability

....and maybe point out at the very top that poor handling of missing values is a common source of bugs/errors in published work so it's critical to get right (with some links). I think that's a really important motivation for this whole project.

Member Author:
I've added a link. Do you have pointers in fields other than economics?

Regarding the title, I'm not sure using "a data scientist's null" for this section is a good idea given that it applies to all sections (which is mentioned at the top): data scientists need generic, efficient and safe missing values.

Second, the `coalesce` function returns the first non-missing argument, which

I know this name is borrowed from somewhere but I'm not sure where; maybe include footnote/pointer for that?

Member:
Borrowed as in the idea comes from another language? SAS has coalesce.

Member Author:
Actually it exists in SQL.
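
For readers following this thread, here is a minimal sketch of how `coalesce` behaves, assuming Julia 1.x. The values and variable names are illustrative and are not taken from the post under review.

```julia
# Illustrative values; not taken from the post under review.
x = [1, missing, 3]

# Scalar form: the first argument that is not `missing` is returned.
first_value = coalesce(missing, missing, 5)

# Broadcast form: replace each missing entry of `x` with a default of 0.
filled = coalesce.(x, 0)
```

This mirrors the SQL `COALESCE` function mentioned above, which likewise returns its first non-NULL argument.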

See the [manual](https://docs.julialang.org/en/latest/manual/missing/) for more details
and illustrations about these rules. Let us note that they follow very closely those
implemented by SQL's `NULL` and R's `NA`, making it easy to translate Julia code into
SQL requests.

This seems like a really important design decision and should be mentioned at the very top in my opinion

Member Author:
Yeah, I agree. I've added another mention in the intro.
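
The propagation rules discussed in this thread can be sketched as follows, assuming Julia 1.x. The variable names are only for illustration; the behavior shown is the three-valued logic that the excerpt compares to SQL's `NULL` and R's `NA`.

```julia
# Arithmetic and comparisons propagate `missing`.
sum_result = 1 + missing
cmp_result = (missing == 1)

# Three-valued logic: the result is known only when the non-missing
# operand already determines it.
or_known   = true | missing    # true whatever the missing value is
or_unknown = false | missing   # depends on the missing value
and_known  = false & missing   # false whatever the missing value is

# `isequal` always returns a Bool, so it is safe to use in `if` tests.
same = isequal(missing, missing)
```

Because `missing == 1` is itself `missing`, using it directly as an `if` condition raises an error, which is the "safety by default" side of the design.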

@ViralBShah (Member):
Would it be possible to add some timings to the kinds of code patterns that have been reported to be slow - like John Myles White's blog etc.? Just `@time ...` to show how fast things are.

If there are obvious comparisons that can be done with R and Python missing value handling, that would be of broader interest to readers who are not already Julia users - but that may be a bit more work.

@nalimilan (Member Author):
> Would it be possible to add some timings to the kinds of code patterns that have been reported to be slow - like John Myles White's blog etc.? Just `@time ...` to show how fast things are.

I'll have a look. Hopefully my claims will be confirmed... :-p

> If there are obvious comparisons that can be done with R and Python missing value handling, that would be of broader interest to readers who are not already Julia users - but that may be a bit more work.

What kind of comparison do you have in mind? I tend to think the examples are simple enough that readers should be able to identify the equivalent in languages they know very easily.


The first improvement involves optimizations for small `Union` types.
When type inference detects that a variable can hold values of multiple types but
that these types are in limited number (as is the case for `Union{Missing,T}`),

Member:
This would be a good time to be clear on what this actually means so that way it can be referenced in the future. What is a small union type? What kinds of types can this be done with? Is it only unions of two bitstypes that this works with? I would like to see a footnote here.

Member Author:
I'm not sure I'm the best person to document this, but I've added a note reflecting my (limited) understanding of how it works. I'd appreciate if others could confirm it's correct.

Member:
Yeah, we'd need @vtjnash input on what the codegen limitations are here; I'm not sure if there's a limit on the # of union types that codegen will code-split on. Or maybe @Keno can comment on what the new compiler/optimizer does in union code-splitting?

Member Author:
See the footnote I added, there's a constant for that. But it would be good to have somebody check that it's correct.
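
As a rough illustration of the code pattern this thread is about, here is a minimal sketch assuming Julia 1.x. `count_missing` is a hypothetical helper named for this example only. It does not document the compiler's actual limits on union splitting, which is the open question above.

```julia
# Hypothetical helper, named here for illustration only.
function count_missing(v::Vector{Union{Missing,Int}})
    n = 0
    for x in v
        # `x` is inferred as Union{Missing,Int}. Testing identity against
        # `missing` gives each branch a single concrete type to work with.
        if x === missing
            n += 1
        end
    end
    return n
end

v = Union{Missing,Int}[1, missing, 3, missing]
n_missing = count_missing(v)
```

Running `@code_warntype count_missing(v)` (from the `InteractiveUtils` standard library) is one way to inspect how the small `Union` is inferred on a given Julia version.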

package, which used to be the standard way of representing missing data in Julia.
`missing` is actually very similar to its predecessor `NA`, but it benefits from many
improvements in the Julia compiler and language which make it fast, making it possible
to allow drop the `DataArray` type and using the standard `Array` type instead[^PDA].

Member:
"allow drop the" doesn't make sense.

In order to provide a consistent representation of missing values which can be combined
with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
instance of the the `Missing` singleton type. This is a normal Julia type with a few
peculiarities which are detailed below. Values which can be either of type `T` or missing

Member:
Not a fan of the word peculiarities here, which carries just a slight negative connotation; like missing had to be special-cased in the core language or something (which isn't true; just small unions have been optimized).

Member Author:
I was thinking about promote_typejoin, but I realize I haven't mentioned it. That's quite technical, so maybe a footnote will be enough. Anyway I can remove "peculiarities".

Contributor:
You have a “the the” here on line 136.


The second one consists in using a compact memory layout for arrays with `Union`s
of bits types. The standard `Array` type now uses an optimized memory layout for
element types which are `Union` of bits types, i.e. immutable types which contain
no references (see `isbits`). This includes `Missing` and basic types such as

Member:
I'm assuming we want to actually link to isbits here?
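
To make the "bits types" wording in this excerpt concrete, here is a minimal sketch assuming Julia 1.x, where the type-level query is spelled `isbitstype` (`isbits` takes a value). The variable names are illustrative only.

```julia
# Type-level checks: immutable types without references are bits types.
missing_is_bits = isbitstype(Missing)
int_is_bits     = isbitstype(Int)
string_is_bits  = isbitstype(String)   # strings hold a reference

# An array whose element type is a Union of bits types. This is the kind of
# array the excerpt says benefits from the compact memory layout.
A = Union{Missing,Float64}[1.0, missing, 3.0]
```

The sketch only checks the type predicates. It does not inspect the memory layout itself.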

French National Institute for Demographic Studies (Ined), Paris.

**Acknowledgements**: This framework is the result of collective efforts over several
years. John Myles White lead the reflection around missing values support in Julia

Member:
lead => led

years. John Myles White lead the reflection around missing values support in Julia
until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
implemented the efficient memory layout for arrays. Alex Arslan, Jeff Bezanson,
Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central

Member:
you can include yourself in this list as well :) I think @davidanthoff would be a good mention as well, with lots of contributions on things to consider w/ missing values.

Member Author:
As the author of the post, I'm not sure I also need to be in the list. ;-)

I'll add David too.

values are involved. This is not insurmountable since masked SIMD instructions allow applying
an operation only to some values (the non-missing ones). While the absence of SIMD reduces
noticeably the performance of many operations, it appears that Julia already achieves
the same speed as vectorized operations in R (which are implemented in C). So there is

Member Author:
This is not completely true currently due to the absence of inlining, which is why I'd rather wait a bit more for performance improvements before publishing the post.

@nalimilan (Member Author):
I've added benchmarks now that inlining works (JuliaLang/julia#27651), and improved a few things. More comments before merging?

@davidanthoff (Contributor):
Could you change the ack to be more the standard academic language ("thanks X, Y and Z for input, not implying they agree" bla bla)? The current version reads a bit as if there was a consensus about the design, which at least in my case is not the case.

@nalimilan (Member Author):
Can you suggest a wording? I think apart from you the other people I've cited generally agree with the design. I can put you in a separate sentence if you prefer, though it could sound a bit weird.

@davidanthoff (Contributor):
I don't really care what language you use, as long as you don't give the impression that I'm on board with this design.

    function sum_nonmissing(X::AbstractArray)
        s = zero(eltype(X))
        @inbounds @simd for x in X
            if x !== missing

Member:
!ismissing?

Member Author:
Read below... :-p

(Currently it's much slower unfortunately.)

Member:
Whoops. 😅
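
The excerpt under discussion is cut off after the `if` line. A possible completion is sketched below, assuming Julia 1.x. The loop body and the closing lines are assumptions added for illustration, since only the first four lines appear in this thread.

```julia
function sum_nonmissing(X::AbstractArray)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        if x !== missing
            s += x   # assumed body: accumulate only the non-missing values
        end
    end
    return s
end

# Illustrative data, not the benchmark input from the post.
X = Union{Missing,Int}[1, missing, 3]
total = sum_nonmissing(X)

# The same result using the built-in `skipmissing` iterator.
same_total = sum(skipmissing(X))
```

Writing the test as `!ismissing(x)` instead of `x !== missing` gives the same result. According to the author's reply above, it was much slower at the time of this review.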

@nalimilan (Member Author):
> I don't really care what language you use, as long as you don't give the impression that I'm on board with this design.

Precisely, I don't know how to phrase that. Currently I just say you participated in discussions.

@nalimilan (Member Author):
I've tried something, please tell me whether it's OK for you.

I'll merge tomorrow if there are no additional comments.

@davidanthoff (Contributor):
Sounds good, thanks!

@@ -489,8 +489,9 @@ Research scientist at the French Institute for Demographic Studies (Ined), Paris

**Acknowledgements**: This framework is the result of collective efforts over several
years. John Myles White led the reflection around missing values support in Julia
until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
implemented the efficient memory layout for arrays. David Anthoff, Alex Arslan,
until 2016. Jameson Nash and Keno Fisher implemented compiler optimizations, and Jacob Quinn

Member:
Fischer

@nalimilan nalimilan merged commit 27923e7 into master Jun 20, 2018
@nalimilan nalimilan deleted the nl/missing branch June 20, 2018 07:57

8 participants