Add post on missing values #770

nalimilan · 2018-03-29T21:13:34Z

As promised a long time ago.

Comments welcome on contents, technical details, phrasing, presentation, etc.

ararslan

💯

ararslan · 2018-03-29T21:40:59Z

blog/_posts/2018-03-29-missing.md

+[version 0.11](https://discourse.julialang.org/t/dataframes-0-11-released/7296/)
+of the [DataFrames](https://github.com/JuliaStats/DataFrames.jl/) package,
+which already works on Julia 0.6, even if performance improvements
+will only become available with Julia 0.7


Missing a period at the end of this sentence.

ararslan · 2018-03-29T21:42:49Z

blog/_posts/2018-03-29-missing.md

+  [`NullableArray`](https://github.com/JuliaStats/NullableArrays.jl) had to be used
+  (similar to `DataArray`).
+
+For all these reasons, `Nullable{T}` will


Will? It already has.

I think I used the future tense in the rest of the post to imply that these features will be available once the release is out. To avoid the ambiguity here, I've changed the phrasing to "no longer exists".

kleinschmidt · 2018-03-30T02:34:31Z

blog/_posts/2018-03-29-missing.md

+or [Apache Arrow](https://arrow.apache.org/docs/memory_layout.html#null-bitmaps)
+use bitmaps equivalent to `BitArray`.
+
+## Safety and propagation by default


Maybe call this "A data scientist's null" (perhaps with a subtitle like the current title), to aid skimmability.

...and maybe point out at the very top that poor handling of missing values is a common source of bugs/errors in published work, so it's critical to get right (with some links). I think that's a really important motivation for this whole project.

I've added a link. Do you have pointers in fields other than economics?

Regarding the title, I'm not sure using "a data scientist's null" for this section is a good idea given that it applies to all sections (which is mentioned at the top): data scientists need generic, efficient and safe missing values.

kleinschmidt · 2018-03-30T02:38:01Z

blog/_posts/2018-03-29-missing.md

+    1
+    2
+
+Second, the `coalesce` function returns the first non-missing argument, which


I know this name is borrowed from somewhere but I'm not sure where; maybe include footnote/pointer for that?

Borrowed as in the idea comes from another language? SAS has coalesce.

Actually it exists in SQL.
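
For readers following this thread, here is a minimal sketch of the behavior under discussion. It assumes Julia 1.0 or later, and the values are purely illustrative (they are not taken from the post):

```julia
# `coalesce` returns its first argument that is not `missing`,
# much like SQL's COALESCE does with NULL.
a = coalesce(missing, 1)            # 1
b = coalesce(missing, missing, 2)   # 2
c = coalesce(3, missing)            # 3

# Broadcasting it over an array replaces missing entries with a default.
x = [1, missing, 3]
y = coalesce.(x, 0)                 # [1, 0, 3]
```

If every argument is `missing`, the result is `missing`.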

kleinschmidt · 2018-03-30T02:39:25Z

blog/_posts/2018-03-29-missing.md

+See the [manual](https://docs.julialang.org/en/latest/manual/missing/) for more details
+and illustrations about these rules. Let us note that they follow very closely those
+implemented by SQL's `NULL` and R's `NA`, making it easy to translate Julia code into
+SQL requests.


This seems like a really important design decision and should be mentioned at the very top in my opinion

Yeah, I agree. I've added another mention in the intro.

ViralBShah · 2018-03-30T13:36:05Z

Would it be possible to add some timings to the kinds of code patterns that have been reported to be slow - like John Myles White's blog etc.? Just @time ... to show how fast things are.

If there are obvious comparisons that can be done with R and Python missing value handling, that would be of broader interest to readers who are not already Julia users - but that may be a bit more work.

nalimilan · 2018-03-30T15:03:09Z

> Would it be possible to add some timings to the kinds of code patterns that have been reported to be slow - like John Myles White's blog etc.? Just @time ... to show how fast things are.

I'll have a look. Hopefully my claims will be confirmed... :-p

> If there are obvious comparisons that can be done with R and Python missing value handling, that would be of broader interest to readers who are not already Julia users - but that may be a bit more work.

What kind of comparison do you have in mind? I tend to think the examples are simple enough that readers should be able to identify the equivalent in languages they know very easily.

ChrisRackauckas · 2018-03-30T15:17:07Z

blog/_posts/2018-03-29-missing.md

+
+The first improvement involves optimizations for small `Union` types.
+When type inference detects that a variable can hold values of multiple types but
+that these types are in limited number (as is the case for `Union{Missing,T}`),


This would be a good place to be clear about what this actually means, so that it can be referenced in the future. What is a small union type? What kinds of types does this work with? Is it only unions of two bits types? I would like to see a footnote here.

I'm not sure I'm the best person to document this, but I've added a note reflecting my (limited) understanding of how it works. I'd appreciate it if others could confirm it's correct.

Yeah, we'd need @vtjnash's input on what the codegen limitations are here; I'm not sure if there's a limit on the number of union types that codegen will code-split on. Or maybe @Keno can comment on what the new compiler/optimizer does in union code-splitting?

See the footnote I added, there's a constant for that. But it would be good to have somebody check that it's correct.

quinnj · 2018-03-31T21:13:04Z

blog/_posts/2018-03-29-missing.md

+package, which used to be the standard way of representing missing data in Julia.
+`missing` is actually very similar to its predecessor `NA`, but it benefits from many
+improvements in the Julia compiler and language which make it fast, making it possible
+to allow drop the `DataArray` type and using the standard `Array` type instead[^PDA].


"allow drop the" doesn't make sense.

quinnj · 2018-03-31T21:16:45Z

blog/_posts/2018-03-29-missing.md

+In order to provide a consistent representation of missing values which can be combined
+with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
+instance of the the `Missing` singleton type. This is a normal Julia type with a few
+peculiarities which are detailed below. Values which can be either of type `T` or missing


Not a fan of the word peculiarities here, which carries just a slight negative connotation; like missing had to be special-cased in the core language or something (which isn't true; just small unions have been optimized).

I was thinking about `promote_typejoin`, but I realize I haven't mentioned it. That's quite technical, so maybe a footnote will be enough. Anyway I can remove "peculiarities".

You have a “the the” here on line 136.

quinnj · 2018-03-31T21:31:18Z

blog/_posts/2018-03-29-missing.md

+
+The first improvement involves optimizations for small `Union` types.
+When type inference detects that a variable can hold values of multiple types but
+that these types are in limited number (as is the case for `Union{Missing,T}`),


Yeah, we'd need @vtjnash's input on what the codegen limitations are here; I'm not sure if there's a limit on the number of union types that codegen will code-split on. Or maybe @Keno can comment on what the new compiler/optimizer does in union code-splitting?

quinnj · 2018-03-31T21:32:14Z

blog/_posts/2018-03-29-missing.md

+The second one consists in using a compact memory layout for arrays with `Union`s
+of bits types. The standard `Array` type now uses an optimized memory layout for
+element types which are `Union` of bits types, i.e. immutable types which contain
+no references (see `isbits`). This includes `Missing` and basic types such as


I'm assuming we want to actually link to `isbits` here?
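
As a rough illustration of what "bits types" means in the excerpt above, here is a minimal sketch. It assumes Julia 1.0 or later, where the type-level check is spelled `isbitstype`; the array `x` is a made-up example:

```julia
# Immutable types that contain no references are "bits types".
int_is_bits     = isbitstype(Int)       # true
missing_is_bits = isbitstype(Missing)   # true: `Missing` is a singleton type with no fields
string_is_bits  = isbitstype(String)    # false: strings are heap-allocated

# An ordinary `Array` can hold a `Union` of such types.
x = Union{Missing,Int}[1, missing, 3]
T = eltype(x)                           # Union{Missing, Int}
```

This only checks the property the excerpt refers to; it does not by itself show how the array is laid out in memory.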

quinnj · 2018-03-31T21:40:22Z

blog/_posts/2018-03-29-missing.md

+French National Institute for Demographic Studies (Ined), Paris.
+
+**Acknowledgements**: This framework is the result of collective efforts over several
+years. John Myles White lead the reflection around missing values support in Julia


lead => led

quinnj · 2018-03-31T21:41:52Z

blog/_posts/2018-03-29-missing.md

+years. John Myles White lead the reflection around missing values support in Julia
+until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
+implemented the efficient memory layout for arrays. Alex Arslan, Jeff Bezanson,
+Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central


You can include yourself in this list as well :) I think @davidanthoff would be a good mention too, given his many contributions on things to consider with missing values.

As the author of the post, I'm not sure I also need to be in the list. ;-)

I'll add David too.

nalimilan · 2018-04-03T09:49:29Z

blog/_posts/2018-03-29-missing.md

+values are involved. This is not insurmountable since masked SIMD instructions allow applying
+an operation only to some values (the non-missing ones). While the absence of SIMD reduces
+noticeably the performance of many operations, it appears that Julia already achieves
+the same speed as vectorized operations in R (which are implemented in C). So there is


This is not completely true currently due to the absence of inlining, which is why I'd rather wait a bit more for performance improvements before publishing the post.

nalimilan · 2018-06-19T18:39:00Z

I've added benchmarks now that inlining works (JuliaLang/julia#27651), and improved a few things. More comments before merging?

davidanthoff · 2018-06-19T18:44:38Z

Could you change the acknowledgements to use more standard academic language ("thanks to X, Y and Z for input, which does not imply they agree", etc.)? The current version reads a bit as if there were a consensus about the design, which is not true, at least in my case.

nalimilan · 2018-06-19T18:47:06Z

Can you suggest a wording? I think apart from you the other people I've cited generally agree with the design. I can put you in a separate sentence if you prefer, though it could sound a bit weird.

davidanthoff · 2018-06-19T18:49:12Z

I don't really care what language you use, as long as you don't give the impression that I'm on board with this design.

ararslan · 2018-06-19T18:53:12Z

blog/_posts/2018-06-19.md

+    function sum_nonmissing(X::AbstractArray)
+        s = zero(eltype(X))
+        @inbounds @simd for x in X
+            if x !== missing


`!ismissing`?

Read below... :-p

(Currently it's much slower unfortunately.)

Whoops. 😅
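
For context, the two spellings compared in this thread can be sketched as follows. This is only an illustration: the diff excerpt above is truncated, so the loop bodies are an assumption; the `@inbounds @simd` annotations are left out for simplicity; and the second function name is hypothetical. The sketch assumes Julia 1.0 or later and says nothing about the relative speed of the two versions.

```julia
# Sum the non-missing entries, testing with `x !== missing` as in the excerpt.
# The body after the `if` is assumed, since the excerpt is cut off.
function sum_nonmissing(X::AbstractArray)
    s = zero(eltype(X))      # `zero` is defined for Union{T, Missing} and gives zero(T)
    for x in X
        if x !== missing
            s += x
        end
    end
    return s
end

# Hypothetical variant using `!ismissing(x)`, as suggested in the review.
function sum_nonmissing_ismissing(X::AbstractArray)
    s = zero(eltype(X))
    for x in X
        if !ismissing(x)
            s += x
        end
    end
    return s
end

total_a = sum_nonmissing([1, missing, 3])             # 4
total_b = sum_nonmissing_ismissing([1, missing, 3])   # 4
```

Both functions return the same value; the thread is about which one the compiler optimized better at the time.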

nalimilan · 2018-06-19T19:05:41Z

> I don't really care what language you use, as long as you don't give the impression that I'm on board with this design.

Precisely, I don't know how to phrase that. Currently I just say you participated in discussions.

nalimilan · 2018-06-19T20:33:46Z

I've tried something, please tell me whether it's OK for you.

I'll merge tomorrow if there are no additional comments.

davidanthoff · 2018-06-19T21:07:18Z

Sounds good, thanks!

ararslan · 2018-06-19T21:08:47Z

blog/_posts/2018-06-19-missing.md

@@ -489,8 +489,9 @@ Research scientist at the French Institute for Demographic Studies (Ined), Paris

 **Acknowledgements**: This framework is the result of collective efforts over several
 years. John Myles White led the reflection around missing values support in Julia
-until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
-implemented the efficient memory layout for arrays. David Anthoff, Alex Arslan,
+until 2016. Jameson Nash and Keno Fisher implemented compiler optimizations, and Jacob Quinn


Add post on missing values

c9c87c9

ararslan reviewed Mar 29, 2018

kleinschmidt reviewed Mar 30, 2018

Improvements after review

270c86a

ChrisRackauckas reviewed Mar 30, 2018

nalimilan added 2 commits March 31, 2018 15:25

Add footnote anout small Union optimizations

00ccf1c

Move description of the behavior of missing to top

0405dae

quinnj approved these changes Mar 31, 2018

nalimilan force-pushed the nl/missing branch from 2aa59cf to 0046add Compare April 2, 2018 21:12

Improvements

877d674

nalimilan force-pushed the nl/missing branch from 0046add to 877d674 Compare April 2, 2018 21:17

nalimilan commented Apr 3, 2018

nalimilan mentioned this pull request May 18, 2018

Fail to ignore NaN when calculating mean of an array JuliaLang/julia#4552

Closed

Add benchmarks, some improvements

94d6084

ararslan reviewed Jun 19, 2018

Rename file to reflect new date

6e07c83

nalimilan force-pushed the nl/missing branch from 595afe5 to c4840bf Compare June 19, 2018 20:33

ararslan reviewed Jun 19, 2018

Improve acknowledgements

aeedb3c

nalimilan force-pushed the nl/missing branch from c4840bf to aeedb3c Compare June 19, 2018 21:25

Add links to issues

1c18f9f

nalimilan merged commit 27923e7 into master Jun 20, 2018

nalimilan deleted the nl/missing branch June 20, 2018 07:57

nalimilan mentioned this pull request Jun 20, 2018

Fix footnotes syntax in missing values blog post #803

Merged
