💯
blog/_posts/2018-03-29-missing.md
Outdated
[version 0.11](https://discourse.julialang.org/t/dataframes-0-11-released/7296/)
of the [DataFrames](https://github.com/JuliaStats/DataFrames.jl/) package,
which already works on Julia 0.6, even if performance improvements
will only become available with Julia 0.7
Missing a period at the end of this sentence.
blog/_posts/2018-03-29-missing.md
Outdated
[`NullableArray`](https://github.com/JuliaStats/NullableArrays.jl) had to be used
(similar to `DataArray`).

For all these reasons, `Nullable{T}` will
Will? It already has.
I think I used the future in the rest of the post to imply that these features are going to be available once the release is out. To avoid the ambiguity here, I've changed the phrasing to "no longer exists".
blog/_posts/2018-03-29-missing.md
Outdated
or [Apache Arrow](https://arrow.apache.org/docs/memory_layout.html#null-bitmaps)
use bitmaps equivalent to `BitArray`.

## Safety and propagation by default
Maybe call this "A data scientist's null" (maybe with subtitle like the current title), to aid skimability
....and maybe point out at the very top that poor handling of missing values is a common source of bugs/errors in published work so it's critical to get right (with some links). I think that's a really important motivation for this whole project.
I've added a link. Do you have pointers in fields other than economics?
Regarding the title, I'm not sure using "a data scientist's null" for this section is a good idea given that it applies to all sections (which is mentioned at the top): data scientists need generic, efficient and safe missing values.
blog/_posts/2018-03-29-missing.md
Outdated
1
2

Second, the `coalesce` function returns the first non-missing argument, which
I know this name is borrowed from somewhere but I'm not sure where; maybe include footnote/pointer for that?
Borrowed as in the idea comes from another language? SAS has `coalesce`.
Actually it exists in SQL.
blog/_posts/2018-03-29-missing.md
Outdated
See the [manual](https://docs.julialang.org/en/latest/manual/missing/) for more details
and illustrations about these rules. Let us note that they follow very closely those
implemented by SQL's `NULL` and R's `NA`, making it easy to translate Julia code into
SQL requests.
This seems like a really important design decision and should be mentioned at the very top in my opinion
Yeah, I agree. I've added another mention in the intro.
Would it be possible to add some timings for the kinds of code patterns that have been reported to be slow, like those in John Myles White's blog? If there are obvious comparisons that can be done with R and Python missing value handling, that would be of broader interest to readers who are not already Julia users, but that may be a bit more work.
I'll have a look. Hopefully my claims will be confirmed... :-p
What kind of comparison do you have in mind? I tend to think the examples are simple enough that readers should be able to identify the equivalent in languages they know very easily.
blog/_posts/2018-03-29-missing.md
Outdated
The first improvement involves optimizations for small `Union` types.
When type inference detects that a variable can hold values of multiple types but
that these types are in limited number (as is the case for `Union{Missing,T}`),
This would be a good time to be clear on what this actually means so that way it can be referenced in the future. What is a small union type? What kinds of types can this be done with? Is it only unions of two bitstypes that this works with? I would like to see a footnote here.
I'm not sure I'm the best person to document this, but I've added a note reflecting my (limited) understanding of how it works. I'd appreciate if others could confirm it's correct.
See the footnote I added, there's a constant for that. But it would be good to have somebody check that it's correct.
blog/_posts/2018-03-29-missing.md
Outdated
package, which used to be the standard way of representing missing data in Julia.
`missing` is actually very similar to its predecessor `NA`, but it benefits from many
improvements in the Julia compiler and language which make it fast, making it possible
to allow drop the `DataArray` type and using the standard `Array` type instead[^PDA].
"allow drop the" doesn't make sense.
blog/_posts/2018-03-29-missing.md
Outdated
In order to provide a consistent representation of missing values which can be combined
with any type, Julia 0.7 will use `missing`, an object with no fields which is the only
instance of the the `Missing` singleton type. This is a normal Julia type with a few
peculiarities which are detailed below. Values which can be either of type `T` or missing
Not a fan of the word "peculiarities" here, which carries just a slight negative connotation; like `missing` had to be special-cased in the core language or something (which isn't true; just small unions have been optimized).
I was thinking about `promote_typejoin`, but I realize I haven't mentioned it. That's quite technical, so maybe a footnote will be enough. Anyway I can remove "peculiarities".
You have a “the the” here on line 136.
blog/_posts/2018-03-29-missing.md
Outdated
The first improvement involves optimizations for small `Union` types.
When type inference detects that a variable can hold values of multiple types but
that these types are in limited number (as is the case for `Union{Missing,T}`),
blog/_posts/2018-03-29-missing.md
Outdated
The second one consists in using a compact memory layout for arrays with `Union`s
of bits types. The standard `Array` type now uses an optimized memory layout for
element types which are `Union` of bits types, i.e. immutable types which contain
no references (see `isbits`). This includes `Missing` and basic types such as
I'm assuming we want to actually link to `isbits` here?
blog/_posts/2018-03-29-missing.md
Outdated
French National Institute for Demographic Studies (Ined), Paris.

**Acknowledgements**: This framework is the result of collective efforts over several
years. John Myles White lead the reflection around missing values support in Julia
"lead" => "led"
blog/_posts/2018-03-29-missing.md
Outdated
years. John Myles White lead the reflection around missing values support in Julia
until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
implemented the efficient memory layout for arrays. Alex Arslan, Jeff Bezanson,
Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central
you can include yourself in this list as well :) I think @davidanthoff would be a good mention as well, with lots of contributions on things to consider w/ missing values.
As the author of the post, I'm not sure I also need to be in the list. ;-)
I'll add David too.
blog/_posts/2018-03-29-missing.md
Outdated
values are involved. This is not insurmountable since masked SIMD instructions allow applying
an operation only to some values (the non-missing ones). While the absence of SIMD reduces
noticeably the performance of many operations, it appears that Julia already achieves
the same speed as vectorized operations in R (which are implemented in C). So there is
This is not completely true currently due to the absence of inlining, which is why I'd rather wait a bit more for performance improvements before publishing the post.
I've added benchmarks now that inlining works (JuliaLang/julia#27651), and improved a few things. More comments before merging?
Could you change the ack to be more the standard academic language ("thanks X, Y and Z for input, not implying they agree" bla bla)? The current version reads a bit as if there was a consensus about the design, which at least in my case is not the case.
Can you suggest a wording? I think apart from you the other people I've cited generally agree with the design. I can put you in a separate sentence if you prefer, though it could sound a bit weird.
I don't really care what language you use, as long as you don't give the impression that I'm on board with this design.
blog/_posts/2018-06-19.md
Outdated
function sum_nonmissing(X::AbstractArray)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        if x !== missing
`!ismissing`?
Read below... :-p
(Currently it's much slower unfortunately.)
Whoops. 😅
Precisely, I don't know how to phrase that. Currently I just say you participated in discussions.
I've tried something, please tell me whether it's OK for you. I'll merge tomorrow if there are no additional comments.
Sounds good, thanks!
blog/_posts/2018-06-19-missing.md
Outdated
@@ -489,8 +489,9 @@ Research scientist at the French Institute for Demographic Studies (Ined), Paris

**Acknowledgements**: This framework is the result of collective efforts over several
years. John Myles White led the reflection around missing values support in Julia
until 2016. Jameson Nash implemented compiler optimizations, and Jacob Quinn
implemented the efficient memory layout for arrays. David Anthoff, Alex Arslan,
until 2016. Jameson Nash and Keno Fisher implemented compiler optimizations, and Jacob Quinn
Fischer
As promised a long time ago.
Comments welcome on contents, technical details, phrasing, presentation, etc.