Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Roadmap to 0.1 #148

Open
1 of 14 tasks
davidagold opened this issue Sep 18, 2016 · 18 comments
Open
1 of 14 tasks

Roadmap to 0.1 #148

davidagold opened this issue Sep 18, 2016 · 18 comments
Milestone

Comments

@davidagold
Copy link
Contributor

davidagold commented Sep 18, 2016

This package has become sufficiently well-adopted that a 0.1 release effort seems appropriate. Here's the TO-DO I'm thinking of (suggestions welcome):

I don't know whether to close #1 in favor of this or keep the former around as a general roadmap discussion for the package. All of those items in #1 turned out to be 0.0.1 items, and we've been using the 0.0.x series to feel out how NullableArrays.jl plays with the rest of the Julia stats/data ecosystems and find out what in particular NullableArrays are/will be used for. I'm still working out how best to articulate what I think the findings of the 0.0.x series have been. I only have three strong feelings:

  • One of the most useful things I've found about NullableArrays is that they provide type-inferable iteration over arrays that may contain missing values
  • Lifting facilities for higher-order functions (map, broadcast, reduce) should be built-in and possibly on by default
  • Lifting a method by extending it over Nullable arguments is probably not the way to go. I'd prefer if attempts to do this were made elsewhere.
@davidagold
Copy link
Contributor Author

Also, maybe NullableArrays 0.1 could serve as a deprecation period for Julia 0.4 support?

@nalimilan
Copy link
Member

One of the "philosophical" issues I'd like to add to the list is whether NullableArrays are an efficient replacement for Array{Nullable}, or whether they behave differently; and if the latter, what clear rules we should follow to make divergences predictable for users.

@davidagold
Copy link
Contributor Author

@nalimilan IMO, NullableArrays are mostly efficient replacement for Array{Nullable} and should mostly have identical behavior. The exceptions I'm thinking of are things like map, reduce, mean, etc. in which I think we should supply lifting semantics by default. To an extent, I am using this package as a soapbox for how I think lifting is best achieved until it's supported by the compiler (if that ever happens): automatically inside the context of functions that act containers of Nullables, but not in methods that are re-defined over Nullables themselves. Everything else should be like Array{Nullable}.

@nalimilan
Copy link
Member

The exceptions I'm thinking of are things like map, reduce, mean, etc. in which I think we should supply lifting semantics by default.

Yes, but then were does this list end? It's not great to tell users that they have to learn a (possibly evolving) list of functions which don't behave normally.

@davidanthoff
Copy link

I just wrote the same thing in a different PR, but given how non-related that PR is it might be worth repeating this here: I'm really not a fan of a design where the semantics of a function change depending on the call context. I think the principle that it is enough to look at the function name and the types of the arguments to figure out which method will be called is a good one, I really wouldn't want a system where you also need to figure out in what context the function is called to determine its behavior.

@davidagold
Copy link
Contributor Author

Yes, but then were does this list end? It's not great to tell users that they have to learn a (possibly evolving) list of functions which don't behave normally.

Suppose somebody wants to do map(f, X::NullableArray). I don't think "scalar" lifting by method extension is viable, and I don't want to encourage it. This is to say that I won't provide any lifted method f(x::Nullable), and I don't think the user should try to implement it themselves. IMO, using a higher-order lift function is far more general. So I'd much prefer the user do something like map(lift(f), X). But then map(f, X) will be essentially useless without automatic lifting, since, if the user is behaving as I want her to, it will invariably result in a MethodError. I don't think map(f, X) should be useless, hence automatic lifting.

Are those things that I wish to avoid -- i.e., lifting via method extension, uselessness of default map(f, X::NullableArray) -- much worse than a situation in which some higher-order functions automatically lift and some don't? This is hard to say. But I do think that there is a difference between the issue of lifting a higher-order function like map over a NullableArray and lifting over arbitrary "scalar" arguments (by scalar arguments I just mean non-array arguments, so can include tuples).

I'd be fine restricting automatic lifting to a small list of higher-order functions and requiring mean(dropnull(X)) for things like mean. Automatic lifting in higher-order functions, dropnull, and automatically lifting in querying tabular data structures would probably go a long ways. And for any edge cases not covered by those, users can do lift(f, xs...) or pass lift(f) to a higher-order function for which there isn't automatic lifting.

We can also nix automatic lifting and require users to pass lift(f) to things like map. I'm fine with that in principle, but it feels silly to give users map and tell them that map(f, X) just doesn't work.

@davidagold
Copy link
Contributor Author

I'm really not a fan of a design where the semantics of a function change depending on the call context.

In the case of map(f, X::NullableArray), the semantics of f don't change even if we institute automatic lifting. It's really the semantics of map. And note that these are the same semantics for map as what we'd get if the compiler did automatic scalar lifting, which IIUC is something none of us are intrinsically opposed to. So I don't see the trouble in having map(f, X::NullableArray) lower to map(lift(f), X::NullableArray).

@davidanthoff
Copy link

davidanthoff commented Sep 23, 2016

I'm very nervous about any automatic lifting stuff. I would much prefer that we start with the white list approach, i.e. simply add as many methods to functions that work for Nullable scalars as we can. This could be either in Base or in NullableOps. Once we have done that for a couple of months we would have a much better sense whether a) an automatic lifting story would work or not, simply because we would have gone through many, many concrete examples and b) whether there is actually a finite list of functions that need to be lifted that cover most cases. We might find that they all fall into the same pattern and that an automatic lifting would work, or we might find more examples like the 3VL where that is not the case. I don't think such a strategy would lock us in in any way: if it turns out that all these manual methods can be replaced with something automatic, great (that is what essentially happened with the vectorized syntax in Base). On the other hand, if automatic lifting is added and then later discovered to really not work well because we overlooked cases, it would be much, much more difficult to take automatic lifting away again.

@davidagold
Copy link
Contributor Author

@davidanthoff Can you help me to understand your concern? I'm proposing "automatic lifting" in the context of higher-order functions such as map. So, if f(x::T) is an extant method, map(f, X::NullableArray{T}) will lower to map(lift(f), X), where lift(f)(x::Nullable{T}) has the same semantics as the natural extension f(x::Nullable{T}). The difference is that we need only define one lift(f::F) method to cover all cases of lifting a method f over a single argument x::T, as opposed to defining an f(x::Nullable{T}) method for all such f. But then my proposed semantics for map(f, X::NullableArray{T}) are precisely the same as the semantics we'd have if we just defined f(x::Nullable{T}). What would be "much, much more difficult" about taking away "automatic lifting"?

In fact, I think what the above shows is that the notion of "automatic lifting" is, to an extent, a red herring. We're both trying to achieve the same semantics in map(f, X::NullableArray{T}), and one might reasonably describe such semantics as "automatic lifting". One way to achieve these semantics is to "automatically" provide as many lifted methods as you can; another way is to "automatically" lower map(f, X) to map(lift(f), X).

Now, I think you're right that non-standard lifting semantics such as 3VL are a big concern. It's not really clear how best to implement such semantics using lift(f) in a most general way. But again, I think we're, for the most part, trying to achieve the same semantics, so I think we can work on a solution to the lift/3VL problem while others work extending methods by hand.

@davidanthoff
Copy link

Well, one issue is that I haven't seen a solution to the 3VL issue. How would we handle something like map(i->i & false, X::NullableArray{Bool})? This is assuming that the & operator has the proper 3VL semantics.

I worry that there are more cases like 3VL where the correct lifting semantics is not to return a null value if the input is null.

I agree that we want the same semantics for probably 90% of functions. But until we are really certain that we want the same semantics for a 100% I'm hesitant to have things like map do this automatically.

@davidagold
Copy link
Contributor Author

davidagold commented Sep 24, 2016

I don't know. My current solution would be to @collect i in X |> select(i & false), since the macro context allows us to do syntax manipulations that would result in 3VL behavior. Like I said, still an open problem. There are open problems associated with the lifting by method extension approach, too. We can work on solving them simultaneously.

I don't think we'll ever conclude that we want to have the same semantics for 100% of functions. I'm pretty sure that holding out for 100% certainty for anything in any context is paralyzing. We can have these two solutions progress simultaneously, and the costs to doing so seem to be minimal -- if one wants to implement non-standard lifting semantics by method extension, one can always just do map(f, X, lift=false). I haven't yet seen any egregious problems that would be encountered by offering automatic lifting in higher-order functions.

@davidanthoff
Copy link

Maybe the even more relevant example would be map(i->isnull(i) ? something : somethingelse, X::NullabelArray{T}). That would also not work, right? I would find it highly confusing if that didn't work...

There are open problems associated with the lifting by method extension approach, too.

What are those? Other than a lot of work?

@davidagold
Copy link
Contributor Author

davidagold commented Sep 24, 2016

That would also not work, right?

Not unless one set lift=false, which is not ideal but okay to me.

What are those? Other than a lot of work?

There's everything that was mentioned in JuliaLang/METADATA.jl#6468. Those are real concerns.

Here's another problem. An objective, as I understand it, of lifting by method extension is to find a minimal set of methods to lift (i.e. for which to define lifted methods) that would support a majority of user-defined methods. Thus, if we had a definition for (*){T}(x::Nullable{T}, y::Nullable{T}, one could define f(x,y) = x * y and not have to define a lifted method for f. The lifting carries through this method for f since it is defined generically. The converse(?) of this is that, in order for user-defined functions to take advantage of methods that are pre-lifted, so to speak, they must be defined generically. You can't give f(x::Int, y::Int) and f(x::String, y::String) different implementations and lift them both by relying on lifted methods used in those implementations. In order to call f on an argument with signature (x::Nullable{Int}, y::Nullable{Int}) or (x::Nullable{String}, y::Nullable{String}) one can either (i) define those methods or (ii) define a generic f(x, y) method and choose one implementation for it. To summarize the predicament, if you're going to rely on a minimal set of lifted operators to support generic lifting of user-defined functions, those user-defined functions essentially have to give up much of multiple dispatch.

Is this enough of a problem to preclude the viability of lifting by method extension? I don't know. Is having to do map(i -> isnull(i) ? ..., lift=false) enough to preclude the viability of lowering to lift(f) by default? I don't know. But I think we can explore both options at the same time.

@davidanthoff
Copy link

So in the end the semantics of map, a query expression (I assume your example with @collect is jplyr syntax?) and a generator or comprehension would all be different if the source is a NullableArray, right? map would lift everything, the query expression would lift some things but leave other expressions like isnull or 3VL alone, and a generator would not lift anything. I really hope we don't end up with that kind of design, I think having all these special cases that no one can figure out without reading the manual is something we should try to avoid. I think in a well designed system all three of these examples would exhibit exactly the same behavior.

There's everything that was mentioned in JuliaLang/METADATA.jl#6468. Those are real concerns.

The only thing that came up there is that lifted methods for base functions should not live in packages. Isn't that easily solved by eventually moving these into base?

An objective, as I understand it, of lifting by method extension is to find a minimal set of methods to lift (i.e. for which to define lifted methods) that would support a majority of user-defined methods.

I never thought that was the goal. I think the white listing approach means that if you have a user-defined function and you want it to work with Nullable, you have to add another method. Having said that, one could probably define a macro like this

@lifted function foo(a::AbstractString)
    # do something
end

that would automatically add a lifted method to foo that has a::Nullable{AbstractString} as the argument, which seems a pretty simple way for function authors to add lifted versions of their methods.

One thing that would make sense imo is to have the lift optional keyword argument for map, but have the default value be false. I think that could be useful, and at the same time it would not lead to all these differences between map, queries, generators and comprehensions.

@davidagold
Copy link
Contributor Author

I really hope we don't end up with that kind of design, I think having all these special cases that no one can figure out without reading the manual is something we should try to avoid.

Comprehensions are supposed to always return Arrays, IIUC. If that's the case, they won't ever have precisely the same behavior as map over a NullableArray. As far as map and queries go, I'm not too concerned about their having distinct semantics. So I think this is just a point where you and I differ in opinion.

which seems a pretty simple way for function authors to add lifted versions of their methods.

Sure, this might work.

One thing that would make sense imo is to have the lift optional keyword argument for map, but have the default value be false.

Sure. The more I think about this, the more I'm on board with it. Explicitly requiring map(lift(f), X) could be a good thing.

@nalimilan
Copy link
Member

One thing that would make sense imo is to have the lift optional keyword argument for map, but have the default value be false.

Sure. The more I think about this, the more I'm on board with it. Explicitly requiring map(lift(f), X) could be a good thing.

I think that would be a reasonable starting point at least. That way, NullableArray and Array{Nullable} would behave the same by default. Then if it turns out to be too annoying (and we don't find another convenient approach), we can always change to lifting by default.

But map(f, x::NullableArray) should probably return a NullableArray when f returns a Nullable anyway. That doesn't break the AbstractArray interface.

@davidanthoff
Copy link

But map(f, x::NullableArray) should probably return a NullableArray when f returns a Nullable anyway. That doesn't break the AbstractArray interface.

Agreed. For complete consistency I almost feel map(f, x::Array) should return a NullableArray if f returns a Nullable, i.e. even if x is not a NullableArray. But obviously that would require NullableArray to live in base... So maybe something to consider way, way in the future ;)

@nalimilan
Copy link
Member

Maybe add #167 to the list, since it could give surprising results when concatenating data frames mixing Array and NullableArray.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants