Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupKey and DataFrameRow to NamedTuple #2305

Closed
goretkin opened this issue Jun 28, 2020 · 32 comments · Fixed by #2392
Closed

GroupKey and DataFrameRow to NamedTuple #2305

goretkin opened this issue Jun 28, 2020 · 32 comments · Fixed by #2392
Labels
breaking The proposed change is breaking. doc feature
Milestone

Comments

@goretkin
Copy link

https://juliadata.github.io/DataFrames.jl/stable/lib/types/#DataFrames.DataFrameRow mentions using the copy function to convert a row to a NamedTuple, and it just calls NamedTuple(::DataFrameRow)

To convert a GroupKey into a NamedTuple, there is no copy method, but there is NamedTuple(key::GroupKey)

This strikes me as inconsistent, but perhaps there's a reason. My instinct is that the documentation should just suggest using the NamedTuple(::T) method in both cases.

@goretkin
Copy link
Author

@bkamins bkamins added the doc label Jun 29, 2020
@bkamins bkamins added this to the 1.0 milestone Jun 29, 2020
@bkamins
Copy link
Member

bkamins commented Jun 29, 2020

The reason of the design is that DataFrameRow is a view, so copy should de-reference it (like in other views in Base).
The documentation uses copy to highlight the fact that DataFrameRow should be essentially treated as-if it were a NamedTuple except that it is a view.

So there are two action points to be decided:

  1. we can update the docs to recommend NamedTuple instead of copy. I understand you think it would be better?
  2. Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?

@goretkin
Copy link
Author

we can update the docs to recommend NamedTuple instead of copy. I understand you think it would be better?

Yes, that feels more obvious to me.

Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?

The main surprise for me was:

df = DataFrame(
  a = repeat([1, 2, 3, 4], outer=[2]),
  b = repeat([2, 1], outer=[4]),
  c = 1:8)

gd = groupby(df, :a)

@show haskey(gd, (a=1,))
@show in((a=1,), keys(gd))

producing the inconsistent

haskey(gd, (a = 1,)) = true
(a = 1,) in keys(gd) = false

And it's because

@show keys(gd) .== Ref((a=1,))
@show NamedTuple.(keys(gd)) .== Ref((a=1,))

keys(gd) .== Ref((a = 1,)) = Bool[0, 0, 0, 0]
NamedTuple.(keys(gd)) .== Ref((a = 1,)) = Bool[1, 0, 0, 0]

So the journey started with discovering GroupKey: (a = 1,) doesn't equal (a=1,). So I figured I should convert it to a NamedTuple. I had previously learned to use copy for a DataFrameRow. I didn't see a copy(::GroupKey), but I found NamedTuple(::GroupKey), and then I also found NamedTuple(::DataFrameRow)`.

In my opinion, the docs should recommend NamedTuple instead of copy, and also ensure that ==(::NamedTuple, ::GroupKey) and vice-versa work so that haskey(df, k) == in(k, keys(df))

@bkamins
Copy link
Member

bkamins commented Jun 29, 2020

OK. I will split it to three PRs as this is independent:

  1. documentation for DataFrameRow
  2. adding Tuple, NamedTuple, Vector, and Array methods
  3. adding == and isequal (for this I have written down some notes to think about below)

Indeed we can add == and isequal definitions.

Intuitively I think that in should use isequal test and use a fast path in case of GroupKeys. Or maybe even we should have == use two-valued logic here, but this is disputable. Not sure what is best (for sure this is intuitive to work this way but doing NamedTuple.(::GroupKeys) will use the standard definition of == so it will lead to an inconsistency). But this can be worked out later

There is also a decision if we should do the same with Tuple form and Dict in the future (see #2281). But I think not. We should treat NamedTuple as a canonical representation.

@bkamins bkamins added the breaking The proposed change is breaking. label Jun 29, 2020
@goretkin
Copy link
Author

Thanks!

On a related note, I think having equality between GroupKey: (a = 1,) and (a=1,) means that they should also hash to the same value, too.

The question around isequal vs == has to do with what should happen with missing and NaN, right?

@bkamins
Copy link
Member

bkamins commented Jun 29, 2020

Yes - these are the considerations and they ca get tricky if we want to be efficient at the same time.
That is why I left it for later. Still we can get a problem that gk1 == gk1 is true, but Tuple(gk1) == Tuple(gk1) is missing.
On the other hand if we have two GroupedDataFrames that are identical (but on copied data frames) gk1 == gd2 will be false even if NamedTuple(gk1) == NamedTuple(gk2) is true (we take keys referring to the same group).

In short - some rules have to be defined to what extent we want to have a consistency. Feel free to comment on what you feel would be good if you have an opinion.

In #2308 I cover the stuff that does not require such decisions.

@nalimilan
Copy link
Member

I don't think == should differ from its behavior for (named) tuples. If you want to skip missing values use isequal like everywhere.

@bkamins
Copy link
Member

bkamins commented Jul 7, 2020

I understand what you say, but in gk1 == gk1 we know that even if there is a missing inside (representing unknown value) actually it must be the same as it is the same gk1, so it is not possible that these two objects would differ after the missing got eventually measured.

@goretkin
Copy link
Author

goretkin commented Jul 7, 2020

This might be an elucidating example

julia> in(NaN, keys(Dict(NaN=>:a)))
true

julia> in(NaN, collect(keys(Dict(NaN=>:a))))
false

julia> in(missing, keys(Dict(missing=>:a)))
true

julia> in(missing, collect(keys(Dict(missing=>:a))))
missing

The behavior is not handled in == in other parts of the ecosystem, but in collection of keys. So I agree with @nalimilan that the behavior for == should be consistent with NamedTuples, and instead what should be defined is the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)

@nalimilan
Copy link
Member

Yes, I know it sounds absurd but I'd favor consistency here. If in is defined to use isequal like for AbstractDict (which is the most natural analogy here), we don't really care about == anyway.

@bkamins
Copy link
Member

bkamins commented Aug 2, 2020

Points 1. and 2. from my list are done. So we are left with:

the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)

i.e. we would make sure that in works correctly, but leave == and isequal as they are now. Right?

@nalimilan
Copy link
Member

No I wasn't opposed to adding == and isequal methods. Indeed it sounds logical that copy(k::GroupKey) == k, which calls for equality between GroupKey and NamedTuple. I only objected to defining in in a manner which isn't consistent with == or isequal.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

So to conclude, is this what we want?

  1. direct == and isequal between NamedTuple and GroupKey should effectively convert GroupKey to NamedTuple and fallback to the defined comparison.
  2. direct == and isequal between Tuple and GroupKey should effectively convert GroupKey to Tuple and fallback to the defined comparison.
  3. == and isequal between GroupKey and GroupKey should check the group index in GroupedDataFrame (so it is always conclusive in particular)
  4. in for GroupKeys should have a special implementation that follows keyset rule (i.e. using haskey)
  5. I would then add haskey to GroupKeys that will use isequal for tests.

If yes then @goretkin - let me please know if you want to go forward with this PR. If not I can implement it without a problem.

@nalimilan
Copy link
Member

nalimilan commented Aug 4, 2020

Makes sense. There's one difficulty with transitivity of equality between Tuple and GroupKey, though: two GroupKey objects with different names but same values would compare equal to a tuple, while they wouldn't compare equal to each other. Also, named tuples and tuples don't compare equal. Implementation-wise, since isequal has to be consistent with hash, we can't define isequal so that GroupKey can compare equal to both tuples and named tuple: hash has to be consistent with one type or the other.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

OK - so we should error on Tuple when using ==, isequal to make sure users use only what is properly handled, i.e. NamedTuple (this will avoid mistakes from user's side).

Still haskey and in probably can be supported for Tuple as this is not problematic right?

@nalimilan
Copy link
Member

I'm not aware of any case of == throwing errors, so I'd say no. Otherwise GroupKey might be unusable with some code that expects it to always work.

haskey is OK I'd say, but in would be more problematic since it's tied to the definition of isequal. The inconsistency between haskey and in for tuples isn't great but probably better than the inconsistency between isequal and in.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

So for me to be clear. Can you please summarize the rules for Tuple in ==, isequal, haskey and in that you propose here? Thank you.

@nalimilan
Copy link
Member

No methods for ==, isequal and in (equivalent to always false), but keep the current method for haskey.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

OK - clear. Just to be sure:

current method for haskey.

AFAICT we do not have defined it yet, but I understand what we want to do is to do isequal comparison to all fields of GroupKey and if all pass as true then return true. Right?

@nalimilan
Copy link
Member

Yes. At least that would be consistent with the fact that we allow tuples in getindex.

@goretkin
Copy link
Author

goretkin commented Aug 4, 2020

I have to go back and read more carefully. For now:

Indeed it sounds logical that copy(k::GroupKey) == k

It seems logical, but I don't think this should be a requirement

julia> x = [1, NaN]; copy(x) == x
false

I don't understand what role Tuple should play in this conversation. How I see it is that it's not relevant.

I think there should be a method
in(::NamedTuple, ::GroupKeys)
consistent with
haskey(::GroupDataFrame, ::NamedTuple)

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

I don't understand what role Tuple should play in this conversation.

The role is that we allow to index GroupedDataFrame with Tuple.

In general: yes - please think of it carefully 😄, there is no rush - and it is easy to make a wrong decision, that we would regret later. Thank you!

@nalimilan
Copy link
Member

It seems logical, but I don't think this should be a requirement

Yes, my language was a bit sloppy. What I meant is that copy(k::GroupKey) == k should at least return true in some cases (namely when names and elements are ==).

@nalimilan
Copy link
Member

I think there should be a method
in(::NamedTuple, ::GroupKeys)
consistent with
haskey(::GroupDataFrame, ::NamedTuple)

Yes, we all agree on that AFAICT.

@goretkin
Copy link
Author

goretkin commented Aug 4, 2020

The role is that we allow to index GroupedDataFrame with Tuple.

Ah, okay, I missed that. Thanks for explaining. So then I understand now that this is a question of making getindex(g::GroupedDataFrame, x) and haskey(g::GroupedDataFrame, x) consistent.

Take the definition

function haskey2(c, x)
    try
        c[x]
        return true
    catch
    end
    return false
end

Is it correct to say that currently
haskey2(gd::GroupDataFrame, k::Tuple) could be true while haskey(gd::GroupDataFrame, k::Tuple) would be false?

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

No. They will be the same currently.

To make things precise. Currently for GroupDataFrame we use Dict internally with Tuple as keys, and all behaviour is inherited from this implementation approach.

I understand that we are talking how to transfer these rules to GroupKeys object. And the conclusion is that it should behave the same as Dict that is currently stored inside a GroupedDataFrame. The cost of this is that isequal will be false for two different keys even if haskey passes for both of them when they point to the same group, but we agreed that this is acceptable.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

So we can even define something like (probably with better method signatures):

haskey(gks::GroupKeys, x) = haskey(parent(gks), x)
haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)
in(x, gks::GroupKeys) = haskey(gks, x)

and to not touch definitions of == and isequal.

@nalimilan
Copy link
Member

haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)

I guess you meant to use ===? One issue is that one GroupKey for a given GroupedDataFrame could be equal to a GroupKey for another GroupedDataFrame. In that case since haskey(NamedTuple(gk1), gd2) will be true, it would make sense for haskey(gk1, gd2) to be true as well.

in(x, gks::GroupKeys) = haskey(gks, x)

So as discussed above this would ensure in and haskey are consistent even for x::Tuple, but in would then be inconsistent with isequal. Not sure which one is best...

@bkamins
Copy link
Member

bkamins commented Aug 6, 2020

Ah - these things are tricky. Thinking of it I meant haskey(gks::GroupKeys, x::GroupKey) = parent(gks) === parent(gk), so GroupKey should be tied to index into the GroupedDataFrame it was created from. I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).

in would then be inconsistent with isequal

This is what we currently have with haskey for GroupedDataFrame. I think we do not have to isequal test to pass, as we "overextend" indexing and haskey anyway here by allowing different types to be treated equally (this is not allowed in Base because there most likely haskey is associated with collections that are writable, but GroupedDataFrame is read-only so it is a less problem I think).

@nalimilan
Copy link
Member

I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).

We could throw an error and see whether people complain (probably not).

This is what we currently have with haskey for GroupedDataFrame.

What do you mean? Currently for tuples we can have this:

julia> haskey(gd, (1,))
true

julia> (1,) in keys(gd)
false

@bkamins
Copy link
Member

bkamins commented Aug 6, 2020

We could throw an error and see whether people complain

I think it is OK to throw an error - this is probably going to catch some bugs (and it is easy enough to convert to NamedTuple to have interop between two GroupedDataFrames).

What do you mean?

Exactly, and we would have both of them return true or both return false, but what I mean is:

julia> df = DataFrame(g=1)
1×1 DataFrame
│ Row │ g     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
│ Row │ g     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> haskey(gdf, (1,))
true

julia> haskey(gdf, (g=1,))
true

julia> haskey(gdf, keys(gdf)[1])
true

julia> isequal(keys(gdf)[1], (1,))
false

julia> isequal(keys(gdf)[1], (g=1,))
false

So for GroupedDataFrame we already have that haskey is inconsistent with isequal and this is to be expected.

From a docsting of in the relevant part is I think:

Some collections follow a slightly different definition. (...) To test for the presence of a key in a dictionary, use haskey or k in keys(dict).

So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).

@nalimilan
Copy link
Member

So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).

"Requires" may be a bit too strong. What the docstring says is that inwith some collections like dicts use isequal instead of ==. This matches what haskey does for dicts. But it doesn't say that other collections should do the same.

Anyway, yeah, it's true that consistency with dicts is a worthy goal. Maybe that's more important than consistency between in and isequal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. doc feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants