`GroupKey` and `DataFrameRow` to `NamedTuple` #2305

goretkin · 2020-06-28T23:22:25Z

https://juliadata.github.io/DataFrames.jl/stable/lib/types/#DataFrames.DataFrameRow mentions using the copy function to convert a row to a NamedTuple, and it just calls NamedTuple(::DataFrameRow)

To convert a GroupKey into a NamedTuple, there is no copy method, but there is NamedTuple(key::GroupKey)

This strikes me as inconsistent, but perhaps there's a reason. My instinct is that the documentation should just suggest using the NamedTuple(::T) method in both cases.

The text was updated successfully, but these errors were encountered:

goretkin · 2020-06-29T06:08:42Z

xref: https://discourse.julialang.org/t/extracting-row-of-dataframe-directly-as-namedtuple/29380/3

bkamins · 2020-06-29T06:57:32Z

The reason of the design is that DataFrameRow is a view, so copy should de-reference it (like in other views in Base).
The documentation uses copy to highlight the fact that DataFrameRow should be essentially treated as-if it were a NamedTuple except that it is a view.

So there are two action points to be decided:

we can update the docs to recommend NamedTuple instead of copy. I understand you think it would be better?
Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?

goretkin · 2020-06-29T16:22:55Z

we can update the docs to recommend NamedTuple instead of copy. I understand you think it would be better?

Yes, that feels more obvious to me.

Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?

The main surprise for me was:

df = DataFrame(
  a = repeat([1, 2, 3, 4], outer=[2]),
  b = repeat([2, 1], outer=[4]),
  c = 1:8)

gd = groupby(df, :a)

@show haskey(gd, (a=1,))
@show in((a=1,), keys(gd))

producing the inconsistent

haskey(gd, (a = 1,)) = true
(a = 1,) in keys(gd) = false

And it's because

@show keys(gd) .== Ref((a=1,))
@show NamedTuple.(keys(gd)) .== Ref((a=1,))

keys(gd) .== Ref((a = 1,)) = Bool[0, 0, 0, 0]
NamedTuple.(keys(gd)) .== Ref((a = 1,)) = Bool[1, 0, 0, 0]

So the journey started with discovering GroupKey: (a = 1,) doesn't equal (a=1,). So I figured I should convert it to a NamedTuple. I had previously learned to use copy for a DataFrameRow. I didn't see a copy(::GroupKey), but I found NamedTuple(::GroupKey), and then I also found NamedTuple(::DataFrameRow)`.

In my opinion, the docs should recommend NamedTuple instead of copy, and also ensure that ==(::NamedTuple, ::GroupKey) and vice-versa work so that haskey(df, k) == in(k, keys(df))

bkamins · 2020-06-29T17:19:29Z

OK. I will split it to three PRs as this is independent:

documentation for DataFrameRow
adding Tuple, NamedTuple, Vector, and Array methods
adding == and isequal (for this I have written down some notes to think about below)

Indeed we can add == and isequal definitions.

Intuitively I think that in should use isequal test and use a fast path in case of GroupKeys. Or maybe even we should have == use two-valued logic here, but this is disputable. Not sure what is best (for sure this is intuitive to work this way but doing NamedTuple.(::GroupKeys) will use the standard definition of == so it will lead to an inconsistency). But this can be worked out later

There is also a decision if we should do the same with Tuple form and Dict in the future (see #2281). But I think not. We should treat NamedTuple as a canonical representation.

goretkin · 2020-06-29T17:47:42Z

Thanks!

On a related note, I think having equality between GroupKey: (a = 1,) and (a=1,) means that they should also hash to the same value, too.

The question around isequal vs == has to do with what should happen with missing and NaN, right?

bkamins · 2020-06-29T20:03:44Z

Yes - these are the considerations and they ca get tricky if we want to be efficient at the same time.
That is why I left it for later. Still we can get a problem that gk1 == gk1 is true, but Tuple(gk1) == Tuple(gk1) is missing.
On the other hand if we have two GroupedDataFrames that are identical (but on copied data frames) gk1 == gd2 will be false even if NamedTuple(gk1) == NamedTuple(gk2) is true (we take keys referring to the same group).

In short - some rules have to be defined to what extent we want to have a consistency. Feel free to comment on what you feel would be good if you have an opinion.

In #2308 I cover the stuff that does not require such decisions.

nalimilan · 2020-07-07T16:01:57Z

I don't think == should differ from its behavior for (named) tuples. If you want to skip missing values use isequal like everywhere.

bkamins · 2020-07-07T16:56:49Z

I understand what you say, but in gk1 == gk1 we know that even if there is a missing inside (representing unknown value) actually it must be the same as it is the same gk1, so it is not possible that these two objects would differ after the missing got eventually measured.

goretkin · 2020-07-07T18:14:05Z

This might be an elucidating example

julia> in(NaN, keys(Dict(NaN=>:a)))
true

julia> in(NaN, collect(keys(Dict(NaN=>:a))))
false

julia> in(missing, keys(Dict(missing=>:a)))
true

julia> in(missing, collect(keys(Dict(missing=>:a))))
missing

The behavior is not handled in == in other parts of the ecosystem, but in collection of keys. So I agree with @nalimilan that the behavior for == should be consistent with NamedTuples, and instead what should be defined is the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)

nalimilan · 2020-07-07T19:39:13Z

Yes, I know it sounds absurd but I'd favor consistency here. If in is defined to use isequal like for AbstractDict (which is the most natural analogy here), we don't really care about == anyway.

bkamins · 2020-08-02T07:38:15Z

Points 1. and 2. from my list are done. So we are left with:

the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)

i.e. we would make sure that in works correctly, but leave == and isequal as they are now. Right?

nalimilan · 2020-08-04T08:48:28Z

No I wasn't opposed to adding == and isequal methods. Indeed it sounds logical that copy(k::GroupKey) == k, which calls for equality between GroupKey and NamedTuple. I only objected to defining in in a manner which isn't consistent with == or isequal.

bkamins · 2020-08-04T11:34:00Z

So to conclude, is this what we want?

direct == and isequal between NamedTuple and GroupKey should effectively convert GroupKey to NamedTuple and fallback to the defined comparison.
direct == and isequal between Tuple and GroupKey should effectively convert GroupKey to Tuple and fallback to the defined comparison.
== and isequal between GroupKey and GroupKey should check the group index in GroupedDataFrame (so it is always conclusive in particular)
in for GroupKeys should have a special implementation that follows keyset rule (i.e. using haskey)
I would then add haskey to GroupKeys that will use isequal for tests.

If yes then @goretkin - let me please know if you want to go forward with this PR. If not I can implement it without a problem.

nalimilan · 2020-08-04T12:11:00Z

Makes sense. There's one difficulty with transitivity of equality between Tuple and GroupKey, though: two GroupKey objects with different names but same values would compare equal to a tuple, while they wouldn't compare equal to each other. Also, named tuples and tuples don't compare equal. Implementation-wise, since isequal has to be consistent with hash, we can't define isequal so that GroupKey can compare equal to both tuples and named tuple: hash has to be consistent with one type or the other.

bkamins · 2020-08-04T12:20:42Z

OK - so we should error on Tuple when using ==, isequal to make sure users use only what is properly handled, i.e. NamedTuple (this will avoid mistakes from user's side).

Still haskey and in probably can be supported for Tuple as this is not problematic right?

nalimilan · 2020-08-04T13:15:13Z

I'm not aware of any case of == throwing errors, so I'd say no. Otherwise GroupKey might be unusable with some code that expects it to always work.

haskey is OK I'd say, but in would be more problematic since it's tied to the definition of isequal. The inconsistency between haskey and in for tuples isn't great but probably better than the inconsistency between isequal and in.

bkamins · 2020-08-04T13:17:25Z

So for me to be clear. Can you please summarize the rules for Tuple in ==, isequal, haskey and in that you propose here? Thank you.

nalimilan · 2020-08-04T13:29:40Z

No methods for ==, isequal and in (equivalent to always false), but keep the current method for haskey.

bkamins · 2020-08-04T13:44:17Z

OK - clear. Just to be sure:

current method for haskey.

AFAICT we do not have defined it yet, but I understand what we want to do is to do isequal comparison to all fields of GroupKey and if all pass as true then return true. Right?

nalimilan · 2020-08-04T13:56:36Z

Yes. At least that would be consistent with the fact that we allow tuples in getindex.

goretkin · 2020-08-04T14:05:31Z

I have to go back and read more carefully. For now:

Indeed it sounds logical that copy(k::GroupKey) == k

It seems logical, but I don't think this should be a requirement

julia> x = [1, NaN]; copy(x) == x
false

I don't understand what role Tuple should play in this conversation. How I see it is that it's not relevant.

I think there should be a method
in(::NamedTuple, ::GroupKeys)
consistent with
haskey(::GroupDataFrame, ::NamedTuple)

bkamins · 2020-08-04T14:12:12Z

I don't understand what role Tuple should play in this conversation.

The role is that we allow to index GroupedDataFrame with Tuple.

In general: yes - please think of it carefully 😄, there is no rush - and it is easy to make a wrong decision, that we would regret later. Thank you!

nalimilan · 2020-08-04T14:18:44Z

It seems logical, but I don't think this should be a requirement

Yes, my language was a bit sloppy. What I meant is that copy(k::GroupKey) == k should at least return true in some cases (namely when names and elements are ==).

nalimilan · 2020-08-04T14:19:30Z

I think there should be a method
in(::NamedTuple, ::GroupKeys)
consistent with
haskey(::GroupDataFrame, ::NamedTuple)

Yes, we all agree on that AFAICT.

goretkin · 2020-08-04T15:12:20Z

The role is that we allow to index GroupedDataFrame with Tuple.

Ah, okay, I missed that. Thanks for explaining. So then I understand now that this is a question of making getindex(g::GroupedDataFrame, x) and haskey(g::GroupedDataFrame, x) consistent.

Take the definition

function haskey2(c, x)
    try
        c[x]
        return true
    catch
    end
    return false
end

Is it correct to say that currently
haskey2(gd::GroupDataFrame, k::Tuple) could be true while haskey(gd::GroupDataFrame, k::Tuple) would be false?

bkamins · 2020-08-04T15:28:52Z

No. They will be the same currently.

To make things precise. Currently for GroupDataFrame we use Dict internally with Tuple as keys, and all behaviour is inherited from this implementation approach.

I understand that we are talking how to transfer these rules to GroupKeys object. And the conclusion is that it should behave the same as Dict that is currently stored inside a GroupedDataFrame. The cost of this is that isequal will be false for two different keys even if haskey passes for both of them when they point to the same group, but we agreed that this is acceptable.

bkamins · 2020-08-04T15:31:34Z

So we can even define something like (probably with better method signatures):

haskey(gks::GroupKeys, x) = haskey(parent(gks), x)
haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)
in(x, gks::GroupKeys) = haskey(gks, x)

and to not touch definitions of == and isequal.

nalimilan · 2020-08-06T08:41:57Z

haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)

I guess you meant to use ===? One issue is that one GroupKey for a given GroupedDataFrame could be equal to a GroupKey for another GroupedDataFrame. In that case since haskey(NamedTuple(gk1), gd2) will be true, it would make sense for haskey(gk1, gd2) to be true as well.

in(x, gks::GroupKeys) = haskey(gks, x)

So as discussed above this would ensure in and haskey are consistent even for x::Tuple, but in would then be inconsistent with isequal. Not sure which one is best...

bkamins · 2020-08-06T10:47:53Z

Ah - these things are tricky. Thinking of it I meant haskey(gks::GroupKeys, x::GroupKey) = parent(gks) === parent(gk), so GroupKey should be tied to index into the GroupedDataFrame it was created from. I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).

in would then be inconsistent with isequal

This is what we currently have with haskey for GroupedDataFrame. I think we do not have to isequal test to pass, as we "overextend" indexing and haskey anyway here by allowing different types to be treated equally (this is not allowed in Base because there most likely haskey is associated with collections that are writable, but GroupedDataFrame is read-only so it is a less problem I think).

nalimilan · 2020-08-06T14:21:49Z

I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).

We could throw an error and see whether people complain (probably not).

This is what we currently have with haskey for GroupedDataFrame.

What do you mean? Currently for tuples we can have this:

julia> haskey(gd, (1,))
true

julia> (1,) in keys(gd)
false

bkamins · 2020-08-06T16:57:06Z

We could throw an error and see whether people complain

I think it is OK to throw an error - this is probably going to catch some bugs (and it is easy enough to convert to NamedTuple to have interop between two GroupedDataFrames).

What do you mean?

Exactly, and we would have both of them return true or both return false, but what I mean is:

julia> df = DataFrame(g=1)
1×1 DataFrame
│ Row │ g     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
│ Row │ g     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> haskey(gdf, (1,))
true

julia> haskey(gdf, (g=1,))
true

julia> haskey(gdf, keys(gdf)[1])
true

julia> isequal(keys(gdf)[1], (1,))
false

julia> isequal(keys(gdf)[1], (g=1,))
false

So for GroupedDataFrame we already have that haskey is inconsistent with isequal and this is to be expected.

From a docsting of in the relevant part is I think:

Some collections follow a slightly different definition. (...) To test for the presence of a key in a dictionary, use haskey or k in keys(dict).

So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).

nalimilan · 2020-08-09T09:27:17Z

So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).

"Requires" may be a bit too strong. What the docstring says is that inwith some collections like dicts use isequal instead of ==. This matches what haskey does for dicts. But it doesn't say that other collections should do the same.

Anyway, yeah, it's true that consistency with dicts is a worthy goal. Maybe that's more important than consistency between in and isequal.

bkamins added the doc label Jun 29, 2020

bkamins added this to the 1.0 milestone Jun 29, 2020

bkamins added the feature label Jun 29, 2020

bkamins added the breaking The proposed change is breaking. label Jun 29, 2020

bkamins mentioned this issue Jun 29, 2020

make GroupKey API more consistent #2308

Merged

bkamins mentioned this issue Aug 20, 2020

Index to grouped data frame using Dicts #2281

Merged

bkamins mentioned this issue Aug 29, 2020

add in for GroupKeys #2392

Merged

bkamins closed this as completed in #2392 Oct 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`GroupKey` and `DataFrameRow` to `NamedTuple` #2305

`GroupKey` and `DataFrameRow` to `NamedTuple` #2305

goretkin commented Jun 28, 2020

goretkin commented Jun 29, 2020

bkamins commented Jun 29, 2020

goretkin commented Jun 29, 2020

bkamins commented Jun 29, 2020

goretkin commented Jun 29, 2020

bkamins commented Jun 29, 2020

nalimilan commented Jul 7, 2020

bkamins commented Jul 7, 2020

goretkin commented Jul 7, 2020

nalimilan commented Jul 7, 2020

bkamins commented Aug 2, 2020

nalimilan commented Aug 4, 2020

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020 •

edited

Loading

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

goretkin commented Aug 4, 2020 •

edited

Loading

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

nalimilan commented Aug 4, 2020

goretkin commented Aug 4, 2020

bkamins commented Aug 4, 2020

bkamins commented Aug 4, 2020 •

edited

Loading

nalimilan commented Aug 6, 2020

bkamins commented Aug 6, 2020

nalimilan commented Aug 6, 2020

bkamins commented Aug 6, 2020

nalimilan commented Aug 9, 2020

GroupKey and DataFrameRow to NamedTuple #2305

GroupKey and DataFrameRow to NamedTuple #2305

Comments

goretkin commented Jun 28, 2020

goretkin commented Jun 29, 2020

bkamins commented Jun 29, 2020

goretkin commented Jun 29, 2020

bkamins commented Jun 29, 2020

goretkin commented Jun 29, 2020

bkamins commented Jun 29, 2020

nalimilan commented Jul 7, 2020

bkamins commented Jul 7, 2020

goretkin commented Jul 7, 2020

nalimilan commented Jul 7, 2020

bkamins commented Aug 2, 2020

nalimilan commented Aug 4, 2020

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020 • edited Loading

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

goretkin commented Aug 4, 2020 • edited Loading

bkamins commented Aug 4, 2020

nalimilan commented Aug 4, 2020

nalimilan commented Aug 4, 2020

goretkin commented Aug 4, 2020

bkamins commented Aug 4, 2020

bkamins commented Aug 4, 2020 • edited Loading

nalimilan commented Aug 6, 2020

bkamins commented Aug 6, 2020

nalimilan commented Aug 6, 2020

bkamins commented Aug 6, 2020

nalimilan commented Aug 9, 2020

`GroupKey` and `DataFrameRow` to `NamedTuple` #2305

`GroupKey` and `DataFrameRow` to `NamedTuple` #2305

nalimilan commented Aug 4, 2020 •

edited

Loading

goretkin commented Aug 4, 2020 •

edited

Loading

bkamins commented Aug 4, 2020 •

edited

Loading