add ==, isequal <, and isless for DataFrameRow and GroupKey #2669

Merged on Mar 25, 2021 (16 commits)

@bkamins (Member) commented Mar 22, 2021

Fixes #2639.

As usual, before I add tests and update documentation please have a look at the implementation.

It would be good to have a decision on JuliaLang/julia#40142 before finalizing this PR.

In particular, we fix a bug where comparing two DataFrameRow objects with == returned true whenever they were the same row of the same data frame, even if the row contained missing (in that case missing should be returned).
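A minimal sketch of the behavior described above, assuming DataFrames.jl 1.0 or later (the release this PR targets). The data frame and its column names are made up for illustration; `==` follows three-valued logic, while `isequal` treats `missing` as equal to itself.

```julia
using DataFrames

# Illustrative data: the first row contains a missing value.
df = DataFrame(a=[1, 2], b=[missing, 3])
r = df[1, :]                     # a DataFrameRow

# == propagates missing, even when comparing a row with itself.
same_row = r == r                # missing

# isequal never returns missing; it treats missing as equal to missing.
same_row_isequal = isequal(r, r) # true

# A row without missing values compares to a NamedTuple as usual.
full_row = df[2, :] == (a=2, b=3) # true
```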

@bkamins bkamins added this to the 1.0 milestone Mar 22, 2021
@bkamins bkamins added the breaking The proposed change is breaking. label Mar 22, 2021
@bkamins (Member Author) commented Mar 22, 2021

Ah - and it is mildly breaking, as we now define hash for DataFrameRow and GroupKey to be consistent with the hash of the corresponding NamedTuple.

This means in particular that this hash is different than rowhash used in grouping (we could make rowhash consistent with the new hash - @nalimilan - do you think it is worth doing?).
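To illustrate why the hash has to agree with the NamedTuple hash once the two types can be `isequal`, here is a sketch in plain Julia. `RowLike` is a hypothetical stand-in type invented for this example, not the DataFrames.jl implementation; it only shows the general contract that `isequal(x, y)` should imply `hash(x) == hash(y)`.

```julia
# Hypothetical row-like wrapper around a NamedTuple (illustration only).
struct RowLike{T<:NamedTuple}
    nt::T
end

# Make RowLike isequal to the NamedTuple it wraps, in both argument orders.
Base.isequal(r::RowLike, nt::NamedTuple) = isequal(r.nt, nt)
Base.isequal(nt::NamedTuple, r::RowLike) = isequal(nt, r.nt)
Base.isequal(r1::RowLike, r2::RowLike) = isequal(r1.nt, r2.nt)

# Delegate hashing so that isequal objects share the same hash.
Base.hash(r::RowLike, h::UInt) = hash(r.nt, h)

r = RowLike((a=1, b=missing))
d = Dict((a=1, b=missing) => "found")

# Lookup works across the two types only because both isequal and hash agree.
found = haskey(d, r)
```

Without the `hash` method, the lookup could silently fail even though `isequal` reports the two objects as equal.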

@bkamins bkamins linked an issue Mar 23, 2021 that may be closed by this pull request

@nalimilan (Member) commented:

> This means in particular that this hash is different than rowhash used in grouping (we could make rowhash consistent with the new hash - @nalimilan - do you think it is worth doing?).

We probably don't care about this since it's internal, right?

@bkamins (Member Author) commented Mar 23, 2021

> We probably don't care about this since it's internal, right?

Yes - I have checked that we use rowhash only in one place in the code so it should be OK.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins (Member Author) commented Mar 23, 2021

Also rowhash ignores column names for speed, so indeed they do not have to match.
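A rough sketch of what "ignores column names" means in practice. `values_only_hash` is a hypothetical helper written for this illustration, loosely following the idea of hashing a row from a tuple of column vectors; it is not the internal `rowhash` itself.

```julia
# Hypothetical helper: fold the hash over the values in row `r` of each
# column. Column names never enter the computation.
function values_only_hash(cols::Tuple{Vararg{AbstractVector}}, r::Int,
                          h::UInt = zero(UInt))
    for col in cols
        h = hash(col[r], h)
    end
    return h
end

cols_ab = ([1, 2], ["x", "y"])   # think of columns named :a and :b
cols_cd = ([1, 5], ["x", "z"])   # think of columns named :c and :d

# The first rows hold the same values, so the value-only hashes agree.
same_hash = values_only_hash(cols_ab, 1) == values_only_hash(cols_cd, 1)

# NamedTuples with different names are not isequal, so a NamedTuple-style
# hash has no reason to match across different column names.
same_nt = isequal((a=1, b="x"), (c=1, d="x"))
```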

@bkamins changed the title from "add ==, isequal and isless for DataFrameRow and GroupKey" to "add ==, isequal <, and isless for DataFrameRow and GroupKey" on Mar 24, 2021
@bkamins (Member Author) commented Mar 25, 2021

OK - I hope I fixed everything :). This was hard (but hopefully Julia Base will now also be more consistent in how it handles <).

bkamins and others added 2 commits March 25, 2021 14:41
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins (Member Author) commented Mar 25, 2021

@pdeffebach - this is what we concluded is a good design. Could you please comment on it before merging (as you were against it)?

@pdeffebach (Contributor) commented:

Sounds good! I'm fine with this.

# table columns are passed as a tuple of vectors to ensure type specialization
rowhash(cols::Tuple{AbstractVector}, r::Int, h::UInt = zero(UInt))::UInt =
    hash(cols[1][r], h)
function rowhash(cols::Tuple{Vararg{AbstractVector}}, r::Int, h::UInt = zero(UInt))::UInt

Member Author commented:

It is interesting that we do not have this covered by tests. I will add some.

Member Author commented:

It turns out we do not use these functions. I have removed them. @nalimilan - could you please check whether we would ever need them? (I have never used them.)

Member commented:

AFAICT one guy removed the last use of findrow and group_rows in a recent PR. :-D #2641

It's weird that coverage didn't notice this. Maybe we only looked at the changed code and not at unrelated parts?

Member Author commented:

There were tests that checked correctness of internal functions.

ngroups, rhashes, gslots, sorted =
    row_group_slots(ntuple(i -> df[!, i], ncol(df)), Val(true), groups, false, false)
rperm, starts, stops = compute_indices(groups, ngroups)
return RowGroupDict(df, rhashes, gslots, groups, rperm, starts, stops)

Member commented:

I think RowGroupDict can also be removed.

Member Author commented:

Indeed - I am running tests to double check. It is astonishing how much we have reworked internally in this release.

Member Author commented:

OK - all seems clean. I will merge the PR after CI passes.

Member commented:

Now that this file contains only grouping code, we will be able to move it to the corresponding folder and rename it without touching anything else. :-)

Member Author commented:

OK will move it in this PR as otherwise we will forget. Also nonunique uses it, but I think it is not a problem.

@bkamins bkamins merged commit 1d3f31b into main Mar 25, 2021
@bkamins bkamins deleted the nt_like_types branch March 25, 2021 21:00
@bkamins (Member Author) commented Mar 25, 2021

Thank you!

@pdeffebach (Contributor) commented:

This allows

(a = 1, b = 2) in eachrow(df)

right?

@bkamins (Member Author) commented Mar 25, 2021

Right, and even more, e.g.:

julia> df = DataFrame(a=1:5, b=2:6)
5×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
   2 │     2      3
   3 │     3      4
   4 │     4      5
   5 │     5      6

julia> (a=1, b=2) in eachrow(df)
true

julia> Ref((a=4, b=4)) .< eachrow(df)
5-element BitVector:
 0
 0
 0
 1
 1
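As a follow-up sketch in the same spirit, the comparison can also be used as a predicate to locate rows. This assumes DataFrames.jl 1.0 or later and reuses the data frame from the session above; `>((a=4, b=4))` is the standard partially applied comparison from Base, so each row is tested with `row > (a=4, b=4)`, which is the same lexicographic comparison as in the broadcast example.

```julia
using DataFrames

df = DataFrame(a=1:5, b=2:6)

# Indices of rows that are lexicographically greater than (a=4, b=4).
idx = findall(>((a=4, b=4)), eachrow(df))

# Select those rows as a new data frame.
selected = df[idx, :]

# Membership test from the conversation above.
has_row = (a=1, b=2) in eachrow(df)
```

Here `idx` picks out rows 4 and 5, matching the positions marked 1 in the BitVector above.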

Labels: breaking, bug, feature, grouping

Successfully merging this pull request may close these issues:
DataFrameRow and NamedTuple comparisons
GroupKey should be comparable between DataFrames
3 participants