[BREAKING] Handle zero groups #2324

Merged (11 commits) on Aug 4, 2020

bkamins (Member) commented Jul 22, 2020

Fix #2322
Fix #2297

This is a major fix to split-apply-combine that introduces many internal changes and some breaking user-visible changes.

The main changes:

  • the cols field now holds Symbols instead of Ints; this was not strictly needed, but since select! can mutate the parent of a GroupedDataFrame, storing Symbols avoids invalidating the GroupedDataFrame
  • proper handling of column order in transform! and transform
  • proper handling of cases where 0 groups are processed (the only exception left is combine(arg, ::DataFrame) when the data frame passed has 0 rows, which I leave for later because it is tricky to implement, would only obfuscate the code, and has a very limited use case)
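
As a rough illustration of the zero-groups case described above, here is a minimal sketch. It assumes a DataFrames.jl release that includes this change; the column names a, b, and total are made up for the example. Under that assumption the result should have no rows but still carry the grouping column and the requested output column.

```julia
using DataFrames  # assumption: DataFrames.jl is installed and includes this change

# A data frame with no rows, so grouping it yields zero groups.
df = DataFrame(a = Int[], b = Float64[])
gdf = groupby(df, :a)
ngroups = length(gdf)   # number of groups in the GroupedDataFrame

# Aggregate over zero groups; the output column name :total is illustrative.
res = combine(gdf, :b => sum => :total)

println(ngroups)
println(names(res))
println(nrow(res))
```

The point of the sketch is that downstream code can rely on the column names being present even when the input happens to be empty, which is the convenience discussed later in this thread.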

This is breaking so it will require a minor release to go in.

bkamins (Member, Author) commented Jul 22, 2020

CC @pdeffebach - you might want to test it, as the cases are tricky.

bkamins added the breaking, bug, feature, grouping, and priority labels on Jul 22, 2020
bkamins added this to the 1.0 milestone on Jul 22, 2020

pdeffebach (Contributor) commented

Thanks! I just played around with it and I think this is good. It basically just adds new columns so that the returned data frame has the correct names and types. I think this is convenient behavior since it requires less data validation on the user's side.

bkamins (Member, Author) commented Jul 23, 2020

Thank you for looking into this. I will re-read the whole code before @nalimilan is back online, to make sure we can merge this when he is available.

nalimilan (Member) left a review comment

Thanks. Looks mostly good. I have to trust you regarding the places where you added checks for zero groups as the code is really tricky...

Review thread on src/groupeddataframe/groupeddataframe.jl:
collect(axes(df, 1)), [1], [nrow(df)], 1, nothing,
Threads.ReentrantLock())
return GroupedDataFrame(df, Symbol[], ones(Int, nrow(df)),
nothing, nothing, nothing, nrow(df) == 0 ? 0 : 1,

Reviewer (Member):

Why not continue filling fields with vectors instead of nothing?

bkamins (Member, Author):

Because they can have 0 or 1 element (filling them unconditionally before was a bug). We could now fill them conditionally, as we do for the number of groups, but since filling them later is very cheap anyway I felt that setting them to nothing is OK.

Reviewer (Member):

If computing the actual value here is trivial I'd do it, otherwise I agree it's cheap to compute later.

bkamins (Member, Author):

I would leave it for later. This way the code is more modular; otherwise we hardcode something here and could forget to update it if we change the default way to compute them five years from now.
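
The trade-off discussed in this thread (store nothing now, compute the per-group vectors on first use) can be sketched in plain Julia. This is a toy model, not the actual GroupedDataFrame implementation: the type ToyGrouping, its fields, and the helpers single_group and group_starts! are all hypothetical names introduced for illustration.

```julia
# Hypothetical toy type; it only mimics the idea of lazily filled fields.
mutable struct ToyGrouping
    groups::Vector{Int}                   # group index of each row
    ngroups::Int                          # 0 for an empty table, otherwise 1 here
    starts::Union{Vector{Int}, Nothing}   # first row of each group, filled lazily
end

# Put all rows into a single group, or into zero groups when there are no rows.
single_group(nrows::Int) = ToyGrouping(ones(Int, nrows), nrows == 0 ? 0 : 1, nothing)

# Compute the start indices on first access and cache them afterwards.
function group_starts!(g::ToyGrouping)
    if g.starts === nothing
        g.starts = Int[findfirst(==(i), g.groups)::Int for i in 1:g.ngroups]
    end
    return g.starts
end

empty_case = group_starts!(single_group(0))   # no groups, so no start indices
one_group  = group_starts!(single_group(3))   # one group starting at row 1
```

Because the start indices are derived from ngroups in one place, the constructor does not need to special-case the empty table a second time, which is the modularity argument made above.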

bkamins (Member, Author) commented Jul 26, 2020

Thank you for the review.

> I have to trust you regarding the places where you added checks for zero groups as the code is really tricky ...

I hope I did it right. The changes are in a mix of very old code and new code, so I tried to cover everything in tests.

bkamins (Member, Author) commented Jul 30, 2020

Only coverage fails.

bkamins (Member, Author) commented Aug 1, 2020

I have added the test. Only coverage fails, as usual.

nalimilan (Member) left a review comment

Sorry for the delay!

bkamins (Member, Author) commented Aug 4, 2020

No problem - thank you for looking into it!


bkamins merged commit c9a1329 into JuliaData:master on Aug 4, 2020
bkamins deleted the handle_zero_groups branch on August 4, 2020 15:58

bkamins (Member, Author) commented Aug 4, 2020

Thank you!


bkamins changed the title from "Handle zero groups" to "[BREAKING] Handle zero groups" on Aug 7, 2020