Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement n_distinct() for multiple arguments using duckdb structs #122

Merged
merged 7 commits into from
Apr 29, 2024

Conversation

lschneiderbauer
Copy link
Contributor

@lschneiderbauer lschneiderbauer commented Mar 23, 2024

This PR attempts to adress #110, implementing n_distinct() making use of duckdb's structs.

I don't have any experience in writing dbplyr SQL translations, in particular I have no idea of the 'best practices', but I tried to mimic as much from the already existing code as I could.

A couple of notes:

  • duckdb's struct fields need to be named. In this case, the naming is irrelevant and I chose to name them 'v1', 'v2', 'v3', ... and so on, depending on the number of arguments passed to n_distinct(). If there is an alternative prefered way to do it, please let me know.
  • I was making use of glue::glue(). Although dbplyr depends on glue, and in this function we are assuming that dbplyr is installed, I am not sure if I am allowed to assume that this function exists. Maybe I should not make use of it?
  • I also added unit tests at places where I thought they make sense. Let me know if I should change them or relocate them, etc..

Thanks! :)

Closes #110

Copy link
Collaborator

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A small integration test (what does n_distinct() actually do with a real dummy table) would be useful too.

Sorry about the conflicts.

R/backend-dbplyr__duckdb_connection.R Outdated Show resolved Hide resolved
@lschneiderbauer
Copy link
Contributor Author

lschneiderbauer commented Mar 30, 2024

A small integration test (what does n_distinct() actually do with a real dummy table) would be useful too.

I added a bunch of tests to check the actual computation.

In the meantime I also realized that my assumption that duckdb would use a na.rm = TRUE semantics in this case is actually wrong: NULLs within structs seem to actually be counted as a single unique entity. The added test results suggest that.
So I removed the wrong check_na_rm() and opted to throw an error if the na.rm = TRUE is passed. Since na.rm = FALSE is the default this seemed appropriate.
If one wanted to implemented a na.rm = TRUE behavior, using the FILTER clause might be a way to go, but according to the documentation this does not work within a windowing context. Also, it's not obvious to me how to check any element in a struct for NULLs with the struct functions that are available so far.

Regarding the use of dbplyr:::glue_sql2(), what should be the way to go: Wait with the PR until such a function becomes exported, reimplement it in this package, or keep using the private function?

@krlmlr
Copy link
Collaborator

krlmlr commented Mar 30, 2024

Thanks. If we must, we can reimplement here, or use plain paste0() ?

@lschneiderbauer
Copy link
Contributor Author

Thanks.

  • I am using the dbplyr::sql() function directly now, together with paste0(), instead of dbplyr:::glue_sql2().
  • cli::cli_abort() is replaced by stop() to get rid of R CMD CHECK warning.
  • I am making use of duckdb's row() function now instead of using the '{ ... }' syntax which makes generating SQL easier. (and also gets rid of glue::glue() to make R CMD CHECK happy)

@codecov-commenter
Copy link

codecov-commenter commented Mar 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.76%. Comparing base (b8eee57) to head (82b4937).
Report is 7 commits behind head on main.

❗ Current head 82b4937 differs from pull request most recent head 243840f. Consider uploading reports for the commit 243840f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #122      +/-   ##
==========================================
+ Coverage   79.70%   79.76%   +0.05%     
==========================================
  Files         108      108              
  Lines        3711     3722      +11     
==========================================
+ Hits         2958     2969      +11     
  Misses        753      753              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lschneiderbauer
Copy link
Contributor Author

@krlmlr Anything else you would like me to improve?

Copy link
Collaborator

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nudge!

Comment on lines +88 to +90
str_struct <- paste0("row(", paste0(list(...), collapse = ", "), ")")

sql(paste0("COUNT(DISTINCT ", str_struct, ")"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgirlich: Can you please help me review this part of the code?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically looks good to me. This also works for diverse user input, e.g. n_distinct(x, a / (b - 1)) or n_distinct(!!x).
One minor thing: you might want to check for empty dots.

pull(summarize(df_na, n = n_distinct(x, y)), n),
5
)
})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to add a test that uses a window.

Comment on lines +88 to +90
str_struct <- paste0("row(", paste0(list(...), collapse = ", "), ")")

sql(paste0("COUNT(DISTINCT ", str_struct, ")"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically looks good to me. This also works for diverse user input, e.g. n_distinct(x, a / (b - 1)) or n_distinct(!!x).
One minor thing: you might want to check for empty dots.

@krlmlr krlmlr changed the title implement n_distinct() for multiple arguments using duckdb structs feat: Implement n_distinct() for multiple arguments using duckdb structs Apr 29, 2024
@krlmlr krlmlr merged commit 5013908 into duckdb:main Apr 29, 2024
23 checks passed
@krlmlr
Copy link
Collaborator

krlmlr commented Apr 29, 2024

Thanks! Adding the extra tests and checks could be done in a small follow-up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: n_distinct() implementation w/ multiple arguments.
4 participants