Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keep argument to joins #330

Open
erdeyl opened this issue Jan 15, 2022 · 2 comments
Open

Add keep argument to joins #330

erdeyl opened this issue Jan 15, 2022 · 2 comments
Labels
feature a feature request or enhancement

Comments

@erdeyl
Copy link

erdeyl commented Jan 15, 2022

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.

Brief description of the problem

# insert reprex here
@markfairbanks
Copy link
Collaborator

markfairbanks commented Jan 31, 2022

The basic idea of how to solve this is we can use the j term of [.data.table to select the variables we want. I think this will work for left/right/inner joins, but some extra steps are required for full joins which I don't think will work with dtplyr's lazy workflow.

data.table uses x[i, j, by] syntax. Noting that in a join we're using the x and i arguments, we can select inside j to implement keep. In the examples below note the x. and i. prefixes to help refer to the data.tables we're selecting from. Also note I'm assuming suffix = c(".x", ".y") for these examples.

left_join()

library(data.table)

df1 <- data.table(a = c("a", "b", "c"), b = 1:3)
df2 <- data.table(a = c("a", "b"), c = 1:2)

# left_join with keep = TRUE
df2[df1, .(a.x = i.a, b = i.b, a.y = x.a, c = x.c), on = .(a), allow.cartesian = TRUE]
#>       a.x     b    a.y     c
#>    <char> <int> <char> <int>
#> 1:      a     1      a     1
#> 2:      b     2      b     2
#> 3:      c     3   <NA>    NA

In the case of left joins this will be useful without keep because now we can drop the setcolorder() and setnames() calls and just directly use this selection:

# left_join with keep = FALSE
df2[df1, .(a = i.a, b = i.b, c = x.c), on = .(a), allow.cartesian = TRUE]
#>         a     b     c
#>    <char> <int> <int>
#> 1:      a     1     1
#> 2:      b     2     2
#> 3:      c     3    NA

inner_join()/right_join()

For inner/right joins we'll only need to use the selection syntax if keep = TRUE.

I'll just use an example of a right join here, but we can use the exact same selection syntax for inner joins.

# right_join with keep = TRUE
df1[df2, .(a.x = x.a, b = x.b, a.y = i.a, c = i.c), on = .(a),  allow.cartesian = TRUE]
#>       a.x     b    a.y     c
#>    <char> <int> <char> <int>
#> 1:      a     1      a     1
#> 2:      b     2      b     2

full_join()

I don't think we can implement this in dtplyr because of the lazy workflow. The only way I could figure out how to get it to work required an intermediate data frame that gets distinct join keys from df1 and df2. For this to work we would need to run collect() on both inputs, which would be very expensive in the case of a long pipe chain.

library(data.table)

df1 <- data.table(a = c("a", "b", "c"), b = 1:3)
df2 <- data.table(a = c("a", "b"), c = 1:2)

unique_keys_df <- unique(rbindlist(list(
  df1[, .(a)],
  df2[, .(a)]
)))

# right_join with keep = TRUE
# Note: a__keep__ comes from df2.
  ## Join column is preserved from unique_keys_df for join in the next step
step_df <- df2[unique_keys_df, .(a__keep__ = x.a, c = x.c, a = i.a), on = .(a)]

# Another right_join onto step_df
# Don't need join cols from step_df,
  ## but need to rename join cols originally from df2.
  ## I appended these join cols with __keep__ to show which ones are needed
df1[step_df, .(a.x = x.a, b = x.b, a.y = a__keep__, c = i.c), on = .(a)]
#>       a.x     b    a.y     c
#>    <char> <int> <char> <int>
#> 1:      a     1      a     1
#> 2:      b     2      b     2
#> 3:      c     3   <NA>    NA

@markfairbanks
Copy link
Collaborator

markfairbanks commented Jan 31, 2022

I used a helper in tidytable called join_selection() to create the selection - that can be found here. I think we can use it here, we would just need to fit it into @mgirlich's join workflow.

@markfairbanks markfairbanks added the feature a feature request or enhancement label Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants