Add keep argument to joins #330

erdeyl · 2022-01-15T21:25:19Z

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.

Brief description of the problem

# insert reprex here

markfairbanks · 2022-01-31T19:57:46Z

The basic idea of how to solve this is we can use the j term of [.data.table to select the variables we want. I think this will work for left/right/inner joins, but some extra steps are required for full joins which I don't think will work with dtplyr's lazy workflow.

data.table uses x[i, j, by] syntax. Noting that in a join we're using the x and i arguments, we can select inside j to implement keep. In the examples below note the x. and i. prefixes to help refer to the data.tables we're selecting from. Also note I'm assuming suffix = c(".x", ".y") for these examples.

left_join()

library(data.table)

df1 <- data.table(a = c("a", "b", "c"), b = 1:3)
df2 <- data.table(a = c("a", "b"), c = 1:2)

# left_join with keep = TRUE
df2[df1, .(a.x = i.a, b = i.b, a.y = x.a, c = x.c), on = .(a), allow.cartesian = TRUE]
#>       a.x     b    a.y     c
#>    <char> <int> <char> <int>
#> 1:      a     1      a     1
#> 2:      b     2      b     2
#> 3:      c     3   <NA>    NA

In the case of left joins this will be useful without keep because now we can drop the setcolorder() and setnames() calls and just directly use this selection:

# left_join with keep = FALSE
df2[df1, .(a = i.a, b = i.b, c = x.c), on = .(a), allow.cartesian = TRUE]
#>         a     b     c
#>    <char> <int> <int>
#> 1:      a     1     1
#> 2:      b     2     2
#> 3:      c     3    NA

inner_join()/right_join()

For inner/right joins we'll only need to use the selection syntax if keep = TRUE.

I'll just use an example of a right join here, but we can use the exact same selection syntax for inner joins.

# right_join with keep = TRUE
df1[df2, .(a.x = x.a, b = x.b, a.y = i.a, c = i.c), on = .(a),  allow.cartesian = TRUE]
#>       a.x     b    a.y     c
#>    <char> <int> <char> <int>
#> 1:      a     1      a     1
#> 2:      b     2      b     2

full_join()

I don't think we can implement this in dtplyr because of the lazy workflow. The only way I could figure out how to get it to work required an intermediate data frame that gets distinct join keys from df1 and df2. For this to work we would need to run collect() on both inputs, which would be very expensive in the case of a long pipe chain.

library(data.table)

df1 <- data.table(a = c("a", "b", "c"), b = 1:3)
df2 <- data.table(a = c("a", "b"), c = 1:2)

unique_keys_df <- unique(rbindlist(list(
  df1[, .(a)],
  df2[, .(a)]
)))

# right_join with keep = TRUE
# Note: a__keep__ comes from df2.
  ## Join column is preserved from unique_keys_df for join in the next step
step_df <- df2[unique_keys_df, .(a__keep__ = x.a, c = x.c, a = i.a), on = .(a)]

# Another right_join onto step_df
# Don't need join cols from step_df,
  ## but need to rename join cols originally from df2.
  ## I appended these join cols with __keep__ to show which ones are needed
df1[step_df, .(a.x = x.a, b = x.b, a.y = a__keep__, c = i.c), on = .(a)]
#>       a.x     b    a.y     c
#>    <char> <int> <char> <int>
#> 1:      a     1      a     1
#> 2:      b     2      b     2
#> 3:      c     3   <NA>    NA

markfairbanks · 2022-01-31T20:52:59Z

I used a helper in tidytable called join_selection() to create the selection - that can be found here. I think we can use it here, we would just need to fit it into @mgirlich's join workflow.

markfairbanks added the feature a feature request or enhancement label Jun 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add keep argument to joins #330

Add keep argument to joins #330

erdeyl commented Jan 15, 2022

markfairbanks commented Jan 31, 2022 •

edited

Loading

markfairbanks commented Jan 31, 2022 •

edited

Loading

Add keep argument to joins #330

Add keep argument to joins #330

Comments

erdeyl commented Jan 15, 2022

markfairbanks commented Jan 31, 2022 • edited Loading

markfairbanks commented Jan 31, 2022 • edited Loading

markfairbanks commented Jan 31, 2022 •

edited

Loading

markfairbanks commented Jan 31, 2022 •

edited

Loading