Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a keyword to allow specifying target row order in joins #2753

Closed
bkamins opened this issue May 7, 2021 · 10 comments · Fixed by #3233
Closed

add a keyword to allow specifying target row order in joins #2753

bkamins opened this issue May 7, 2021 · 10 comments · Fixed by #3233

Comments

@bkamins
Copy link
Member

bkamins commented May 7, 2021

This issue is meant to gather requirements for all join* operations regarding the resulting order of rows after the join operation. After we finish the discussion I will implement it via kwarg to appropriate functions (the default will be the fastest way to produce a result, passing a kwarg will allow user to request a specific order but we need to learn what orders for what joins would be desirable).

@bkamins bkamins added this to the 1.x milestone May 7, 2021
@Byrth
Copy link

Byrth commented May 8, 2021

I recognize that performance concerns are likely what drove the change, but here is what I was expecting:

df1.__id1 = Base.OneTo(nrow(df1))
df2.__id2 = Base.OneTo(nrow(df2))
df3 = ___join(df1,df2,on=keys)

Use left table's row order and then right

left: sort!(df3, ["__id1","__id2"])
inner: sort!(df3, ["__id1","__id2"])
cross: sort!(df3, ["__id1","__id2"])
outer: sort!(df3, ["__id1","__id2"])

Use left table's row order only:

semi: sort!(df3, ["__id1"])
anti: sort!(df3, ["__id1"])

Use right table's row order and then left

right: sort!(df3, ["__id2","__id1"])

Argument in favor of ordering

leftjoin, semijoin, antijoin, and rightjoin all have a clearly implied direction where one table is joined onto the other table (or only one table remains). A naive implementation of these would simply iterate row-wise over the primary table and look for hash matches in the secondary table. That's where their ordering proposals above come from.

innerjoin, outerjoin, and crossjoin are ambiguous because they consider rows for both sides. Without rightjoin, table primacy could be consistently given to the first argument and the intuition for these would be obvious. In fact, the above proposal for the join sort order would actually be the same for all joins if only rightjoin didn't exist. So I guess I went with the leftjoin ordering for inner/outer/cross because I don't use rightjoin, but I admit that it's ambiguous.

Argument against ordering

The ANSI SQL standard does not define a row order resulting from joins. I can only find a free draft from 1992 but section 7.5 doesn't seem to mention it. All join implementation documentation that I have seen is also silent on the issue.

It's also obviously more computationally expensive to do ordering and offers no practical benefit if rows are replicated in both tables (like in a crossjoin).

But what if rows are replicated in only one table? (i.e. the argument for leftjoin!)

Maintaining row order in one table would allow for a copycols=false-like keyword argument to joins where you re-use that table's vectors if possible. This can save a bunch of allocations if one is, as I typically am, joining a very large table to smaller ones with no row replication in the very large table.

Joins in DataFrames.jl copy all vectors, it used to be much more efficient to do something like what is in the leftjoin!() below because you avoid copying columns <y.

using DataFrames # 0.22.7

nr = 1000000

df1 = DataFrame(a = rand(nr), b = rand(nr), c = rand(nr), d = rand(nr),
    e = rand(nr), f = rand(nr), g = rand(nr), h = rand(nr), i = rand(nr),
     j = rand(nr), k = rand(nr), l = rand(nr), m = rand(nr), n = rand(nr),
    o = rand(nr), p = rand(nr), q = rand(nr), r = rand(nr), s = rand(nr),
    t = rand(nr), u = rand(nr), v = rand(nr), w = rand(nr), x = rand(nr),
    y = rand(['a','b','c'], nr));
df2 = DataFrame(y = ['a','b','c'], z = [1,2,3]);

function leftjoin!(df1, df2; on)
    df_tmp = leftjoin(select(df1, on; copycols=false), df2, on=on); # order of df3 matches df1 if the join is unique
    df1.z = df_tmp.z
    df1
end

df0 = @time copy(df1; copycols=true); # 0.204061 seconds (94 allocations: 186.927 MiB) # only necessary for testing in my case
df3 = @time leftjoin!(df0,df2;on="y"); # 0.360305 seconds (246 allocations: 91.449 MiB)
df3 = @time leftjoin(df1, df2, on="y"); # 0.691530 seconds (589 allocations: 274.568 MiB)

Upgrading to DataFrames 1.1.1, the order of df1 is not preserved in the join so the leftjoin! has to get more complicated:

function leftjoin!(df1, df2; on)
    df1.__id = Base.OneTo(UInt32(nrow(df1)))
    df_tmp = leftjoin(select(df1, union([on],["__id"]); copycols=false), df2, on=on); # order of df3 matches df1 if the join is unique
    sort!(df_tmp, "__id")
    df1.z = df_tmp.z
    select!(df1, Not("__id"))
end

df0 = @time copy(df1; copycols=true); # 0.053088 seconds (322 allocations: 186.940 MiB) # only necessary for testing in my case
df3 = @time leftjoin!(df0,df2;on="y"); # 0.088398 seconds (513 allocations: 53.436 MiB)
df3 = @time leftjoin(df1, df2, on="y"); # 0.123257 seconds (854 allocations: 213.671 MiB)

So the order in particular doesn't matter much to me and even the more complicated wrapper is still way more efficient/faster than it was in 0.22.7, but it used to make it a lot easier to avoid allocations.

If you imagine:

df2 = leftjoin(leftjoin(leftjoin(df, model1, on=keys1), model2, on=keys2), model3, on=keys3)

then only the inner-most leftjoin ever needs to copy columns and the outer two could be leftjoin!(), about a 50% decrease in amount of RAM allocated using the above totals.

Summary

In the course of writing this post, I realized that instead of needing the order to be maintained, I really just need to write the leftjoin!() wrapper for myself and import it. It isn't that much more complicated than before and it is still way more efficient.

@bkamins
Copy link
Member Author

bkamins commented May 8, 2021

the ! version of join is also on a to-do list, see #1334 and #2259.

@bkamins
Copy link
Member Author

bkamins commented Jun 3, 2021

@pstorozenko - this is the next thing to think of in joins if you want 😄. There are two layers:

  • thinking of useful API; currently I think of order kwarg
  • designing an efficient implementation

Even starting with the minimal (but most common case), i.e. adding it only to leftjoin with order=:left would probably cover most of the use cases (and if we had this then we could think of extensions which should follow naturally).

@pstorozenko
Copy link
Contributor

Thanks, but I'd like to understand better how to perform efficient joins in the first place. After looking at h2o benchmarks I want to take a deep dive into polars and data.table way of joining and compare it with DataFrames.jl. This might be very beneficial in the future :)

@bkamins
Copy link
Member Author

bkamins commented Jun 3, 2021

Indeed this also makes a lot of sense. Could you please share a way to contact you directly (my e-mail is bkamins@sgh.waw.pl)? I think it would be good to discuss it and could share you my experience.

@bkamins
Copy link
Member Author

bkamins commented Jun 3, 2021

BTW: the benchmarks are currently re-calculated and I expect improvements for joins in 5GB case (currently we are bogged down by GC). Hopefully tomorrow we will have an update on H2O website (the solution I proposed now does not resolve all problems but it should improve things).

@bkamins
Copy link
Member Author

bkamins commented Jul 29, 2021

TODO: if we add this then we should also add leftjoin! and rightjoin! that would update left/right table respectively.

@bkamins bkamins modified the milestones: 1.x, 1.3 Aug 25, 2021
@pdeffebach
Copy link
Contributor

Bumping this, I ran into this today.

I think we can add a keyword argument for preserving order without adding leftjoin!.

@bkamins bkamins changed the title add a keyword to allow specifying target column order in joins add a keyword to allow specifying target row order in joins Jul 21, 2022
@bkamins
Copy link
Member Author

bkamins commented Jul 21, 2022

Well this issue is super old. We added leftjoin! in the mean time that preserves order. I will mark it for 1.5 release.

@bkamins bkamins modified the milestones: 1.x, 1.5 Jul 21, 2022
@pdeffebach
Copy link
Contributor

Ahh I forgot about that. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants