add a keyword to allow specifying target row order in joins #2753
Comments
I recognize that performance concerns are likely what drove the change, but here is what I was expecting:

    df1.__id1 = Base.OneTo(nrow(df1))
    df2.__id2 = Base.OneTo(nrow(df2))
    df3 = ___join(df1,df2,on=keys)

Use left table's row order and then right: left
Use left table's row order only: semi
Use right table's row order and then left: right

Argument in favor of ordering

leftjoin, semijoin, antijoin, and rightjoin all have a clearly implied direction where one table is joined onto the other table (or only one table remains). A naive implementation of these would simply iterate row-wise over the primary table and look for hash matches in the secondary table. That's where the ordering proposals above come from. innerjoin, outerjoin, and crossjoin are ambiguous because they consider rows for both sides. Without rightjoin, table primacy could be consistently given to the first argument and the intuition for these would be obvious. In fact, the above proposal for the join sort order would actually be the same for all joins if only rightjoin didn't exist. So I guess I went with the leftjoin ordering for inner/outer/cross because I don't use rightjoin, but I admit that it's ambiguous.

Argument against ordering

The ANSI SQL standard does not define a row order resulting from joins. I can only find a free draft from 1992 but section 7.5 doesn't seem to mention it. All join implementation documentation that I have seen is also silent on the issue. It's also obviously more computationally expensive to do ordering and offers no practical benefit if rows are replicated in both tables (like in a crossjoin). But what if rows are replicated in only one table? (i.e. the argument for
|
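To make the "naive implementation" argument above concrete, here is a minimal sketch in plain Julia (no DataFrames.jl involved). It is not how DataFrames.jl implements joins; `naive_leftjoin` and the sample rows are hypothetical names invented for illustration. It builds a hash index on the secondary (right) table, then iterates row-wise over the primary (left) table, so the output comes out in left-table order first and right-table order second without any extra sort.

```julia
# Hypothetical helper, for illustration only: returns (left_row, right_row)
# index pairs, with `nothing` marking a left row that has no match.
function naive_leftjoin(left, right; on::Symbol)
    # Hash index: key value => indices of matching right rows, in right-table order.
    index = Dict{Any, Vector{Int}}()
    for (j, r) in enumerate(right)
        push!(get!(() -> Int[], index, r[on]), j)
    end

    row_pairs = Tuple{Int, Union{Int, Nothing}}[]
    # Iterate over the primary (left) table; this fixes the output order.
    for (i, l) in enumerate(left)
        matches = get(index, l[on], nothing)
        if matches === nothing
            push!(row_pairs, (i, nothing))   # unmatched left row is kept
        else
            for j in matches                 # ties follow right-table order
                push!(row_pairs, (i, j))
            end
        end
    end
    return row_pairs
end

# Made-up sample data: vectors of NamedTuples standing in for two tables.
left  = [(key = 3, a = "c"), (key = 1, a = "a"), (key = 2, a = "b")]
right = [(key = 2, b = 20), (key = 3, b = 30), (key = 3, b = 31)]

row_pairs = naive_leftjoin(left, right; on = :key)
```

Swapping the roles of `left` and `right` gives the "right table's row order and then left" variant, which is why rightjoin is the one case that breaks a single ordering rule.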
@pstorozenko - this is the next thing to think of in joins if you want 😄. There are two layers:
Even starting with the minimal (but most common case), i.e. adding it only to |
Thanks, but I'd like to understand better how to perform efficient joins in the first place. After looking at h2o benchmarks I want to take a deep dive into |
Indeed this also makes a lot of sense. Could you please share a way to contact you directly (my e-mail is bkamins@sgh.waw.pl)? I think it would be good to discuss it, and I could share my experience with you. |
BTW: the benchmarks are currently being re-calculated and I expect improvements for joins in the 5 GB case (currently we are bogged down by GC). Hopefully tomorrow we will have an update on the H2O website (the solution I proposed now does not resolve all problems but it should improve things). |
TODO: if we add this then we should also add |
Bumping this, I ran into this today. I think we can add a keyword argument for preserving order without adding |
Well this issue is super old. We added |
Ahh I forgot about that. Thanks. |
This issue is meant to gather requirements for all join* operations regarding the resulting order of rows after the join operation. After we finish the discussion I will implement it via a kwarg to the appropriate functions (the default will be the fastest way to produce a result; passing a kwarg will allow the user to request a specific order, but we need to learn what orders for what joins would be desirable).
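Independently of any keyword argument, the row-id approach from the comments can be sketched with DataFrames.jl as below. This is only an illustration under assumptions: the sample tables are made up, the helper column names `__id1` and `__id2` follow the comment above, and `innerjoin` stands in for the unspecified `___join`. The idea is to tag each table with its original row number, join in whatever order is fastest, and then sort on the tags.

```julia
using DataFrames

# Made-up sample tables for illustration.
df1 = DataFrame(key = [3, 1, 2], a = ["c", "a", "b"])
df2 = DataFrame(key = [2, 3, 3], b = [20, 30, 31])

# Tag every row with its original position.
df1.__id1 = 1:nrow(df1)
df2.__id2 = 1:nrow(df2)

# Join without relying on any particular output order.
df3 = innerjoin(df1, df2, on = :key)

# Restore "left table's row order, then right" and drop the helper columns.
sort!(df3, [:__id1, :__id2])
select!(df3, Not([:__id1, :__id2]))
```

Sorting on `[:__id2, :__id1]` instead would give the "right table's row order, then left" variant. The extra sort is the cost the issue refers to, which is why the fastest order is proposed as the default.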