add a `leftjoin!` (or `match!` or `merge!` or whatever it should be called) #2259

ExpandingMan · 2020-05-15T16:51:19Z

I very frequently find myself needing to do

df = leftjoin(df, rdf, on=cols, validate=(false, true))

It's not that big a deal to have a new dataframe (after all, the underlying columns weren't copied) but I find myself having to do this so often (this is probably one of the most commonly used operations for me) that I would really love a mutating function for this. Maybe something like

match!(df, rdf, on=cols)

I started to look into what would be involved in doing this, but _join is pretty monolithic so it wasn't immediately obvious. I'd think that the biggest subtlety would be whether the ordering of rows in df can change (but I don't think they can even as things are?).

Thoughts? Am I the only one who does this all the time?

The text was updated successfully, but these errors were encountered:

bkamins · 2020-05-15T18:36:55Z

First we should finalize the API for #2243 as this is related. When we have a decision on "update" stuff, then this can be implemented at the same time.

leftjoin! should be relatively easy to add and it should just be leftjoin but in-place for its first argument.

ExpandingMan · 2020-05-15T18:41:38Z

If you're open to doing it before 1.0, I'd be happy to work on it after #2243 is finalized, because, like I said, I really use this quite a lot. If you think of it, feel free to post here again when #2243 is ready and I'll likely jump on it.

bkamins · 2020-05-15T20:00:50Z

Sure! I think 1.0 release will be in the end of 2020 so there is plenty of time for it. Thank you!

Note though that for performance reasons this should be a special implementation anyway (as we will not copy left data frame and we will only need to compute the indices for the right data frame)

ExpandingMan · 2020-05-15T20:09:02Z

Yeah, if and when I do it I might look into breaking up _join into some more convenient and re-usable pieces, if you guys are open to that (don't necessarily need to discuss now or here).

bkamins · 2020-05-15T20:11:36Z

Well - we will probably redesign _join anyway. It has a sub-par performance now, but we are not yet sure what we should do to improve it.

ExpandingMan · 2020-05-16T18:00:24Z

I was really tempted to start messing around with _join, but I told myself I have to stay focused and finish what I was already working on. Maybe I'm just being naive, but I feel like it should be easier than split-apply-combine anyway.

I'd really like to rewrite the joins to use groupby, then perform set operations on the keys and then somehow concatenate the views (into Arrays) given by the groupby. Anyway, I'll open a separate issue if I ever really start digging into rewriting the joins, but more likely one of you guys will get to it before I do.

bkamins · 2020-05-16T18:54:47Z

Actually I was considering to define *join for GroupedDataFrame objects (maybe as an additional definition). However, my initial testing has shown that it was slower to create two GroupedDataFrames and then join them than what we have now (but I might have made some inefficient implementation).

However, if we can make it fast, the additional benefit of such an approach would be that doing several joins using the same data frame would be much faster (groups have to be computed only once).

So in general: this direction is very promising and I would like to jointly explore it more soon!

ExpandingMan · 2020-05-16T19:12:36Z

Ok, interesting to hear we had the same thought about this.

I can't think of any reason why it couldn't be efficient. The only potential issue that I thought of with this approach that concerned me is that it might be less efficient to generate the group keys and then hash them rather than just calling some dead-simple broadcasted hash function on the rows before even getting into the groupby stuff. I would think that the most expensive part of all of this is the actual concatenation (we clearly don't want views for the joins), and I don't see any reason why that couldn't be just as fast using groupby than by going a more direct route. I think there may also be some extra "fluff" in the user-facing groupby function to make it nicer to work with and safer that we might strip out for extra efficiency in joins, but I wouldn't think any of that would actually be very expensive.

It seems to me that a groupby is a subroutine of a join, so even if we find some reason that using the existing groupby would be too slow, it might be worth it to modify groupby (admittedly quite tricky since that operation is already impressively fast) or add an alternative groupby algorithm that can be used by joins. As you implied, it would be really cool to give users the ability to do joins on GroupedDataFrames.

bkamins · 2020-05-16T19:58:59Z

As I think of it I see the following performance optimization:

do groupby on left data frame (possibly keeping some extra metadata)
then when doing groupby on the right data frame, but do not do it anew, but "as if" it were further rows of a bigger data frame. In this way you get proper matching of groups between two data frames "for free". Actually for the right data frame you can store lists of rows matching each group number in the left data frame. In this way you can do everything in one pass.

ExpandingMan · 2020-05-16T20:29:12Z

Hm, interesting, but wouldn't that require large modifications to the groupby code since it's not currently set up to track locations across multiple dataframes? I suppose you could create some sort of lazily concatenated DataFrame type instead (or perhaps you mean horizontally concatenating which is more-or-less free, but I don't think would handle the final concatenations of the joined rows for you). I see that this would allow you to easily get your joins just by calling combine, however I don't really understand what real performance benefits it would give you over calling groupby on the two dataframes separately (though I haven't looked at the groupby code very deeply, so I'm ignorant on this subject).

bkamins · 2020-05-16T20:55:59Z

The benefit is. That by calling two groupby you create two vectors, say:

groups1 = [1,1,2,3,4,5,2,1]
groups2 = [1,2,1,2,3,1]

and after this you still need to do mapping between group numbers in groups1 and groups2, e.g. to determine if 1 in groups1 has some matching number in groups2, and if yes, what it would be.
And even after you have these mappings you still need to perform proper row matching.

On the other hand if you are doing it in a single processing step, you do not need to create groups2 at all, but create an array (probably this is not the best data structure, but it is simplest so I use it to show the idea):

groups1 = [1,1,2,3,4,5,2,1]
rows2groups1 = [[1,2], [5], [3, 4], [], []]

telling you which rows of the second data frame map to which groups of the first data frame. Out of such a data structure it is really easy to properly resize first data frame (you just look at the lengths of rows2groups1 entries and the only special case is if the length is 0 in which case you use length 1 ). For the second data frame you just concatenate the vectors from rows2groups1 matching groups in groups1, with the special case that if the vector has length 0 you instead fill-in with missing values.

bkamins added the feature label May 15, 2020

bkamins added this to the 1.x milestone May 15, 2020

bkamins added the joins label May 8, 2021

bkamins mentioned this issue May 8, 2021

add a keyword to allow specifying target row order in joins #2753

Closed

bkamins modified the milestones: 1.x, 1.3 Aug 25, 2021

bkamins mentioned this issue Aug 26, 2021

Bk/add leftjoin! #2843

Merged

bkamins closed this as completed in #2843 Sep 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add a `leftjoin!` (or `match!` or `merge!` or whatever it should be called) #2259

add a `leftjoin!` (or `match!` or `merge!` or whatever it should be called) #2259

ExpandingMan commented May 15, 2020

bkamins commented May 15, 2020

ExpandingMan commented May 15, 2020

bkamins commented May 15, 2020

ExpandingMan commented May 15, 2020

bkamins commented May 15, 2020

ExpandingMan commented May 16, 2020

bkamins commented May 16, 2020

ExpandingMan commented May 16, 2020 •

edited

Loading

bkamins commented May 16, 2020

ExpandingMan commented May 16, 2020

bkamins commented May 16, 2020

add a leftjoin! (or match! or merge! or whatever it should be called) #2259

add a leftjoin! (or match! or merge! or whatever it should be called) #2259

Comments

ExpandingMan commented May 15, 2020

bkamins commented May 15, 2020

ExpandingMan commented May 15, 2020

bkamins commented May 15, 2020

ExpandingMan commented May 15, 2020

bkamins commented May 15, 2020

ExpandingMan commented May 16, 2020

bkamins commented May 16, 2020

ExpandingMan commented May 16, 2020 • edited Loading

bkamins commented May 16, 2020

ExpandingMan commented May 16, 2020

bkamins commented May 16, 2020

add a `leftjoin!` (or `match!` or `merge!` or whatever it should be called) #2259

add a `leftjoin!` (or `match!` or `merge!` or whatever it should be called) #2259

ExpandingMan commented May 16, 2020 •

edited

Loading