On large dataframes, innerjoin(left, right; on=cols) runs out of memory, so I end up doing:
left.leftrow = 1:nrow(left)
right.rightrow = 1:nrow(right)
ix = innerjoin(left[:, vcat(cols, :leftrow)], right[:, vcat(cols, :rightrow)]; on=cols)
joined = hcat(left[ix.leftrow, :], right[ix.rightrow, :])
Two problems arise:
(1) This is slightly inconvenient. It would be nicer if I could do this without all the index hacking.
(2) This doesn't work on leftjoin because ix.rightrow contains missings so right[ix.rightrow, :] fails.
I wonder what you think about these issues.
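For problem (2), one possible workaround is to index `right` only at the matched positions and fill the remaining rows with `missing`. The following is a minimal sketch, not an official DataFrames.jl recipe. The tables, the column names (`id`, `id2`, `x`, `y`) and the intermediate `rightpart` frame are made up for illustration.

```julia
using DataFrames

# Hypothetical example tables; row 2 of `left` has no match in `right`.
left = DataFrame(id = [1, 2, 3], x = ["a", "b", "c"])
right = DataFrame(id2 = [1, 3], y = [10.0, 30.0])

# Row-number columns that carry the match positions through the join.
left.leftrow = 1:nrow(left)
right.rightrow = 1:nrow(right)

# Join only the key and row-number columns; `!` avoids copying them.
ix = leftjoin(left[!, [:id, :leftrow]], right[!, [:id2, :rightrow]],
              on = :id => :id2)

# `ix.rightrow` is `missing` wherever a left row had no match.
matched = .!ismissing.(ix.rightrow)
rows = collect(skipmissing(ix.rightrow))  # plain Vector{Int}, safe for indexing

# Build the right-hand columns, leaving `missing` in unmatched rows.
rightpart = DataFrame()
for name in names(right, Not(:rightrow))
    col = Vector{Union{Missing, eltype(right[!, name])}}(missing, nrow(ix))
    col[matched] = right[rows, name]
    rightpart[!, name] = col
end

joined = hcat(left[ix.leftrow, :], rightpart)
```

The rows of `joined` follow the order of `ix`, which need not match the original order of `left`; sorting by `leftrow` restores it if that matters.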
This is strange (i.e. there must be some memory leak in the join implementation). What you do is conceptually the same as what is done in https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/composer.jl#L242 and the later operations should allocate less than hcat.
So for sure it should be fixed. If you have an easy reproducer it would help to track down the issue.
If this is unexpected then it's possible I'm misreporting. I don't want to hold up any release so I'll close it until I can reproduce.
I ran some quick tests, and plain innerjoin allocated less than your combination (where I replaced : with ! to reduce memory consumption):
julia> @time innerjoin(left, right, on=:id => :id2);
  3.030537 seconds (424 allocations: 1.128 GiB, 39.96% gc time)

julia> @time ix = innerjoin(left[!, vcat(:id, :leftrow)], right[!, vcat(:id2, :rightrow)], on=:id => :id2);
  2.702270 seconds (243 allocations: 1017.748 MiB, 35.03% gc time)

julia> @time hcat(left[ix.leftrow, :], right[ix.rightrow, :]);
  1.067658 seconds (184 allocations: 365.997 MiB, 75.13% gc time)
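For anyone who wants to repeat this kind of comparison, here is a rough sketch of a self-contained reproducer. The data used for the timings above is not shown in the thread, so the table size, the key distribution and the helper names (`make_tables`, `direct_join`, `index_join`) are all assumptions. The numbers it prints will differ from the ones quoted.

```julia
using DataFrames, Random

# Hypothetical synthetic tables; size and key distribution are assumptions.
function make_tables(n; seed = 1234)
    rng = MersenneTwister(seed)
    left = DataFrame(id = rand(rng, 1:n, n), a = rand(rng, n))
    right = DataFrame(id2 = rand(rng, 1:n, n), b = rand(rng, n))
    left.leftrow = 1:nrow(left)
    right.rightrow = 1:nrow(right)
    return left, right
end

# Plain join on the full tables.
direct_join(left, right) = innerjoin(left, right, on = :id => :id2)

# Join on key + row-number columns only, then gather the rows by index.
function index_join(left, right)
    ix = innerjoin(left[!, [:id, :leftrow]], right[!, [:id2, :rightrow]],
                   on = :id => :id2)
    return hcat(left[ix.leftrow, :], right[ix.rightrow, :])
end

left, right = make_tables(10^5)

# Run each variant once first so compilation is not counted below.
direct = direct_join(left, right)
indexed = index_join(left, right)

bytes_direct = @allocated direct_join(left, right)
bytes_indexed = @allocated index_join(left, right)
println("direct join:  ", round(bytes_direct / 2^20, digits = 1), " MiB")
println("index + hcat: ", round(bytes_indexed / 2^20, digits = 1), " MiB")
```

`@allocated` reports total bytes allocated, not peak memory use, so it is only a proxy for whether a join would run out of memory on a given machine.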
Sorry for the noise then!