Rename SubDataFrame columns #3317

Closed
JoaoAparicio opened this issue Apr 18, 2023 · 5 comments

@JoaoAparicio

Currently there is no way (as far as I can tell, please correct me if I'm wrong) to rename the columns of a SubDataFrame, either by mutating the SubDataFrame (not the underlying DataFrame) or by copying the SubDataFrame (but not the underlying DataFrame).

Here's what I've tried so far:

import DataFrames
N = 1_000_000
df = DataFrames.DataFrame(
    :x => [rand(('a','b')) for _ in 1:N],
    :y => 1:N,
    :z => 1:N,
)
gdf = DataFrames.groupby(df, :x);
sdf = gdf |> first;

example 1

This mutates the underlying DataFrame. You can check by running it twice: it fails the second time.
This is not what I want. Also, this only works with some SubDataFrames; see example 3.

@time DataFrames.rename!(sdf, :x => :x2)  # mutates df

example 2

This filters and copies the underlying DataFrame into a new DataFrame.
Also not what I want: I don't want to copy the data.

@time DataFrames.rename(sdf, :x => :x2)  # returns DataFrame even though sdf is SubDataFrame

example 3

rename! only works on some subdataframes

ssdf = sdf[!, [:x, :y]];
@time DataFrames.rename!(ssdf, :x => :a)
ERROR: ArgumentError: rename! is not supported for views other than created with Colon as a column selector
@bkamins
Member

bkamins commented Apr 18, 2023

In general, a SubDataFrame is a view, so by design it cannot have column names that differ from its parent's. You can therefore either rename the parent's columns as well, or return a freshly allocated data frame with the new column names.

Before going to #3318, let us discuss what your use case is, and then we can decide how to meet your needs.

Maybe what you need is:

julia> @time rename!(DataFrame(sdf, copycols=false), :x => :x2);
  0.001350 seconds (53 allocations: 4.094 KiB)

?
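As a minimal sketch of what this suggestion does (the small `df` and the name `renamed` are illustrative assumptions, not from the thread), wrapping the view with `copycols=false` and renaming the wrapper leaves the parent's column names alone:

```julia
using DataFrames

# Small illustrative parent table (hypothetical data)
df = DataFrame(x = ['a', 'b', 'a', 'b'], y = 1:4, z = 5:8)
gdf = groupby(df, :x)
sdf = first(gdf)   # SubDataFrame: a view of one group of rows

# Wrap the view's columns in a new DataFrame without copying them,
# then rename the wrapper's columns in place.
renamed = rename!(DataFrame(sdf, copycols=false), :x => :x2)

names(renamed)   # the wrapper carries the new name :x2
names(df)        # the parent still has :x
```

Only the wrapper's column index changes; the column data is still shared with `df`.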

@bkamins bkamins added this to the 1.6 milestone Apr 18, 2023
@JoaoAparicio
Author

Hi Bogumił! :-) Thanks for asking

So right now our use case is:
1. We have some database which is slow to query.
2. We want to store the db data in Arrow files, but at the same time partition the dataset in a different way and rename some of the columns.
3. We do this by running one big query, putting the result into a DataFrame (think 10M rows, 100 columns), and then running two nested for loops over groupby, with an Arrow.write in the inner loop. Before the write we want to rename some columns. The groupbys are great: they iterate over SubDataFrames. However, to rename columns we tried materializing the views (SubDataFrame |> DataFrame) and then renaming the result. That's bad when the parent df is large.
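A rough sketch of the loop described in step 3, combined with the non-copying `DataFrame(sdf, copycols=false)` wrapper suggested earlier in the thread. The table `big`, its column names, the new name `:value`, and the output directory are all hypothetical stand-ins, and the sketch assumes the Arrow.jl package is installed:

```julia
using DataFrames
using Arrow   # assumption: Arrow.jl is available in the environment

# Hypothetical stand-in for the result of the one big query
big = DataFrame(region = ["eu", "eu", "us", "us"],
                year   = [2022, 2023, 2022, 2023],
                val    = [1.0, 2.0, 3.0, 4.0])

outdir = mktempdir()   # illustrative output location

for by_region in groupby(big, :region)      # outer partition
    for part in groupby(by_region, :year)   # inner partition, a SubDataFrame
        # Non-copying wrapper around the view, renamed in place;
        # the column names of `big` are not affected.
        out = rename!(DataFrame(part, copycols=false), :val => :value)
        file = string(part.region[1], "_", part.year[1], ".arrow")
        Arrow.write(joinpath(outdir, file), out)
    end
end
```

Each inner iteration writes one partition file, and the only per-partition allocation on the DataFrames side is the small wrapper, not a copy of the rows.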

I'm looking at your example and I'm surprised.

Does DataFrame(SubDataFrame) copy data, or not?
I found this [1] so it looks like eachcol(sdf) are Vector views and your approach collects those, so the underlying data isn't copied. That explains why little data is allocated. But then this means that you end up with a DataFrame that contains views? :-)

Anyway, to answer your question: The approach that you presented does indeed solve our problem, we will use it.

(But let me just mention: in our specific use case data copying is the issue, not latency. If latency were the issue, I'd point out that the new approach is about 500x faster, which again isn't our current problem.)

@btime rename!(DataFrame(sdf, copycols=false), :x => :x2)  # 430us
@btime rename(sdf, :x => :x2)  # 720ns

Thanks for your help :-)

[1]

function DataFrame(sdf::SubDataFrame; copycols::Bool=true)
    if copycols
        return sdf[:, :]
    else
        new_df = DataFrame(collect(eachcol(sdf)), _names(sdf), copycols=false)
        _copy_all_note_metadata!(new_df, sdf)
        return new_df
    end
end

@JoaoAparicio
Author

JoaoAparicio commented Apr 18, 2023

I just realized that in situations where you do many renames (as opposed to just one), your approach might actually end up being faster: my approach would copy the df index over and over, whereas yours instantiates a (non-copying) df once and can then rename! it many times.

@bkamins
Member

bkamins commented Apr 18, 2023

> Does DataFrame(SubDataFrame) copy data, or not?

It depends on the copycols argument, as you can see in the method you linked. By default it copies the data, but with copycols=false it stores views of the source columns.
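A small sketch of that difference, using an illustrative four-row table (the names `copied` and `shared` are hypothetical):

```julia
using DataFrames

df = DataFrame(x = ['a', 'b', 'a', 'b'], y = [1, 2, 3, 4])
sdf = view(df, df.x .== 'a', :)   # SubDataFrame over rows 1 and 3

copied = DataFrame(sdf)                   # default copycols=true: freshly allocated columns
shared = DataFrame(sdf, copycols=false)   # columns are views into df's columns

copied.y isa Vector     # true
shared.y isa SubArray   # true

shared.y[1] = 100       # writes through to the parent
df.y[1]                 # now 100
copied.y[1]             # still 1, the copy is independent
```

So the `copycols=false` result is a regular DataFrame whose columns happen to be views, which means mutating its contents also mutates the parent.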

> I'll just point out that the new approach is about 500x faster

Yes, I am aware of this difference. However, since DataFrame is not type stable anyway, anything you later do with it will cost more than nanoseconds. In short: I typically optimize things that cost on the order of seconds, but not things that cost on the order of milliseconds, since such optimizations will not be noticeable most of the time anyway.

If nanosecond speed is needed by the user, it is probably better to switch to type-stable containers.


In summary - can the issue and PR be closed?

@JoaoAparicio
Author

Yes! Thank you very much for the clarifications
