Rename SubDataFrame columns #3317

JoaoAparicio · 2023-04-18T16:51:38Z

Currently there is no way (as far as I can tell; please correct me if I'm wrong) to rename the columns of a SubDataFrame, either by mutating only the SubDataFrame (not the underlying DataFrame) or by copying only the SubDataFrame (not the underlying DataFrame).

Here's what I've tried so far:

import DataFrames
N = 1_000_000
df = DataFrames.DataFrame(
    :x => [rand(('a','b')) for _ in 1:N],
    :y => 1:N,
    :z => 1:N,
)
gdf = DataFrames.groupby(df, :x);
sdf = gdf |> first;

example 1

This mutates the underlying DataFrame. You can check by running it twice: it fails the second time.
This is not what I want. Also, it only works with some SubDataFrames; see example 3.

@time DataFrames.rename!(sdf, :x => :x2)  # mutates df

example 2

This filters and copies the underlying DataFrame into a new DataFrame.
Also not what I want: I don't want to copy the data.

@time DataFrames.rename(sdf, :x => :x2)  # returns DataFrame even though sdf is SubDataFrame

example 3

rename! only works on some SubDataFrames

ssdf = sdf[!, [:x, :y]];
@time DataFrames.rename!(ssdf, :x => :a)
ERROR: ArgumentError: rename! is not supported for views other than created with Colon as a column selector


bkamins · 2023-04-18T17:39:51Z

In general, a SubDataFrame is a view, so by design it cannot have column names different from its parent's. So you can either also rename the columns of the parent, or return a freshly allocated data frame with the new column names.
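To make the point about shared names concrete, here is a minimal sketch on toy data (the data frame and column names are made up for illustration, not taken from the issue). It renames a non-grouping column through a view created with all columns and then checks the parent:

```julia
using DataFrames

# Toy data, assumed for illustration only.
df = DataFrame(x = ['a', 'b', 'a', 'b'], y = [1, 2, 3, 4])

# A group of a GroupedDataFrame is a SubDataFrame over all columns.
sdf = first(groupby(df, :x))

# rename! succeeds on this kind of view, but there is only one set of
# column names, so the parent is renamed as well.
rename!(sdf, :y => :y2)

names(sdf)  # ["x", "y2"]
names(df)   # ["x", "y2"]
```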

Before going to #3318 let us discuss what is the use-case you have and then we can decide how to meet your needs.

Maybe what you need is:

julia> @time rename!(DataFrame(sdf, copycols=false), :x => :x2);
  0.001350 seconds (53 allocations: 4.094 KiB)

?
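As a small check of what this suggestion does, the sketch below (toy data, assumed for illustration) wraps a group in a non-copying DataFrame, renames a column on the wrapper, and confirms that the parent keeps its original names:

```julia
using DataFrames

# Toy data, assumed for illustration only.
df = DataFrame(x = ['a', 'b', 'a', 'b'], y = [1, 2, 3, 4])
sdf = first(groupby(df, :x))

# The wrapper is a new DataFrame with its own column names; with
# copycols=false its columns are the views taken from sdf.
renamed = rename!(DataFrame(sdf, copycols=false), :x => :x2)

names(renamed)  # ["x2", "y"]
names(df)       # ["x", "y"], the parent is not renamed
```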

JoaoAparicio · 2023-04-18T18:37:36Z

Hi Bogumił! :-) Thanks for asking

So right now our use case is:
1. We have a database which is slow to query.
2. We want to store the db data in Arrow files, but at the same time partition the dataset in a different way and rename some of the columns.
3. We do this by running 1 big query, putting the result into a DataFrame (think 10M rows, 100 columns), and then running 2 nested for loops over groupby, with an Arrow.write in the inner loop. But before the write we want to rename some columns. The groupbys are great: they iterate over SubDataFrames. However, to rename columns we tried materializing the views (SubDataFrame |> DataFrame) and then renaming the result. That's bad when the parent df is large.
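The workflow described above can be sketched roughly as follows, using the non-copying wrapper suggested earlier in the thread. Everything named here is hypothetical: `export_partitions`, the `sink` callback that stands in for the write step, and the toy column names. It is a sketch under those assumptions, not the code actually used in this use case.

```julia
using DataFrames

# Hypothetical helper: iterate over nested groups, rename columns on a
# non-copying wrapper, and hand each renamed partition to `sink`.
function export_partitions(sink, df, outer_key, inner_key, renames)
    for outer in groupby(df, outer_key)
        for inner in groupby(outer, inner_key)
            # Wrap the view without copying its data, then rename the wrapper.
            part = rename!(DataFrame(inner, copycols=false), renames...)
            sink(part)
        end
    end
    return nothing
end

# Toy data, assumed for illustration only.
df = DataFrame(region = ["n", "n", "s", "s"],
               kind   = ["a", "b", "a", "a"],
               y      = [1, 2, 3, 4])

parts = DataFrame[]
export_partitions(df, :region, :kind, [:y => :value]) do part
    push!(parts, part)   # stand-in for the real write step
end
```

In the use case from this comment, the `sink` would presumably call something like `Arrow.write(path, part)`, with `path` built from the group keys; that part is left out here so the sketch depends only on DataFrames.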

I'm looking at your example and I'm surprised.

Does DataFrame(SubDataFrame) copy data, or not?
I found this [1] so it looks like eachcol(sdf) are Vector views and your approach collects those, so the underlying data isn't copied. That explains why little data is allocated. But then this means that you end up with a DataFrame that contains views? :-)

Anyway, to answer your question: The approach that you presented does indeed solve our problem, we will use it.

(But let me just mention: in our specific use case data copying is the issue, not latency. If latency were the issue, I'll just point out that the new approach is about 500x faster, which again isn't our current problem.)

@btime rename!(DataFrame(sdf, copycols=false), :x => :x2)  # 430us
@btime rename(sdf, :x => :x2)  # 720ns

Thanks for your help :-)

[1]

DataFrames.jl/src/subdataframe/subdataframe.jl

Lines 309 to 317 in 23a28b1

    
    function DataFrame(sdf::SubDataFrame; copycols::Bool=true)
        if copycols
            return sdf[:, :]
        else
            new_df = DataFrame(collect(eachcol(sdf)), _names(sdf), copycols=false)
            _copy_all_note_metadata!(new_df, sdf)
            return new_df
        end
    end

JoaoAparicio · 2023-04-18T19:10:57Z

I just realized that in situations where you do many renames (as opposed to just one), your approach might actually end up being faster: my approach would copy the df index over and over, whereas yours instantiates a (non-copying) df once, after which you can rename! many times.

bkamins · 2023-04-18T21:12:29Z

> Does DataFrame(SubDataFrame) copy data, or not?

It depends on the copycols argument, as you can see in the method you linked. By default it copies data, but with copycols=false it stores views of the source columns.
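A minimal sketch of the difference, on toy data assumed for illustration: the default constructor yields ordinary vectors, while `copycols=false` keeps views, so writes through the non-copying wrapper reach the parent.

```julia
using DataFrames

# Toy data, assumed for illustration only.
df = DataFrame(x = ['a', 'b', 'a', 'b'], y = [1, 2, 3, 4])
sdf = first(groupby(df, :x))

copied = DataFrame(sdf)                   # copycols=true is the default
shared = DataFrame(sdf, copycols=false)   # columns stay views into df

copied.y isa Vector      # true: freshly allocated column
shared.y isa SubArray    # true: a view of the parent column

# Writing through the view changes the parent, but not the copy.
shared.y[1] = 100
df.y[1]       # 100
copied.y[1]   # 1
```

The practical consequence is that the non-copying wrapper should be treated like the view it came from: it is cheap to create, but mutating its columns mutates the source data.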

> I'll just point out that the new approach is about 500x faster

Yes, I am aware of this difference. However, since DataFrame is not type stable anyway, anything you later do with it will cost more than nanoseconds. In short: I typically optimize things that cost on the order of seconds, but I do not optimize things that cost on the order of milliseconds, since such optimizations will not be noticeable most of the time anyway.

If nanosecond speed is needed by the user, it is probably better to switch to type-stable containers.

In summary - can the issue and PR be closed?

JoaoAparicio · 2023-04-18T22:37:06Z

Yes! Thank you very much for the clarifications

JoaoAparicio mentioned this issue Apr 18, 2023

Add rename() for Index, SubIndex, and SubDataFrame #3318

Closed

bkamins added the feature label Apr 18, 2023

bkamins added this to the 1.6 milestone Apr 18, 2023

JoaoAparicio closed this as completed Apr 18, 2023
