Make indexing of eachrow return the object of the same type on a view of the parent #3037

bkamins · 2022-04-09T16:30:52Z

Difference in performance:

julia> df = DataFrame(x=1:10^7);

julia> er = eachrow(df);

julia> @benchmark er[1:2:end]
BenchmarkTools.Trial: 10000 samples with 930 evaluations.
 Range (min … max):  101.398 ns …   4.602 μs  ┊ GC (min … max): 0.00% … 97.35%
 Time  (median):     114.194 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   125.150 ns ± 188.553 ns  ┊ GC (mean ± σ):  6.88% ±  4.44%

         ▂▇██▄▂
  ▄▄▄▃▄▅▆██████▆▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  101 ns           Histogram: frequency by time          179 ns <

 Memory estimate: 160 bytes, allocs estimate: 5.

julia> @benchmark er[1:2:end, 1]
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  106.814 ms … 376.798 ms  ┊ GC (min … max):  0.00% … 71.58%
 Time  (median):     222.870 ms               ┊ GC (median):    53.74%
 Time  (mean ± σ):   217.177 ms ±  84.670 ms  ┊ GC (mean ± σ):  50.89% ± 25.07%

  █▃         ▃▃                 █
  ██▁▁▁▁▁▁▁▁▁██▁▁▁▇▁▁▁▁▁▁▇▁▇▁▇▁▇█▁▁▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▇▁▇▁▇ ▁
  107 ms           Histogram: frequency by time          377 ms <

 Memory estimate: 343.32 MiB, allocs estimate: 5000007.

julia> s = trues(10^7);

julia> @benchmark er[s]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  30.100 μs … 136.300 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.640 μs ±   6.310 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▅▇▇██▇▇▆▅▄▃▂ ▁ ▂▁ ▁  ▁                                      ▃
  ████████████████████▇▇█▇▇▇▇▇█▇▇█▇▅▆▆▆▆▆▄▅▅▅▅▄▄▁▅▅▄▄▃▅▄▃▁▄▁▃▅ █
  30.1 μs       Histogram: log(frequency) by time      66.8 μs <

 Memory estimate: 160 bytes, allocs estimate: 4.

julia> @benchmark er[s, 1]
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  241.458 ms … 859.491 ms  ┊ GC (min … max):  0.00% … 75.40%
 Time  (median):     485.684 ms               ┊ GC (median):    55.90%
 Time  (mean ± σ):   538.071 ms ± 208.937 ms  ┊ GC (mean ± σ):  59.62% ± 22.31%

  ▁         ▁ ▁    ▁▁          ▁         ▁           █        ▁  
  █▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁██▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█ ▁
  241 ms           Histogram: frequency by time          859 ms <

 Memory estimate: 686.65 MiB, allocs estimate: 10000004.

For eachcol I have not changed anything as now we have:

Base.@propagate_inbounds Base.getindex(itr::DataFrameColumns, idx::MultiColumnIndex) =
    eachcol(parent(itr)[!, idx])

and the only difference would be that we could change it to:

Base.@propagate_inbounds Base.getindex(itr::DataFrameColumns, idx::MultiColumnIndex) =
    eachcol(@view parent(itr)[!, idx])

but I am not sure it is worth it (I think not).

bkamins · 2022-04-09T16:31:59Z

Also note that still doing er[1:2, 1] or er[1:2, 1:1, 1:1] etc. is allowed and produces Array as previously. I have not disabled this kind of indexing, but maybe we should (I think it is OK to keep it).

nalimilan · 2022-04-09T17:02:50Z

Wouldn't it make sense to use the same type (either DataFrame or SubDataFrame) for both eachcol and eachrow?

I don't see the performance comparison. Am I missing something? :-)

Also note that still doing er[1:2, 1] or er[1:2, 1:1, 1:1] etc. is allowed and produces Array as previously. I have not disabled this kind of indexing, but maybe we should (I think it is OK to keep it).

Yeah, better keep allowing it since it doesn't really hurt, given that we already allow it.

bkamins · 2022-04-09T17:18:37Z

I don't see the performance comparison. Am I missing something? :-)

Note that I am comparing e.g. @benchmark er[1:2:end] which uses the new result to @benchmark er[1:2:end, 1]
which produces a Vector (so what we used to produce with @benchmark er[1:2:end] before this PR).

In this way I can easily compare main to this PR without having to restart REPL :).

Wouldn't it make sense to use the same type (either DataFrame or SubDataFrame) for both eachcol and eachrow?

For eachrow we have to use SubDataFrame as we are subsetting rows.

For eachcol we select all rows with ! anyway so SubDataFrame is not needed. The difference is:

currently we use eachcol(parent(itr)[!, idx]) which produces a DataFrame that has the same columns (as tested by ===) as original data frame, but it is detached from it (it has a separate index etc.). The benefit is that if you write eachcol(df).somecolumn you get exactly the same column as when you write df.somecolumn
instead using eachcol(@view parent(itr)[!, idx]) would mean that eachcol(df).somecolumn would produce a view of df.somecolumn. Which I think is not desirable

bkamins · 2022-04-09T18:40:30Z

Thank you!

bkamins added 2 commits April 9, 2022 18:23

make multirow indexing of DataFrameRows produce DataFrameRows

a321842

add NEWS.md entry

907a36f

bkamins added the performance label Apr 9, 2022

bkamins added this to the 1.4 milestone Apr 9, 2022

bkamins requested a review from nalimilan April 9, 2022 16:32

nalimilan approved these changes Apr 9, 2022

View reviewed changes

bkamins merged commit 0c79493 into main Apr 9, 2022

bkamins deleted the bk/eachrow_indexing branch April 9, 2022 18:40

bkamins mentioned this pull request Oct 5, 2022

Filtering of eachrow(df) not working in 1.4.0 #3191

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make indexing of eachrow return the object of the same type on a view of the parent #3037

Make indexing of eachrow return the object of the same type on a view of the parent #3037

bkamins commented Apr 9, 2022

bkamins commented Apr 9, 2022

nalimilan commented Apr 9, 2022

bkamins commented Apr 9, 2022

bkamins commented Apr 9, 2022

Make indexing of eachrow return the object of the same type on a view of the parent #3037

Make indexing of eachrow return the object of the same type on a view of the parent #3037

Conversation

bkamins commented Apr 9, 2022

bkamins commented Apr 9, 2022

nalimilan commented Apr 9, 2022

bkamins commented Apr 9, 2022

bkamins commented Apr 9, 2022