Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make indexing of eachrow return the object of the same type on a view of the parent #3037

Merged
merged 2 commits into from
Apr 9, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Apr 9, 2022

Fixes #3023

Difference in performance:

julia> df = DataFrame(x=1:10^7);

julia> er = eachrow(df);

julia> @benchmark er[1:2:end]
BenchmarkTools.Trial: 10000 samples with 930 evaluations.
 Range (min … max):  101.398 ns …   4.602 μs  ┊ GC (min … max): 0.00% … 97.35%
 Time  (median):     114.194 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   125.150 ns ± 188.553 ns  ┊ GC (mean ± σ):  6.88% ±  4.44%

         ▂▇██▄▂
  ▄▄▄▃▄▅▆██████▆▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  101 ns           Histogram: frequency by time          179 ns <

 Memory estimate: 160 bytes, allocs estimate: 5.

julia> @benchmark er[1:2:end, 1]
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  106.814 ms … 376.798 ms  ┊ GC (min … max):  0.00% … 71.58%
 Time  (median):     222.870 ms               ┊ GC (median):    53.74%
 Time  (mean ± σ):   217.177 ms ±  84.670 ms  ┊ GC (mean ± σ):  50.89% ± 25.07%

  █▃         ▃▃                 █
  ██▁▁▁▁▁▁▁▁▁██▁▁▁▇▁▁▁▁▁▁▇▁▇▁▇▁▇█▁▁▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▇▁▇▁▇ ▁
  107 ms           Histogram: frequency by time          377 ms <

 Memory estimate: 343.32 MiB, allocs estimate: 5000007.

julia> s = trues(10^7);

julia> @benchmark er[s]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  30.100 μs … 136.300 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.640 μs ±   6.310 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▅▇▇██▇▇▆▅▄▃▂ ▁ ▂▁ ▁  ▁                                      ▃
  ████████████████████▇▇█▇▇▇▇▇█▇▇█▇▅▆▆▆▆▆▄▅▅▅▅▄▄▁▅▅▄▄▃▅▄▃▁▄▁▃▅ █
  30.1 μs       Histogram: log(frequency) by time      66.8 μs <

 Memory estimate: 160 bytes, allocs estimate: 4.

julia> @benchmark er[s, 1]
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  241.458 ms … 859.491 ms  ┊ GC (min … max):  0.00% … 75.40%
 Time  (median):     485.684 ms               ┊ GC (median):    55.90%
 Time  (mean ± σ):   538.071 ms ± 208.937 ms  ┊ GC (mean ± σ):  59.62% ± 22.31%

  ▁         ▁ ▁    ▁▁          ▁         ▁           █        ▁  
  █▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁██▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█ ▁
  241 ms           Histogram: frequency by time          859 ms <

 Memory estimate: 686.65 MiB, allocs estimate: 10000004.

For eachcol I have not changed anything as now we have:

Base.@propagate_inbounds Base.getindex(itr::DataFrameColumns, idx::MultiColumnIndex) =
    eachcol(parent(itr)[!, idx])

and the only difference would be that we could change it to:

Base.@propagate_inbounds Base.getindex(itr::DataFrameColumns, idx::MultiColumnIndex) =
    eachcol(@view parent(itr)[!, idx])

but I am not sure it is worth it (I think not).

@bkamins bkamins added this to the 1.4 milestone Apr 9, 2022
@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2022

Also note that still doing er[1:2, 1] or er[1:2, 1:1, 1:1] etc. is allowed and produces Array as previously. I have not disabled this kind of indexing, but maybe we should (I think it is OK to keep it).

@bkamins bkamins requested a review from nalimilan April 9, 2022 16:32
@nalimilan
Copy link
Member

Wouldn't it make sense to use the same type (either DataFrame or SubDataFrame) for both eachcol and eachrow?

I don't see the performance comparison. Am I missing something? :-)

Also note that still doing er[1:2, 1] or er[1:2, 1:1, 1:1] etc. is allowed and produces Array as previously. I have not disabled this kind of indexing, but maybe we should (I think it is OK to keep it).

Yeah, better keep allowing it since it doesn't really hurt, given that we already allow it.

@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2022

I don't see the performance comparison. Am I missing something? :-)

Note that I am comparing e.g. @benchmark er[1:2:end] which uses the new result to @benchmark er[1:2:end, 1]
which produces a Vector (so what we used to produce with @benchmark er[1:2:end] before this PR).

In this way I can easily compare main to this PR without having to restart REPL :).

Wouldn't it make sense to use the same type (either DataFrame or SubDataFrame) for both eachcol and eachrow?

For eachrow we have to use SubDataFrame as we are subsetting rows.

For eachcol we select all rows with ! anyway so SubDataFrame is not needed. The difference is:

  • currently we use eachcol(parent(itr)[!, idx]) which produces a DataFrame that has the same columns (as tested by ===) as original data frame, but it is detached from it (it has a separate index etc.). The benefit is that if you write eachcol(df).somecolumn you get exactly the same column as when you write df.somecolumn
  • instead using eachcol(@view parent(itr)[!, idx]) would mean that eachcol(df).somecolumn would produce a view of df.somecolumn. Which I think is not desirable

@bkamins bkamins merged commit 0c79493 into main Apr 9, 2022
@bkamins bkamins deleted the bk/eachrow_indexing branch April 9, 2022 18:40
@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make indexing of eachrow and eachcol return the object of the same type on a view of the parent
2 participants