Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add insert! and pushfirst! #3072

Merged
merged 19 commits into from
Jun 19, 2022
Merged

Add insert! and pushfirst! #3072

merged 19 commits into from
Jun 19, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 9, 2022

Fixes #2936

@nalimilan - this PR + #3071 (small patch) will complete the row-oriented API for DataFrames.jl.

(unfortunately it is large, that is why I made earlier PRs in this sequence separately)

@bkamins bkamins added the feature label Jun 9, 2022
@bkamins bkamins added this to the 1.4 milestone Jun 9, 2022
@bkamins
Copy link
Member Author

bkamins commented Jun 9, 2022

In the PR I fall back with push! and pushfirst! to insert!. It will have a minimal performance impact, but it should be negligible (and the benefit is that I avoid code duplication).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked whether the performance impact is indeed small? Sounds likely given the overhead of accessing each column anyway.

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved
src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jun 15, 2022

Have you checked whether the performance impact is indeed small?

Indeed the performance impact is noticeable unfortunately. I will work on it and let you know (all other issues are resolved)

@bkamins
Copy link
Member Author

bkamins commented Jun 15, 2022

As usual if we go into details of performance complexity comes:

  1. interestingly the problematic check was nrow(df), which we should avoid if possible since it is type unstable (of course sometimes it is unavoidable)
  2. I found a case when old push! code was allowing alias in case when in the new code we catch it (and I think it is better we catch all cases of aliasing)

Which means I ended up with custom implementations of all cases :( (thus the PR is even larger).

Here is the benchmark code:

function push_nt()
    df = DataFrame(rand(1, 10), :auto)
    x = copy(df[1, :]) # NamedTuple
    for i in 1:1_000_000
        push!(df, x)
    end
end

function push_vec()
    df = DataFrame(rand(1, 10), :auto)
    x = ones(10) # vector
    for i in 1:1_000_000
        push!(df, x)
    end
end

function pushfirst_nt()
    df = DataFrame(rand(1, 10), :auto)
    x = copy(df[1, :]) # NamedTuple
    for i in 1:1_000_000
        pushfirst!(df, x)
    end
end

function pushfirst_vec()
    df = DataFrame(rand(1, 10), :auto)
    x = ones(10) # vector
    for i in 1:1_000_000
        pushfirst!(df, x)
    end
end

function insert_nt()
    df = DataFrame(rand(1, 10), :auto)
    x = copy(df[1, :]) # NamedTuple
    for i in 1:1_000_000
        insert!(df, i+1, x)
    end
end

function insert_vec()
    df = DataFrame(rand(1, 10), :auto)
    x = ones(10) # vector
    for i in 1:1_000_000
        insert!(df, i+1, x)
    end
end

using BenchmarkTools
@btime push_nt();
@btime push_vec();
@btime pushfirst_nt();
@btime pushfirst_vec();
@btime insert_nt();
@btime insert_vec();

And the results:

THIS PR:

julia> @btime push_nt();
  1.721 s (53989473 allocations: 1.64 GiB)

julia> @btime push_vec();
  1.003 s (30989449 allocations: 502.43 MiB)

julia> @btime pushfirst_nt();
  2.421 s (73979273 allocations: 1.94 GiB)

julia> @btime pushfirst_vec();
  1.144 s (30989449 allocations: 502.43 MiB)

julia> @btime insert_nt();
  2.445 s (85973152 allocations: 2.12 GiB)

julia> @btime insert_vec();
  1.211 s (41983838 allocations: 670.19 MiB)

and before this PR (push! only):

julia> @btime push_nt();
  1.873 s (53989493 allocations: 1.64 GiB)

julia> @btime push_vec();
  1.085 s (30989469 allocations: 502.43 MiB)

So:

  1. current push! is minimally faster than the previous implementation. (this is probably within measurement error)
  2. it is worth to have separate push! implementation
  3. pushfirst! is minimally faster than insert! - we could drop custom implementation of pushfirst!, but I kept it because it allocates less (so it will trigger GC less often apart from benchmark considerations)

@nalimilan
Copy link
Member

Ah too bad. Wouldn't there be a way to share some of the code between the three functions?

@bkamins
Copy link
Member Author

bkamins commented Jun 17, 2022

Wouldn't there be a way to share some of the code between the three functions?

OK - I will share it and let you decide if it is simpler or not.

@bkamins
Copy link
Member Author

bkamins commented Jun 17, 2022

@nalimilan . I have refactored code so that:

  1. there is no code duplication (by using Val all is optimized-out so there is no performance cost of such an approach - I have run the benchmarks I checked above and they are good)
  2. all "insertion" is moved into a single file (it was needed because of DataFrameRow, which is defined in another file)
  3. I moved append! and prepend! to the same file as they logically match the push! and friends
  4. I have made sure that we "rollback" only the columns that were updated. There is one case when we do not correctly roll-back, but it is only for corrupted data frame, and I would leave it as it is now (i.e. with incorrect rollback):
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> push!(df.b, 10)
4-element Vector{Int64}:
  4
  5
  6
 10

julia> push!(df, (100, 101))
┌ Error: Error adding value to column :b.
└ @ DataFrames c:\Users\bogum\.julia\dev\DataFrames\src\dataframe\insertion.jl:680
ERROR: AssertionError: length(col) == nrows

julia> df
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

(the reason why this rollback is incorrect is that :b had length 4 originally and code thinks that it was 4 because of successful adding of 1 row, while it was originally corrupted on purpose; of course we could also detect this case, but I think it is obscure enough to be acceptable)

docs/src/man/getting_started.md Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
src/dataframe/insertion.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits June 18, 2022 23:15
@bkamins bkamins merged commit 0367281 into main Jun 19, 2022
@bkamins bkamins deleted the bk/add_insert! branch June 19, 2022 09:30
@bkamins
Copy link
Member Author

bkamins commented Jun 19, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add keepat! and insert! and pushfirst!
2 participants