
Improve performance of dropmissing #3256

Merged
merged 41 commits into from
Jan 27, 2023

Conversation

svilupp
Contributor

@svilupp svilupp commented Dec 23, 2022

Fixes #3254

TODO:
[x] change dropmissing
[x] change dropmissing! (no opportunities found)
[x] add/augment testing suite
[x] check docs and examples for consistency

@bkamins bkamins self-requested a review December 23, 2022 21:29
@bkamins bkamins added this to the patch milestone Dec 23, 2022
@bkamins
Member

bkamins commented Dec 23, 2022

Please also make sure to review the tests so that they adequately cover any new code paths that might be needed.
Regarding performance, you might also want to have a look at completecases, as there might be some room for gains there too (something like e.g. reduce_or!) - but whether there is real potential would need to be benchmarked.
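The fused reduction hinted at above could look something like the following sketch (my own illustration, not the DataFrames.jl internals; complete_rows is a hypothetical name): each column is scanned exactly once and the per-row "any missing" flags accumulate into a single BitVector, so no row objects are ever materialized.

```julia
# Hypothetical sketch (not the actual DataFrames.jl internals): fuse the
# per-column missingness checks into one BitVector reduction, so each
# column is scanned exactly once and no row objects are materialized.
function complete_rows(cols::AbstractVector{<:AbstractVector})
    n = length(first(cols))
    keep = trues(n)  # start by assuming every row is complete
    for col in cols
        @inbounds for i in 1:n
            # a row stays kept only if no column had a missing in it
            keep[i] &= !ismissing(col[i])
        end
    end
    return keep
end
```

On a data frame, `cols` would be the internal column vectors; the result plays the role of the completecases bitvector.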

@svilupp
Contributor Author

svilupp commented Dec 24, 2022

Yes, of course. I'll review/add tests as necessary.

I have quickly reviewed how it works now and I'm surprised you expect benefits from fusing the copy-rows-into-newdf step with disallowmissing. I'll study it a bit more and benchmark some variants.

Re. completecases, I was actually wondering if it could all be done in a single pass without too much additional code/complexity, e.g.,

  • initialize all vectors with undef
  • iterate by rows and, if !ismissing, copy them in
  • at the end, "shrink" the vectors if some slots were left empty (because there were missings)

Thanks for the tip! Let me know if you have any more tips or ideas. I'll get into it in a few days when I'm done travelling.
Is reduce_or an existing function or a suggestion for an approach? I couldn't find it.

@bkamins
Member

bkamins commented Dec 24, 2022

Your comments are very good. I just threw out some ideas I had. Unfortunately, as you say, one often needs to run benchmarks before finding the best approach, as things can be surprising. But there is certainly room for improvement, as DuckDB is noticeably faster.

Two additional notes:

  • it might be the case that for small and for large numbers of rows different strategies will be efficient
  • similarly, it might be the case that the number of columns in a data frame matters (and here there are two dimensions: the total number of columns and the number of columns inspected for missing values).

@svilupp
Contributor Author

svilupp commented Dec 27, 2022

Quick update from my side

  • built a benchmarking suite to test performance across different length/width/missingness shares
  • unfortunately, learned that my single-pass idea is 1-2 orders of magnitude slower
  • the main driver is that I cannot index into a DataFrame without allocations (I suspect the type-unstable indexing return type)
  • I'm not sure a single-pass method is feasible with many columns (as we need to go "horizontally")

Single-pass Implementation

  • I chose the following base case: 10k rows (rows_total), 10 columns (cols_total), 3 columns with missing data (cols_missing), 10% missingness (missingness_share)
  • part of the slowdown is that my function checks all columns for missingness, whereas the default can skip some (kwargs and optional subsets can be added later). Perhaps a fairer benchmark would use cols_total == cols_missing, but the bigger issue is that ->
  • I'm getting a huge number of allocations and am unsure how to avoid them (it seems to come from the type instability of DataFrame indexing across many columns)
# the simplest case with namedtuple row iterator, very slow
function dropmissing2(df::DataFrame)
    # init all vectors
    newdf = DataFrame()
    for col in names(df)
        newdf[!, col] = Vector{Union{eltype(df[!, col]),Missing}}(undef, nrow(df))
    end
    # iterate by row
    last_row = 0
    for row in DataFrames.Tables.namedtupleiterator(df)
        # none of the values are missing
        if !any(ismissing, values(row))
            last_row += 1
            for (key, val) in pairs(row)
                newdf[last_row, key] = val
                # newdf[!, key][idx] = val
            end
        end
    end
    # resize
    resize!(newdf, last_row)
    newdf
end

# construct with pairs
init_without_missing(df) = DataFrame([col => Vector{nonmissingtype(eltype(df[!, col]))}(undef, nrow(df)) for col in names(df)]...)

# construct via internal storage
init_without_missing2(df, nrows, cols) = DataFrame([Vector{nonmissingtype(eltype(col))}(undef, nrows) for col in DataFrames._columns(df)], cols)

# for-loop iteration directly on vectors
function dropmissing3(df::DataFrame)
    nrows = nrow(df) # 2 alloc
    cols = names(df) # 11 alloc
    # init all vectors
    newdf = init_without_missing2(df, nrows, cols) # 90 alloc
    # iterate by row
    last_row = 1
    @inbounds for i in 1:nrows
        nomissingfound = true
        for col in cols
            nomissingfound *= !ismissing(df[i, col]) # 1 alloc for each access?
            # stop if missing is found
            !nomissingfound && break
            newdf[last_row, col] = df[i, col] # 1 alloc for each access?
        end
        # if there was a missing, overwrite this row in the next iteration
        if nomissingfound
            last_row += 1 # increment for next pass
        end
    end
    # resize to required size
    resize!(newdf, last_row - 1) # 10 alloc
    newdf
end

@time newdf = dropmissing3(df);
@time newdf = dropmissing3(df);
# 0.045442 seconds (517.22 k allocations: 9.420 MiB)

# correctness check
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all
# true

Benchmarking results

# benchmark new implementations
basecase = (; rows_total=10000, cols_total=10, cols_missing=3, missingness_share=0.1)
pl = run_benchmark_suite(Val(:belapsed), [dropmissing, dropmissing!, dropmissing3], basecase)

image

Benchmarking setup

using DataFrames
using BenchmarkTools
using Plots

"generate a data frame with missing values for various scenarios"
function generate_data(; rows_total, cols_total, cols_missing, missingness_share)
    data = rand(rows_total, max(cols_total, cols_missing))
    data = Matrix{Union{Float64,Missing}}(data)
    for j in axes(data, 2)
        if j <= cols_missing
            data[:, j] .= ifelse.(@view(data[:, j]) .<= missingness_share, missing, @view data[:, j])
        end
    end
    DataFrame(data, :auto)
end

"time the function execution on the second run"
timer(method::Val{:elapsed}, func, bc; kwargs...) = (df = generate_data(; bc..., kwargs...); func(df); @elapsed func(df))
timer(method::Val{:belapsed}, func, bc; kwargs...) = (df = generate_data(; bc..., kwargs...); @belapsed($func($df)))
timer(method::Val{:allocated}, func, bc; kwargs...) = (df = generate_data(; bc..., kwargs...); func(df); @allocated func(df))

nt(key, val) = NamedTuple{(Symbol(key),)}((val,))
val_to_title(m) = match(r":([a-z]+)", string(m)).captures[1] |> uppercase

"plot scenarios in place"
function plot_scenarios!(pl, method, func::Function, bc; kwargs...)
    tested_key = intersect(keys(bc), keys(kwargs)) |> only
    data = [timer(method, func, bc; nt(tested_key, val)...) for val in kwargs[tested_key]]
    title = "Effect of $(uppercase(string(tested_key))) on $(val_to_title(method))"
    plot!(pl, kwargs[tested_key], data; label=string(func),
        title, xaxis=:log, yaxis=:log)
    return pl, data
end
function plot_scenarios(method, func::Function, bc; kwargs...)
    pl, data = plot_scenarios!(plot(), method, func, bc; kwargs...)
end

"plot scenarios with multiple functions"
function plot_scenarios(method, func_vec::AbstractVector, bc; pl=plot(), kwargs...)
    for func in func_vec
        pl, _ = plot_scenarios!(pl, method, func, bc; kwargs...)
    end
    pl, nothing
end

"Runs benchmarking and plots results in a 2x2 grid"
function run_benchmark_suite(method, func, bc)
    pltr(; kwargs...) = plot_scenarios(method, func, bc; kwargs...)
    plot([pltr(; rows_total=[10^3, 10^4, 10^5, 10^6])[1],
            pltr(; cols_total=[10, 100, 1000])[1],
            pltr(; cols_missing=[10, 100, 1000])[1],
            pltr(; missingness_share=[0.0001, 0.001, 0.01, 0.1])[1]]...,
        plot_title=string("DataFrames `dropmissing` Benchmark"),
        plot_titlefontsize=12,
        titlefontsize=10, size=(800, 600))
end

# benchmark an individual run
m = Val(:elapsed)
basecase = (; rows_total=10000, cols_total=10, cols_missing=3, missingness_share=0.1)
df = generate_data(; basecase...)
timer(m, dropmissing, basecase) # time a single run

# benchmark along a single dimension
plot_scenarios(m, dropmissing, basecase; rows_total=[10^3, 10^4, 10^5, 10^6]) |> first

# multiple functions at once
plot_scenarios(m, [dropmissing, dropmissing!], basecase; rows_total=[10^3, 10^4, 10^5, 10^6]) |> first

# execute the benchmarking suite
basecase = (; rows_total=10000, cols_total=10, cols_missing=3, missingness_share=0.1)
pl = run_benchmark_suite(Val(:elapsed), [dropmissing, dropmissing!], basecase)

EDIT: The timings of dropmissing! are incorrect as I didn't set evals=1, ie, it evaluated already mutated dataframes. In reality, dropmissing! seems to be much slower for larger vectors.

@bkamins
Member

bkamins commented Dec 27, 2022

newdf[last_row, key] = val for sure will be slow as it is type unstable.

In general 10k rows is small. Of course it would be good to optimize for that case too, but I would say the priority is to optimize for 10^6+ rows.

Maybe a good idea would be to start with profiling dropmissing and dropmissing! to see where the slowest parts are?
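A minimal sketch of the function-barrier idiom behind the remark above (my own illustration with hypothetical names, not the PR's code): pulling each column out of the abstractly-typed container once and handing it to a small helper lets the compiler specialize the hot loop on the concrete vector type - exactly what a cell-level write like newdf[last_row, key] = val prevents.

```julia
# Helper behind the function barrier: once `src` and `dest` have concrete
# types, this loop compiles to allocation-free element copies.
function fill_kept!(dest::AbstractVector, src::AbstractVector, keep::AbstractVector{Bool})
    j = 0
    @inbounds for i in eachindex(src, keep)
        if keep[i]
            j += 1
            dest[j] = src[i]
        end
    end
    return dest
end

# Caller holds columns in an abstractly-typed container (as a DataFrame
# does internally); dispatch happens once per column, not once per cell.
function keep_rows(cols::AbstractVector, keep::AbstractVector{Bool})
    n = count(keep)
    return AbstractVector[fill_kept!(similar(c, n), c, keep) for c in cols]
end
```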

@svilupp
Contributor Author

svilupp commented Dec 27, 2022

In general 10k rows is small.

Noted! I've moved the benchmark up from 10k to 10M rows.

Fusing together dropmissing + disallowmissing

# construct via internal storage
init_without_missing3(df, nrows) = DataFrame(AbstractVector[Vector{nonmissingtype(eltype(col))}(undef, nrows) for col in DataFrames._columns(df)], copy(DataFrames.index(df)), copycols=false)

# no-copy filler with binary mask (the hope was to be specialized and faster)
function fill_with_mask!(output_vec, input_vec, copy_mask)
    last_row = 0
    @inbounds for i in eachindex(input_vec, copy_mask)
        if copy_mask[i]
            last_row += 1
            output_vec[last_row] = input_vec[i]
        end
    end
end

function dropmissing6(df::AbstractDataFrame)
    rowidxs = completecases(df)
    # init without missing
    newdf = init_without_missing3(df, sum(rowidxs))
    # grab data references
    newdf_data = DataFrames._columns(newdf)
    df_data = DataFrames._columns(df)
    for i in eachindex(newdf_data, df_data)
        # fill data references without copying
        fill_with_mask!(newdf_data[i], df_data[i], rowidxs)
    end
    return newdf
end

### Alternative
# getter without missing type
function getindex_not_missing(vec, missing_mask)
    output = Vector{nonmissingtype(eltype(vec))}(undef, sum(missing_mask))
    # note: vec[missing_mask] allocates an intermediate copy before the broadcast
    output[:] .= vec[missing_mask]
    output
end

function dropmissing7(df::AbstractDataFrame)
    rowidxs = completecases(df)
    # init without missing
    new_columns = Vector{AbstractVector}(undef, ncol(df))
    # grab data references
    df_data = DataFrames._columns(df)
    for i in eachindex(new_columns, df_data)
        @inbounds new_columns[i] = getindex_not_missing(df_data[i], rowidxs)
    end
    return DataFrame(new_columns, copy(DataFrames.index(df)), copycols=false)
end

Results

# Benchmark - `dropmissing` still wins in benchmarks
@time dropmissing(df); # default
# 0.013654 seconds (129 allocations: 59.327 MiB)
@time dropmissing6(df); # fewer allocs but slower
# 0.035927 seconds (106 allocations: 26.819 MiB)
@time a = dropmissing7(df); # best trade-off but not fast enough
# 0.016576 seconds (116 allocations: 56.673 MiB)

# Correctness check - PASSES
@time newdf = dropmissing6(df);
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all
# true
@time newdf = dropmissing7(df);
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all
# true

image

The default algorithm is still ahead on most measures (especially for larger vectors, which is unexpected given that the new variants need fewer allocations...)

Zoom-in on allocations
image
Clearly, variant 6 has fewer allocations but is not faster

@svilupp
Contributor Author

svilupp commented Jan 7, 2023

Finally some progress, but I'm struggling to get more than 2x speedup (results below).
image

Changes:

  • keep the completecases bitvector but convert it to the indices of non-missing rows at the top level
  • then we can allocate right-sized vectors without the Missing type and fill them (getindex_not_missing3)
  • threading added to honour the same rules as the default implementation

Setup

"Getter function that produces a vector without the Missing type"
function getindex_not_missing3(vect, rowidxs)
    output = Vector{nonmissingtype(eltype(vect))}(undef, length(rowidxs))
    last_ind = 0
    @inbounds for i in eachindex(rowidxs)
        output[last_ind+=1] = vect[rowidxs[i]]
    end
    output
end
vect = ones(Union{Float32,Missing}, 10^7)
row_pos = 1:10^7
@time getindex_not_missing3(vect, row_pos) # 2 allocations
# 0.008571 seconds (2 allocations: 38.147 MiB)

function dropmissing8(df::AbstractDataFrame)
    # Note: timings on the right hand side are for a vector of 10^7 Float32 elements
    # get bitvector of rows with no missing values
    rowmask = completecases(df) # 8.6ms
    # convert to indices -> this also informs us about the right size
    rowidxs = DataFrames._findall(rowmask) # 5-6ms
    # init new columns (data)
    new_columns = Vector{AbstractVector}(undef, ncol(df))
    # grab data references
    df_columns = DataFrames._columns(df)
    # threading decision rule borrowed from `_threaded_getindex`
    if nrow(df) >= 1_000_000 && Threads.nthreads() > 1
        @sync for i in eachindex(new_columns)
            # creates a vector of the right size, without missing type, and copies only data at indices in rowidxs
            # replaces filtering + disallowmissing in two steps
            Threads.@spawn @inbounds new_columns[i] = getindex_not_missing3(df_columns[i], rowidxs) #c. 3.6ms per call
        end
    else
        for i in eachindex(new_columns)
            @inbounds new_columns[i] = getindex_not_missing3(df_columns[i], rowidxs)
        end
    end
    return DataFrame(new_columns, copy(DataFrames.index(df)), copycols=false)
end

Timings

basecase = (; rows_total=10000000, cols_total=10, cols_missing=10, missingness_share=0.1)
df = generate_data(; basecase...)
@time dropmissing8(df);
# 0.036446 seconds (270 allocations: 295.067 MiB)
# without the threading it would be: 0.061811 seconds (98 allocations: 295.358 MiB)
@btime dropmissing8($df); # 51-52ms not threaded / 27-28ms threaded
# 27.359 ms (172 allocations: 295.06 MiB)


# Baseline
@time dropmissing(df);
# 0.074098 seconds (205 allocations: 594.339 MiB)
@btime dropmissing($df);
# 60.561 ms (202 allocations: 594.34 MiB)

# Correctness test - pass
@time newdf = dropmissing8(df);
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all # true

@svilupp
Contributor Author

svilupp commented Jan 7, 2023

Just out of curiosity, I checked how it would look if threading was always on (dropmissing9):

  • It seems that on my system the break-even is at 1000 rows (with 10 columns/6 threads), whilst the decision rule looks for >1 million rows...
  • In addition, I'm not sure what the difference is between Threads.@threads and @spawn-ing, but the former seems to have better performance on my system? (I need to double-check it's not just because the if condition was removed, but that branch should cost about the same)

image

"Identical to dropmissing8 but using `Threads.@threads`"
function dropmissing9(df::AbstractDataFrame)
    # Note: timings on the right hand side are for a vector of 10^7 Float32 elements
    # get bitvector of rows with no missing values
    rowmask = completecases(df) # 8.6ms
    # convert to indices -> this also informs us about the right size
    rowidxs = DataFrames._findall(rowmask) # 5-6ms
    # init new columns (data)
    new_columns = Vector{AbstractVector}(undef, ncol(df))
    # grab data references
    df_columns = DataFrames._columns(df)
    Threads.@threads for i in eachindex(new_columns)
        # creates a vector of the right size, without missing type, and copies only data at indices in rowidxs
        # replaces filtering + disallowmissing in two steps
        @inbounds new_columns[i] = getindex_not_missing3(df_columns[i], rowidxs) #c. 3.6ms per call
    end
    return DataFrame(new_columns, copy(DataFrames.index(df)), copycols=false)
end

@bkamins
Member

bkamins commented Jan 8, 2023

It seems that on my system the break-even is at 1000 rows (with 10 columns/6 threads), whilst the decision rule looks for >1 million rows...

We wanted to be on the safe side (i.e. not use threads unless there is a clear benefit). In particular, the threading benefit might vary across the machines the code runs on.

Still - the optimal threshold might be different for different operations so it does not have to be the same everywhere.

In addition, I'm not sure what the difference is between Threads.@threads and @spawn-ing, but the former seems to have better performance on my system? (I need to double check it's not because of removing the IF condition, but that should be as costly)

We need to keep Julia 1.6 compatibility, where @threads only had a static scheduler and there was an issue with @threads composability (i.e. when a DataFrames.jl operation that is potentially multi-threaded would itself be spawned in multi-threaded code). @spawn does not have these issues.
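The @sync/@spawn shape described above can be sketched as follows (a minimal illustration with a hypothetical name, not the package's code): each per-column task goes through the dynamic scheduler, so the pattern composes when the caller is itself inside a threaded region, and it behaves the same on Julia 1.6.

```julia
# Apply `f` to each column on its own task; @sync waits for all of them.
# Each task writes to a distinct slot of `out`, so no locking is needed.
function map_columns_spawned(f, cols::AbstractVector)
    out = Vector{Any}(undef, length(cols))
    @sync for i in eachindex(cols)
        Threads.@spawn begin
            out[i] = f(cols[i])
        end
    end
    return out
end
```

With a single thread the spawned tasks simply run sequentially, so the code degrades gracefully.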

@svilupp
Contributor Author

svilupp commented Jan 14, 2023

Hi @bkamins,

I've tried to integrate the new implementation of dropmissing into the DataFrames.jl codebase.
Could you please check if it's acceptable?

My thinking:

  • I tried to mimic your current design and variable names, but, unfortunately, I couldn't leverage the getindex methods
  • This implementation mostly relies on _getindex_disallowmissing for a vector, which is meant to be an internal method fusing getindex + disallowmissing! to save cycles. It is then wrapped in a method of the same name for DataFrames that decides between getindex() and _getindex_disallowmissing based on the user's request
  • While getindex() has many upstream use cases, this has only one, so I chose to write fewer lines of code (e.g., not creating separate methods for dropmissing(df, :) vs dropmissing(df, :x1) vs dropmissing(df, [:x1, :x2]))
  • I've broken your abstraction hierarchy, where the special threading case is hidden in a dedicated function (_getindex_threaded), to avoid unnecessary LoC, and added it inside the loop of _getindex_disallowmissing

Sharp edges:

  • Given the single purpose, I didn't add try-catch blocks to _getindex_disallowmissing to give nicer errors, because it's an internal method and its usage is safe (we know that those indices don't have missings thanks to completecases)

I've tested correctness against the default function, but once we agree on the design, I'll add some individual tests.

Next steps:
[x] Docstrings. I've added comments to explain the logic, but its usage and implementation are not changing, so I intend to keep the current docstrings
[ ] Design. Agree on design/integration of the new function
[ ] Propose a new threading rule (num_rows)
[ ] Tests

Side observations:

  • At large row volumes, the mutating version dropmissing! gets very slow (my benchmark above was wrong as it didn't set evals=1 in the benchmarking macro). I've checked the implementation with deleteat! and I don't see how I could significantly improve it without allocating, which would defeat its in-place "spirit"
  • I have re-ordered the threading if condition ("to thread or not to thread") to first check if there are any threads available (it's the cheapest one, so it can short-circuit quickly)

Performance results:
The new implementation does speed things up (_dropmissing is the new one):

basecase = (; rows_total=10^7, cols_total=10, cols_missing=3, missingness_share=0.1)
df = generate_data(; basecase...)

# Perf
@time dropmissing(df);
# 0.176595 seconds (301 allocations: 594.392 MiB, 57.60% gc time)
@time _dropmissing(df);
# 0.078146 seconds (172 allocations: 295.083 MiB)
@btime dropmissing($df);
# 60.178 ms (202 allocations: 594.39 MiB)
@btime _dropmissing($df);
# 28.326 ms (172 allocations: 295.08 MiB)

# Correctness check
newdf = _dropmissing(df);
@assert isequal(newdf, dropmissing(df))
newdf = _dropmissing(df,[:x1,:x2]);
@assert isequal(newdf, dropmissing(df,[:x1,:x2]))

benchmark_20230114

@svilupp
Contributor Author

svilupp commented Jan 14, 2023

As for the decision of when to engage in threading, I did a quick benchmark (see below).
My recommendation would be to start threading when nrow(df) > 100_000 (conservative).
However, I'd like to encourage consideration of a 10_000 threshold (IMO it's a better trade-off for real-world data sets).

I've looked at the breakeven point for threading across different numbers of rows and columns (max 6 columns, as I have 6 threads), and also at the special case where only 1 column has missings (i.e., fewer rows dropped).

Findings:

  • The breakeven on my machine seems to be around 50_000 rows already with 2 columns
  • With 6 columns (leveraging all threads), this breakeven moves to 5_000 when only 1 column has missings; it's closer to 50k if all columns have missings (more rows get dropped when more columns have missings, so the threading overhead is more significant)
  • This breakeven keeps going down but seems to stay above 1_000 rows; e.g., with 18 columns and 1 missing column (turning around each thread 3 times), the breakeven moves to c. 2_000 rows
  • I recommended 100_000 rows because I suspect that threading might be more efficient on my machine (M1 Pro/32GB) than for the average user
  • However, ~10_000 rows is worth considering - the assumption being that most users have 4 threads and most data frames have >4 columns (i.e., they are not in the "long" format during the clean-up phase);
    even if threading were slower, we're around the 10^-4s time scale, so a user would never notice an accidental slowdown, but they benefit significantly if there are many columns (e.g., with 18 columns and 50k rows the speed-up from threading is 5x)

The case when all columns have missing (more rows dropped)
Look where the green line (threaded code) crosses the red line (no threading)
benchmark_threading_missing_match_20230114

The case when only 1 column has a missing (fewer rows dropped)
benchmark_threading_missing_1_20230114

Versioninfo

System: 6 threads, 32GB RAM
Julia Version 1.8.4
Commit 00177ebc4fc (2022-12-23 21:32 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 6 virtual cores

Code:

plt_array = []
for num_cols in 1:6
    num_cols_missing = 1 # or num_cols
    basecase = (; rows_total=10^6, cols_total=num_cols, cols_missing=num_cols_missing, missingness_share=0.1)
    @time pl = plot_scenarios(Val(:belapsed), [dropmissing, _dropmissing, _dropmissing_threaded], basecase;
        rows_total=[10^3, 5*10^3, 10^4, 5*10^4, 10^5],
        title="# Cols: $num_cols / # Missing: $num_cols_missing") |> first
    push!(plt_array, plot(pl, title="# Cols: $num_cols / # Missing: $num_cols_missing"))
end
pl = plot(plt_array..., layout=(2, 3), plot_title="Effect of threading on dropmissing", size=(1000, 800))

Review comments on src/dataframe/dataframe.jl and src/other/utils.jl (outdated, resolved)
- merged all pathways into one function for all abstract dataframes
- removed other methods
- calling the function disallowmissing explicitly from the package Missings, as it otherwise conflicts with the keyword name
Review comments on test/data.jl (outdated, resolved)
@bkamins
Member

bkamins commented Jan 24, 2023

@svilupp - note that I have pushed some changes to the PR.

Review comments on test/data.jl (outdated, resolved)
@bkamins
Member

bkamins commented Jan 25, 2023

@svilupp - can you please, when you have time, check the changes in the code I made (I have made them in my head, but they pass tests, so hopefully they are OK 😄). Thank you!

@svilupp
Contributor Author

svilupp commented Jan 25, 2023

@svilupp - can you please, when you have time, check the changes in the code I made (I have made them in my head, but they pass tests, so hopefully they are OK 😄). Thank you!

All looks good! I really like the change to enumerate(eachcol(df)) - much more elegant.
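The enumerate(eachcol(df)) idiom praised above, reduced to a standalone sketch (using a plain vector of columns so it runs without DataFrames.jl; with a data frame, cols would be eachcol(df)):

```julia
# `enumerate` pairs each column's position with the column itself, so the
# loop can write into a preallocated slot instead of looking columns up
# by name on every iteration.
function missing_counts(cols)
    counts = Vector{Int}(undef, length(cols))
    for (i, col) in enumerate(cols)
        counts[i] = count(ismissing, col)
    end
    return counts
end
```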

@svilupp svilupp marked this pull request as ready for review January 25, 2023 20:05
Review comments on test/data.jl (outdated, resolved)
@bkamins
Member

bkamins commented Jan 26, 2023

I made some more small tweaks (in particular to use BitSet for lookup). Apart from this things look good. @nalimilan - can you please have a look and approve if all is OK?

Performance improvement on a real case:

julia> summary(df)
"42710197×3 DataFrame"

julia> @btime dropmissing2($df); # new
  171.168 ms (75 allocations: 883.77 MiB)

julia> @btime dropmissing($df); # old
  355.706 ms (82 allocations: 1.58 GiB)
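The BitSet tweak mentioned above, sketched in isolation (a hedged illustration with a hypothetical name; the exact PR code may differ): when only a subset of columns is checked for missings, membership tests against a BitSet of column indices are constant-time bit tests rather than linear scans of a Vector{Int}.

```julia
# Mark, for each of `ncols` columns, whether its index is in the selected
# set; `in` on a BitSet is a constant-time bit test.
function selected_flags(ncols::Int, selected)
    sel = BitSet(selected)
    return Bool[i in sel for i in 1:ncols]
end
```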

Member

@nalimilan nalimilan left a comment


Thanks!

Review comments on src/abstractdataframe/abstractdataframe.jl (outdated, resolved)
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit fdd9193 into JuliaData:main Jan 27, 2023
@bkamins
Member

bkamins commented Jan 27, 2023

Thank you! I hope you enjoyed the process. You are welcome to open other PRs.

Successfully merging this pull request may close these issues.

improve performance of dropmissing
3 participants