
Improve performance of dropmissing #3256

Merged
merged 41 commits into from
Jan 27, 2023

Conversation

svilupp
Contributor

@svilupp svilupp commented Dec 23, 2022

Fixes #3254

TODO:
[x] change dropmissing
[x] change dropmissing! (no opportunities found)
[x] add/augment testing suite
[x] check docs and examples for consistency

@bkamins bkamins self-requested a review December 23, 2022 21:29
@bkamins bkamins added this to the patch milestone Dec 23, 2022
@bkamins
Member

bkamins commented Dec 23, 2022

Please also make sure to review the tests so that they adequately cover any new code paths that might be needed.
Regarding performance, you might also want to have a look at completecases, as there might be some room for gains there too (something like e.g. reduce_or!) - but whether there is real potential would need to be benchmarked.
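The fused reduction hinted at above could look something like the following sketch (my own illustration, not the DataFrames.jl internals; complete_rows is a hypothetical name): each column is scanned exactly once and the per-row "any missing" flags accumulate into a single BitVector, so no row objects are ever materialized.

```julia
# Hypothetical sketch (not the actual DataFrames.jl internals): fuse the
# per-column missingness checks into one BitVector reduction, so each
# column is scanned exactly once and no row objects are materialized.
function complete_rows(cols::AbstractVector{<:AbstractVector})
    n = length(first(cols))
    keep = trues(n)  # start by assuming every row is complete
    for col in cols
        @inbounds for i in 1:n
            # a row stays kept only if no column had a missing in it
            keep[i] &= !ismissing(col[i])
        end
    end
    return keep
end
```

On a data frame, `cols` would be the internal column vectors; the result plays the role of the completecases bitvector.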

@svilupp
Contributor Author

svilupp commented Dec 24, 2022

Yes, of course. I'll review/add tests as necessary.

I have quickly reviewed how it works now and I'm surprised you expect benefits from fusing the copy-rows-into-newdf step with disallowmissing. I'll study it a bit more and benchmark some variants.

Re. completecases, I was actually wondering if it could all be done in a single pass without too much additional code/complexity, e.g.,

  • initialize all vectors with undef
  • iterate by rows and, if !ismissing, copy them in
  • at the end, "shrink" the vectors if some slots were left empty (because there were missings)

Thanks for the tip! Let me know if you have any more tips or ideas. I'll get into it in a few days when I'm done travelling.
Is reduce_or an existing function or a suggestion for an approach? I couldn't find it.

@bkamins
Member

bkamins commented Dec 24, 2022

Your comments are very good. I just threw out some ideas I had. Unfortunately, as you say, one often needs to run benchmarks before finding the best approach, as things can be surprising. But there is certainly room for improvement, as DuckDB is noticeably faster.

Two additional notes:

  • it might be the case that for small and for large numbers of rows different strategies will be efficient
  • similarly, it might be the case that the number of columns in a data frame matters (and here there are two dimensions: the total number of columns and the number of columns inspected for missing values).

@svilupp
Contributor Author

svilupp commented Dec 27, 2022

Quick update from my side

  • built a benchmarking suite to test performance across different length/width/missingness shares
  • unfortunately, learned that my single-pass idea is 1-2 orders of magnitude slower
  • the main driver is that I cannot index into a DataFrame without allocations (I suspect the type-unstable indexing return type)
  • I'm not sure a single-pass method is feasible with many columns (as we need to go "horizontally")

Single-pass Implementation

  • I chose the following base case: 10k rows (rows_total), 10 columns (cols_total), 3 columns with missing data (cols_missing), 10% missingness (missingness_share)
  • part of the slowdown is that my function checks all columns for missingness, whereas the default can skip some (kwargs and optional subsets can be added later). Perhaps a fairer benchmark would use cols_total == cols_missing, but the bigger issue is that ->
  • I'm getting a huge number of allocations and am unsure how to avoid them (it seems to come from the type instability of DataFrame indexing across many columns)
# the simplest case with namedtuple row iterator, very slow
function dropmissing2(df::DataFrame)
    # init all vectors
    newdf = DataFrame()
    for col in names(df)
        newdf[!, col] = Vector{Union{eltype(df[!, col]),Missing}}(undef, nrow(df))
    end
    # iterate by row
    last_row = 0
    for row in DataFrames.Tables.namedtupleiterator(df)
        # none of the values are missing
        if !any(ismissing, values(row))
            last_row += 1
            for (key, val) in pairs(row)
                newdf[last_row, key] = val
                # newdf[!, key][idx] = val
            end
        end
    end
    # resize
    resize!(newdf, last_row)
    newdf
end

# construct with pairs
init_without_missing(df) = DataFrame([col => Vector{nonmissingtype(eltype(df[!, col]))}(undef, nrow(df)) for col in names(df)]...)

# construct via internal storage
init_without_missing2(df, nrows, cols) = DataFrame([Vector{nonmissingtype(eltype(col))}(undef, nrows) for col in DataFrames._columns(df)], cols)

# for-loop iteration directly on vectors
function dropmissing3(df::DataFrame)
    nrows = nrow(df) # 2 alloc
    cols = names(df) # 11 alloc
    # init all vectors
    newdf = init_without_missing2(df, nrows, cols) # 90 alloc
    # iterate by row
    last_row = 1
    @inbounds for i in 1:nrows
        nomissingfound = true
        for col in cols
            nomissingfound *= !ismissing(df[i, col]) # 1 alloc for each access?
            # stop if missing is found
            !nomissingfound && break
            newdf[last_row, col] = df[i, col] # 1 alloc for each access?
        end
        # if there was a missing, overwrite this row in the next iteration
        if nomissingfound
            last_row += 1 # increment for next pass
        end
    end
    # resize to required size
    resize!(newdf, last_row - 1) # 10 alloc
    newdf
end

@time newdf = dropmissing3(df);
@time newdf = dropmissing3(df);
# 0.045442 seconds (517.22 k allocations: 9.420 MiB)

# correctness check
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all
# true

Benchmarking results

# benchmark new implementations
basecase = (; rows_total=10000, cols_total=10, cols_missing=3, missingness_share=0.1)
pl = run_benchmark_suite(Val(:belapsed), [dropmissing, dropmissing!, dropmissing3], basecase)

image

Benchmarking setup

using DataFrames
using BenchmarkTools
using Plots

"generate a data frame with missing values for various scenarios"
function generate_data(; rows_total, cols_total, cols_missing, missingness_share)
    data = rand(rows_total, max(cols_total, cols_missing))
    data = Matrix{Union{Float64,Missing}}(data)
    for j in axes(data, 2)
        if j <= cols_missing
            data[:, j] .= ifelse.(@view(data[:, j]) .<= missingness_share, missing, @view data[:, j])
        end
    end
    DataFrame(data, :auto)
end

"time the function execution on the second run"
timer(method::Val{:elapsed}, func, bc; kwargs...) = (df = generate_data(; bc..., kwargs...); func(df); @elapsed func(df))
timer(method::Val{:belapsed}, func, bc; kwargs...) = (df = generate_data(; bc..., kwargs...); @belapsed($func($df)))
timer(method::Val{:allocated}, func, bc; kwargs...) = (df = generate_data(; bc..., kwargs...); func(df); @allocated func(df))

nt(key, val) = NamedTuple{(Symbol(key),)}((val,))
val_to_title(m) = match(r":([a-z]+)", string(m)).captures[1] |> uppercase

"plot scenarios in place"
function plot_scenarios!(pl, method, func::Function, bc; kwargs...)
    tested_key = intersect(keys(bc), keys(kwargs)) |> only
    data = [timer(method, func, bc; nt(tested_key, val)...) for val in kwargs[tested_key]]
    title = "Effect of $(uppercase(string(tested_key))) on $(val_to_title(method))"
    plot!(pl, kwargs[tested_key], data; label=string(func),
        title, xaxis=:log, yaxis=:log)
    return pl, data
end
function plot_scenarios(method, func::Function, bc; kwargs...)
    pl, data = plot_scenarios!(plot(), method, func, bc; kwargs...)
end

"plot scenarios with multiple functions"
function plot_scenarios(method, func_vec::AbstractVector, bc; pl=plot(), kwargs...)
    for func in func_vec
        pl, _ = plot_scenarios!(pl, method, func, bc; kwargs...)
    end
    pl, nothing
end

"Runs benchmarking and plots results in a 2x2 grid"
function run_benchmark_suite(method, func, bc)
    pltr(; kwargs...) = plot_scenarios(method, func, bc; kwargs...)
    plot([pltr(; rows_total=[10^3, 10^4, 10^5, 10^6])[1],
            pltr(; cols_total=[10, 100, 1000])[1],
            pltr(; cols_missing=[10, 100, 1000])[1],
            pltr(; missingness_share=[0.0001, 0.001, 0.01, 0.1])[1]]...,
        plot_title=string("DataFrames `dropmissing` Benchmark"),
        plot_titlefontsize=12,
        titlefontsize=10, size=(800, 600))
end

# benchmark an individual run
m = Val(:elapsed)
basecase = (; rows_total=10000, cols_total=10, cols_missing=3, missingness_share=0.1)
df = generate_data(; basecase...)
timer(m, dropmissing, basecase) # time a single run

# benchmark along a single dimension
plot_scenarios(m, dropmissing, basecase; rows_total=[10^3, 10^4, 10^5, 10^6]) |> first

# multiple functions at once
plot_scenarios(m, [dropmissing, dropmissing!], basecase; rows_total=[10^3, 10^4, 10^5, 10^6]) |> first

# execute the benchmarking suite
basecase = (; rows_total=10000, cols_total=10, cols_missing=3, missingness_share=0.1)
pl = run_benchmark_suite(Val(:elapsed), [dropmissing, dropmissing!], basecase)

EDIT: The timings of dropmissing! are incorrect as I didn't set evals=1, ie, it evaluated already mutated dataframes. In reality, dropmissing! seems to be much slower for larger vectors.

@bkamins
Member

bkamins commented Dec 27, 2022

newdf[last_row, key] = val for sure will be slow as it is type unstable.

In general 10k rows is small. Of course it would be good to optimize for that case too, but I would say the priority is to optimize for 10^6+ rows.

Maybe a good idea would be to start with profiling dropmissing and dropmissing! to see where the slowest parts are?
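A minimal sketch of the function-barrier idiom behind the remark above (my own illustration with hypothetical names, not the PR's code): pulling each column out of the abstractly-typed container once and handing it to a small helper lets the compiler specialize the hot loop on the concrete vector type - exactly what a cell-level write like newdf[last_row, key] = val prevents.

```julia
# Helper behind the function barrier: once `src` and `dest` have concrete
# types, this loop compiles to allocation-free element copies.
function fill_kept!(dest::AbstractVector, src::AbstractVector, keep::AbstractVector{Bool})
    j = 0
    @inbounds for i in eachindex(src, keep)
        if keep[i]
            j += 1
            dest[j] = src[i]
        end
    end
    return dest
end

# Caller holds columns in an abstractly-typed container (as a DataFrame
# does internally); dispatch happens once per column, not once per cell.
function keep_rows(cols::AbstractVector, keep::AbstractVector{Bool})
    n = count(keep)
    return AbstractVector[fill_kept!(similar(c, n), c, keep) for c in cols]
end
```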

@svilupp
Contributor Author

svilupp commented Dec 27, 2022

In general 10k rows is small.

Noted! I've moved the benchmark up from 10k to 10M rows.

Fusing together dropmissing + disallowmissing

# construct via internal storage
init_without_missing3(df, nrows) = DataFrame(AbstractVector[Vector{nonmissingtype(eltype(col))}(undef, nrows) for col in DataFrames._columns(df)], copy(DataFrames.index(df)), copycols=false)

# no-copy filler with binary mask (the hope was to be specialized and faster)
function fill_with_mask!(output_vec, input_vec, copy_mask)
    last_row = 0
    @inbounds for i in eachindex(input_vec, copy_mask)
        if copy_mask[i]
            last_row += 1
            output_vec[last_row] = input_vec[i]
        end
    end
end

function dropmissing6(df::AbstractDataFrame)
    rowidxs = completecases(df)
    # init without missing
    newdf = init_without_missing3(df, sum(rowidxs))
    # grab data references
    newdf_data = DataFrames._columns(newdf)
    df_data = DataFrames._columns(df)
    for i in eachindex(newdf_data, df_data)
        # fill data references without copying
        fill_with_mask!(newdf_data[i], df_data[i], rowidxs)
    end
    return newdf
end

### Alternative
# getter without missing type
function getindex_not_missing(vec, missing_mask)
    output = Vector{nonmissingtype(eltype(vec))}(undef, sum(missing_mask))
    # note: vec[missing_mask] allocates an intermediate copy before the broadcast
    output[:] .= vec[missing_mask]
    output
end

function dropmissing7(df::AbstractDataFrame)
    rowidxs = completecases(df)
    # init without missing
    new_columns = Vector{AbstractVector}(undef, ncol(df))
    # grab data references
    df_data = DataFrames._columns(df)
    for i in eachindex(new_columns, df_data)
        @inbounds new_columns[i] = getindex_not_missing(df_data[i], rowidxs)
    end
    return DataFrame(new_columns, copy(DataFrames.index(df)), copycols=false)
end

Results

# Benchmark - `dropmissing` still wins in benchmarks
@time dropmissing(df); # default
# 0.013654 seconds (129 allocations: 59.327 MiB)
@time dropmissing6(df); # fewer allocs but slower
# 0.035927 seconds (106 allocations: 26.819 MiB)
@time a = dropmissing7(df); # best trade-off but not fast enough
# 0.016576 seconds (116 allocations: 56.673 MiB)

# Correctness check - PASSES
@time newdf = dropmissing6(df);
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all
# true
@time newdf = dropmissing7(df);
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all
# true

image

The default algorithm is still ahead on most measures (especially for larger vectors, which is unexpected given that the new variants need fewer allocations...)

Zoom-in on allocations
image
Clearly, variant 6 has fewer allocations but is not faster

@svilupp
Contributor Author

svilupp commented Jan 7, 2023

Finally some progress, but I'm struggling to get more than 2x speedup (results below).
image

Changes:

  • keep the completecases bitvector but convert it to the indices of non-missing rows at the top level
  • then we can allocate right-sized vectors without the Missing type and fill them (getindex_not_missing3)
  • threading added to honour the same rules as the default implementation

Setup

"Getter function that produces a vector without the Missing type"
function getindex_not_missing3(vect, rowidxs)
    output = Vector{nonmissingtype(eltype(vect))}(undef, length(rowidxs))
    last_ind = 0
    @inbounds for i in eachindex(rowidxs)
        output[last_ind+=1] = vect[rowidxs[i]]
    end
    output
end
vect = ones(Union{Float32,Missing}, 10^7)
row_pos = 1:10^7
@time getindex_not_missing3(vect, row_pos) # 2 allocations
# 0.008571 seconds (2 allocations: 38.147 MiB)

function dropmissing8(df::AbstractDataFrame)
    # Note: timings on the right hand side are for a vector of 10^7 Float32 elements
    # get bitvector of rows with no missing values
    rowmask = completecases(df) # 8.6ms
    # convert to indices -> this also informs us about the right size
    rowidxs = DataFrames._findall(rowmask) # 5-6ms
    # init new columns (data)
    new_columns = Vector{AbstractVector}(undef, ncol(df))
    # grab data references
    df_columns = DataFrames._columns(df)
    # threading decision rule borrowed from `_threaded_getindex`
    if nrow(df) >= 1_000_000 && Threads.nthreads() > 1
        @sync for i in eachindex(new_columns)
            # creates a vector of the right size, without missing type, and copies only data at indices in rowidxs
            # replaces filtering + disallowmissing in two steps
            Threads.@spawn @inbounds new_columns[i] = getindex_not_missing3(df_columns[i], rowidxs) #c. 3.6ms per call
        end
    else
        for i in eachindex(new_columns)
            @inbounds new_columns[i] = getindex_not_missing3(df_columns[i], rowidxs)
        end
    end
    return DataFrame(new_columns, copy(DataFrames.index(df)), copycols=false)
end

Timings

basecase = (; rows_total=10000000, cols_total=10, cols_missing=10, missingness_share=0.1)
df = generate_data(; basecase...)
@time dropmissing8(df);
# 0.036446 seconds (270 allocations: 295.067 MiB)
# without the threading it would be: 0.061811 seconds (98 allocations: 295.358 MiB)
@btime dropmissing8($df); # 51-52ms not threaded / 27-28ms threaded
# 27.359 ms (172 allocations: 295.06 MiB)


# Baseline
@time dropmissing(df);
# 0.074098 seconds (205 allocations: 594.339 MiB)
@btime dropmissing($df);
# 60.561 ms (202 allocations: 594.34 MiB)

# Correctness test - pass
@time newdf = dropmissing8(df);
isequal.(Matrix(newdf), Matrix(dropmissing(df))) |> all # true

@svilupp
Contributor Author

svilupp commented Jan 7, 2023

Just out of curiosity, I checked how it would look if threading was always on (dropmissing9):

  • It seems that on my system the break-even is at 1000 rows (with 10 columns/6 threads), whilst the decision rule looks for >1 million rows...
  • In addition, I'm not sure what the difference is between Threads.@threads and @spawn-ing, but the former seems to have better performance on my system? (I need to double-check it's not just because the if condition was removed, but that branch should cost about the same)

image

"Identical to dropmissing8 but using `Threads.@threads`"
function dropmissing9(df::AbstractDataFrame)
    # Note: timings on the right hand side are for a vector of 10^7 Float32 elements
    # get bitvector of rows with no missing values
    rowmask = completecases(df) # 8.6ms
    # convert to indices -> this also informs us about the right size
    rowidxs = DataFrames._findall(rowmask) # 5-6ms
    # init new columns (data)
    new_columns = Vector{AbstractVector}(undef, ncol(df))
    # grab data references
    df_columns = DataFrames._columns(df)
    Threads.@threads for i in eachindex(new_columns)
        # creates a vector of the right size, without missing type, and copies only data at indices in rowidxs
        # replaces filtering + disallowmissing in two steps
        @inbounds new_columns[i] = getindex_not_missing3(df_columns[i], rowidxs) #c. 3.6ms per call
    end
    return DataFrame(new_columns, copy(DataFrames.index(df)), copycols=false)
end

@bkamins
Member

bkamins commented Jan 8, 2023

It seems that on my system the break-even is at 1000 rows (with 10 columns/6 threads), whilst the decision rule looks for >1 million rows...

We wanted to be on the safe side (i.e. not use threads unless there is a clear benefit). In particular, the threading benefit might vary across the machines the code runs on.

Still - the optimal threshold might be different for different operations so it does not have to be the same everywhere.

In addition, I'm not sure what the difference is between Threads.@threads and @spawn-ing, but the former seems to have better performance on my system? (I need to double check it's not because of removing the IF condition, but that should be as costly)

We need to keep Julia 1.6 compatibility, where @threads only had a static scheduler and there was an issue with @threads composability (i.e. when a DataFrames.jl operation that is potentially multi-threaded would itself be spawned in multi-threaded code). @spawn does not have these issues.
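The @sync/@spawn shape described above can be sketched as follows (a minimal illustration with a hypothetical name, not the package's code): each per-column task goes through the dynamic scheduler, so the pattern composes when the caller is itself inside a threaded region, and it behaves the same on Julia 1.6.

```julia
# Apply `f` to each column on its own task; @sync waits for all of them.
# Each task writes to a distinct slot of `out`, so no locking is needed.
function map_columns_spawned(f, cols::AbstractVector)
    out = Vector{Any}(undef, length(cols))
    @sync for i in eachindex(cols)
        Threads.@spawn begin
            out[i] = f(cols[i])
        end
    end
    return out
end
```

With a single thread the spawned tasks simply run sequentially, so the code degrades gracefully.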

@svilupp
Contributor Author

svilupp commented Jan 14, 2023

Hi @bkamins,

I've tried to integrate the new implementation of dropmissing into the DataFrames.jl codebase.
Could you please check if it's acceptable?

My thinking:

  • I tried to mimic your current design and variable names, but, unfortunately, I couldn't leverage the getindex methods
  • This implementation mostly relies on _getindex_disallowmissing for a vector, which is meant to be an internal method fusing getindex + disallowmissing! to save cycles. It is then wrapped in a method of the same name for DataFrames that decides between getindex() and _getindex_disallowmissing based on the user's request
  • While getindex() has many upstream use cases, this has only one, so I chose to write fewer lines of code (e.g., not creating separate methods for dropmissing(df, :) vs dropmissing(df, :x1) vs dropmissing(df, [:x1, :x2]))
  • I've broken your abstraction hierarchy, where the special threading case is hidden in a dedicated function (_getindex_threaded), to avoid unnecessary LoC, and added it inside the loop of _getindex_disallowmissing

Sharp edges:

  • Given the single purpose, I didn't add try-catch blocks to _getindex_disallowmissing to give nicer errors, because it's an internal method and its usage is safe (we know that those indices don't have missings thanks to completecases)

I've tested correctness against the default function, but once we agree on the design, I'll add some individual tests.

Next steps:
[x] Docstrings. I've added comments to explain the logic, but its usage and implementation are not changing, so I intend to keep the current docstrings
[ ] Design. Agree on design/integration of the new function
[ ] Propose a new threading rule (num_rows)
[ ] Tests

Side observations:

  • At large row volumes, the mutating version dropmissing! gets very slow (my benchmark above was wrong as it didn't set evals=1 in the benchmarking macro). I've checked the implementation with deleteat! and I don't see how I could significantly improve it without allocating, which would defeat its in-place "spirit"
  • I have re-ordered the threading if condition ("to thread or not to thread") to first check if there are any threads available (it's the cheapest one, so it can short-circuit quickly)

Performance results:
The new implementation does speed things up (_dropmissing is the new one):

basecase = (; rows_total=10^7, cols_total=10, cols_missing=3, missingness_share=0.1)
df = generate_data(; basecase...)

# Perf
@time dropmissing(df);
# 0.176595 seconds (301 allocations: 594.392 MiB, 57.60% gc time)
@time _dropmissing(df);
# 0.078146 seconds (172 allocations: 295.083 MiB)
@btime dropmissing($df);
# 60.178 ms (202 allocations: 594.39 MiB)
@btime _dropmissing($df);
# 28.326 ms (172 allocations: 295.08 MiB)

# Correctness check
newdf = _dropmissing(df);
@assert isequal(newdf, dropmissing(df))
newdf = _dropmissing(df,[:x1,:x2]);
@assert isequal(newdf, dropmissing(df,[:x1,:x2]))

benchmark_20230114

@svilupp
Contributor Author

svilupp commented Jan 14, 2023

As for the decision of when to engage in threading, I did a quick benchmark (see below).
My recommendation would be to start threading when nrow(df) > 100_000 (conservative).
However, I'd like to encourage consideration of a 10_000 threshold (IMO it's a better trade-off for real-world data sets).

I've looked at the breakeven point for threading across different numbers of rows and columns (max 6 columns, as I have 6 threads), and also at the special case where only 1 column has missings (i.e., fewer rows dropped).

Findings:

  • The breakeven on my machine seems to be around 50_000 rows already with 2 columns
  • With 6 columns (leveraging all threads), this breakeven moves to 5_000 when only 1 column has missings; it's closer to 50k if all columns have missings (more rows get dropped when more columns have missings, so the threading overhead is more significant)
  • This breakeven keeps going down but seems to stay above 1_000 rows; e.g., with 18 columns and 1 missing column (turning around each thread 3 times), the breakeven moves to c. 2_000 rows
  • I recommended 100_000 rows because I suspect that threading might be more efficient on my machine (M1 Pro/32GB) than for the average user
  • However, ~10_000 rows is worth considering - the assumption being that most users have 4 threads and most data frames have >4 columns (i.e., they are not in the "long" format during the clean-up phase);
    even if threading were slower, we're around the 10^-4s time scale, so a user would never notice an accidental slowdown, but they benefit significantly if there are many columns (e.g., with 18 columns and 50k rows the speed-up from threading is 5x)

The case when all columns have missing (more rows dropped)
Look where the green line (threaded code) crosses the red line (no threading)
benchmark_threading_missing_match_20230114

The case when only 1 column has a missing (fewer rows dropped)
benchmark_threading_missing_1_20230114

Versioninfo

System: 6 threads, 32GB RAM
Julia Version 1.8.4
Commit 00177ebc4fc (2022-12-23 21:32 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 6 virtual cores

Code:

plt_array = []
for num_cols in 1:6
    num_cols_missing = 1 # or num_cols
    basecase = (; rows_total=10^6, cols_total=num_cols, cols_missing=num_cols_missing, missingness_share=0.1)
    @time pl = plot_scenarios(Val(:belapsed), [dropmissing, _dropmissing, _dropmissing_threaded], basecase;
        rows_total=[10^3, 5*10^3, 10^4, 5*10^4, 10^5],
        title="# Cols: $num_cols / # Missing: $num_cols_missing") |> first
    push!(plt_array, plot(pl, title="# Cols: $num_cols / # Missing: $num_cols_missing"))
end
pl = plot(plt_array..., layout=(2, 3), plot_title="Effect of threading on dropmissing", size=(1000, 800))

Review comments on src/dataframe/dataframe.jl and src/other/utils.jl (outdated, resolved)
- merged all pathways into one function for all abstract dataframes
- removed other methods
- calling the function disallowmissing explicitly from the package Missings, as it otherwise conflicts with the keyword name
Review comments on test/data.jl (outdated, resolved)
@bkamins
Member

bkamins commented Jan 24, 2023

@svilupp - note that I have pushed some changes to the PR.

Review comments on test/data.jl (outdated, resolved)
@bkamins
Member

bkamins commented Jan 25, 2023

@svilupp - can you please, when you have time, check the changes in the code I made (I have made them in my head, but they pass tests, so hopefully they are OK 😄). Thank you!

@svilupp
Contributor Author

svilupp commented Jan 25, 2023

@svilupp - can you please, when you have time, check the changes in the code I made (I have made them in my head, but they pass tests, so hopefully they are OK 😄). Thank you!

All looks good! I really like the change to enumerate(eachcol(df)) - much more elegant.
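The enumerate(eachcol(df)) idiom praised above, reduced to a standalone sketch (using a plain vector of columns so it runs without DataFrames.jl; with a data frame, cols would be eachcol(df)):

```julia
# `enumerate` pairs each column's position with the column itself, so the
# loop can write into a preallocated slot instead of looking columns up
# by name on every iteration.
function missing_counts(cols)
    counts = Vector{Int}(undef, length(cols))
    for (i, col) in enumerate(cols)
        counts[i] = count(ismissing, col)
    end
    return counts
end
```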

@svilupp svilupp marked this pull request as ready for review January 25, 2023 20:05
Review comments on test/data.jl (outdated, resolved)
@bkamins
Member

bkamins commented Jan 26, 2023

I made some more small tweaks (in particular to use BitSet for lookup). Apart from this things look good. @nalimilan - can you please have a look and approve if all is OK?

Performance improvement on a real case:

julia> summary(df)
"42710197×3 DataFrame"

julia> @btime dropmissing2($df); # new
  171.168 ms (75 allocations: 883.77 MiB)

julia> @btime dropmissing($df); # old
  355.706 ms (82 allocations: 1.58 GiB)
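The BitSet tweak mentioned above, sketched in isolation (a hedged illustration with a hypothetical name; the exact PR code may differ): when only a subset of columns is checked for missings, membership tests against a BitSet of column indices are constant-time bit tests rather than linear scans of a Vector{Int}.

```julia
# Mark, for each of `ncols` columns, whether its index is in the selected
# set; `in` on a BitSet is a constant-time bit test.
function selected_flags(ncols::Int, selected)
    sel = BitSet(selected)
    return Bool[i in sel for i in 1:ncols]
end
```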

Member

@nalimilan nalimilan left a comment


Thanks!

Review comments on src/abstractdataframe/abstractdataframe.jl (outdated, resolved)
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit fdd9193 into JuliaData:main Jan 27, 2023
@bkamins
Member

bkamins commented Jan 27, 2023

Thank you! I hope you enjoyed the process. You are welcome to open other PRs.

Successfully merging this pull request may close these issues.

improve performance of dropmissing
3 participants