define Missings.replace method #1377

tpapp · 2018-03-07T10:19:27Z

In light of this discussion and similar ones, it may make sense to define something along the lines of

function Missings.replace(df::DataFrame, value)
    map(col -> collect(Missings.replace(col, value)), eachcol(df))
end

nalimilan · 2018-03-07T14:04:12Z

Missings.replace is lazy, so I don't think that's appropriate for a data frame. I don't have a good solution to suggest...

tpapp · 2018-03-07T14:48:49Z

The (::Any, ::Any) method of Missings.replace is indeed lazy and documented as such, but I am not sure that insisting on lazyness is the right approach for all data types. Does it have to be lazy everywhere? (Should I open an issue for Missings? I guess it would reach the same people).

nalimilan · 2018-03-07T15:33:31Z

Well, the idea is that we have replace(x, missing=>rep) to get a copy of the input vector on Julia 0.7 (JuliaLang/julia#26206 will improve the return type), so the reason why Missings.replace exists is that it's lazy, like Missings.skip and Missings.fail. I would find it confusing to mix lazy and non-lazy semantics in the same function. I don't think we have functions with mixed semantics like that in Base

tpapp · 2018-03-07T15:48:56Z

This sounds reasonable. Perhaps a method for

Base.replace(df::AbstractDataFrame, pat => value)

then? AFAICT that is not required to be lazy.

nalimilan · 2018-03-07T15:56:20Z

Yeah, but currently we don't implement any methods to apply an operation to all columns, you need to do that manually. If we start doing that, why not also implement exp or any other standard function? It could make sense to use broadcasting for that, via something like f.(columns(df)) or f.(rows(df)). So that would give replace.(df, missing => rep). Obviously that's much more ambitious, though, but API consistency really matters.

tpapp · 2018-03-07T16:21:07Z

So that would give replace.(df, missing => rep)

I am not sure I understand, I think of dataframes as collections, so I would not broadcast replace.

API consistency really matters

I agree, but currently it is somewhat convoluted to do some simple operations. Cf R,

> df = data.frame(a = c(NA, 1), b = c(2, NA))
> df[is.na(df)] = 0
> df
  a b
1 0 2
2 1 0

nalimilan · 2018-03-07T16:40:29Z

I am not sure I understand, I think of dataframes as collections, so I would not broadcast replace.

The problem is that data frames can be seen either as collections of scalars, as collections of rows or as collections of columns. Currently we don't assume people want to treat them as collections of scalars. If we changed that, we would need a consistent approach, making data frames more similar to matrices for all relevant functions (e.g. sum(df), sum(df, 2), exp.(df), etc.). That could make sense, but that requires a careful design and discussion. Barring that, I'm not sure we should add APIs piecemeal, or we will end with a very complex system that nobody masters.

bkamins · 2019-12-02T20:46:49Z

One can currently write:

ifelse.(ismissing.(df), some_replacement, df)

maybe it is not the most friendly, but I guess it should be good enough?

nalimilan · 2019-12-04T13:17:24Z

Do you think we could allow replace.(df, missing => replacement) or replace.(df, Ref(missing => replacement))?

bkamins · 2019-12-04T20:07:31Z

I have considered this option before writing the ifelse recommendation. The problem is that replace is vectorized by itself so:

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> replace.(x, 1=>10)
ERROR: MethodError: no method matching similar(::Int64, ::Type{Int64})

So if we wanted to add it we should define:

replace(df::AbstractDataFrame, replacements; cols, inplace=false) =
    replace!(copy(df), replacements; cols=cols, inplace=inplace)

function replace!(df::AbstractDataFrame, replacements; cols, inplace=false)
    for col in index(df)[cols]
        if inlpace
            replace!(df[!, col], replacements...)
        else
            df[!, col] = replace(df[!, col], replacements...)
        end
    end
end

and inplace decides if we want to mutate the columns or replace them (this is relevant especially when replacement value has a type that is not allowed by the source column).

nalimilan mentioned this issue Sep 20, 2018

Review row vs. column orientation of API #1514

Closed

6 tasks

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

define Missings.replace method #1377

define Missings.replace method #1377

tpapp commented Mar 7, 2018

nalimilan commented Mar 7, 2018

tpapp commented Mar 7, 2018

nalimilan commented Mar 7, 2018

tpapp commented Mar 7, 2018 •

edited

Loading

nalimilan commented Mar 7, 2018

tpapp commented Mar 7, 2018

nalimilan commented Mar 7, 2018

bkamins commented Dec 2, 2019

nalimilan commented Dec 4, 2019

bkamins commented Dec 4, 2019

define Missings.replace method #1377

define Missings.replace method #1377

Comments

tpapp commented Mar 7, 2018

nalimilan commented Mar 7, 2018

tpapp commented Mar 7, 2018

nalimilan commented Mar 7, 2018

tpapp commented Mar 7, 2018 • edited Loading

nalimilan commented Mar 7, 2018

tpapp commented Mar 7, 2018

nalimilan commented Mar 7, 2018

bkamins commented Dec 2, 2019

nalimilan commented Dec 4, 2019

bkamins commented Dec 4, 2019

tpapp commented Mar 7, 2018 •

edited

Loading