Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

define Missings.replace method #1377

Open
tpapp opened this issue Mar 7, 2018 · 10 comments
Open

define Missings.replace method #1377

tpapp opened this issue Mar 7, 2018 · 10 comments
Labels
non-breaking The proposed change is not breaking

Comments

@tpapp
Copy link

tpapp commented Mar 7, 2018

In light of this discussion and similar ones, it may make sense to define something along the lines of

function Missings.replace(df::DataFrame, value)
    map(col -> collect(Missings.replace(col, value)), eachcol(df))
end
@nalimilan
Copy link
Member

Missings.replace is lazy, so I don't think that's appropriate for a data frame. I don't have a good solution to suggest...

@tpapp
Copy link
Author

tpapp commented Mar 7, 2018

The (::Any, ::Any) method of Missings.replace is indeed lazy and documented as such, but I am not sure that insisting on lazyness is the right approach for all data types. Does it have to be lazy everywhere? (Should I open an issue for Missings? I guess it would reach the same people).

@nalimilan
Copy link
Member

Well, the idea is that we have replace(x, missing=>rep) to get a copy of the input vector on Julia 0.7 (JuliaLang/julia#26206 will improve the return type), so the reason why Missings.replace exists is that it's lazy, like Missings.skip and Missings.fail. I would find it confusing to mix lazy and non-lazy semantics in the same function. I don't think we have functions with mixed semantics like that in Base

@tpapp
Copy link
Author

tpapp commented Mar 7, 2018

This sounds reasonable. Perhaps a method for

Base.replace(df::AbstractDataFrame, pat => value)

then? AFAICT that is not required to be lazy.

@nalimilan
Copy link
Member

Yeah, but currently we don't implement any methods to apply an operation to all columns, you need to do that manually. If we start doing that, why not also implement exp or any other standard function? It could make sense to use broadcasting for that, via something like f.(columns(df)) or f.(rows(df)). So that would give replace.(df, missing => rep). Obviously that's much more ambitious, though, but API consistency really matters.

@tpapp
Copy link
Author

tpapp commented Mar 7, 2018

So that would give replace.(df, missing => rep)

I am not sure I understand, I think of dataframes as collections, so I would not broadcast replace.

API consistency really matters

I agree, but currently it is somewhat convoluted to do some simple operations. Cf R,

> df = data.frame(a = c(NA, 1), b = c(2, NA))
> df[is.na(df)] = 0
> df
  a b
1 0 2
2 1 0

@nalimilan
Copy link
Member

I am not sure I understand, I think of dataframes as collections, so I would not broadcast replace.

The problem is that data frames can be seen either as collections of scalars, as collections of rows or as collections of columns. Currently we don't assume people want to treat them as collections of scalars. If we changed that, we would need a consistent approach, making data frames more similar to matrices for all relevant functions (e.g. sum(df), sum(df, 2), exp.(df), etc.). That could make sense, but that requires a careful design and discussion. Barring that, I'm not sure we should add APIs piecemeal, or we will end with a very complex system that nobody masters.

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

One can currently write:

ifelse.(ismissing.(df), some_replacement, df)

maybe it is not the most friendly, but I guess it should be good enough?

@nalimilan
Copy link
Member

Do you think we could allow replace.(df, missing => replacement) or replace.(df, Ref(missing => replacement))?

@bkamins
Copy link
Member

bkamins commented Dec 4, 2019

I have considered this option before writing the ifelse recommendation. The problem is that replace is vectorized by itself so:

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> replace.(x, 1=>10)
ERROR: MethodError: no method matching similar(::Int64, ::Type{Int64})

So if we wanted to add it we should define:

replace(df::AbstractDataFrame, replacements; cols, inplace=false) =
    replace!(copy(df), replacements; cols=cols, inplace=inplace)

function replace!(df::AbstractDataFrame, replacements; cols, inplace=false)
    for col in index(df)[cols]
        if inlpace
            replace!(df[!, col], replacements...)
        else
            df[!, col] = replace(df[!, col], replacements...)
        end
    end
end

and inplace decides if we want to mutate the columns or replace them (this is relevant especially when replacement value has a type that is not allowed by the source column).

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

3 participants