Functions for column- and row-wise processing #956

abbradar · 2016-05-13T00:56:56Z

Hi,

It would be nice to have a family of functions for row- and column-wise processing of a DataFrame. This can be useful e.g. for various normalizing operations. What I envision:

mapcol(f :: Function, df :: DataFrame) -> DataFrame
mapcol!(f :: Function, df :: DataFrame)
maprow(f :: Function, df :: DataFrame) -> DataFrame
maprow!(f :: Function, df :: DataFrame)

where f is Array -> Array. They would be trivial to implement, but also very useful. If we agree on the details (i.e. names and general interest in this), I can provide a PR. Similar functions exist in R and Python's pandas, but a DataFrame is always two-dimensional, so I think two distinct functions would serve us better. Here is an example implementation of mapcol! to show my idea (credits to Ismael-VC from Julia's Gitter room, where I asked if there is already such a function):

function mapcol!(f :: Function, df :: DataFrame)
    for (name, column) in eachcol(df)
        df[name] = f(column)
    end
end

Last but not least, I'm a newcomer and may have just missed some way that already exists to do this. If so, I apologize!

EDIT: A little bikeshedding: perhaps we want them to be named apply*, not map*.


quinnj · 2017-09-07T05:07:50Z

@cjprybol, do you think we cover this pretty well now?

cjprybol · 2017-09-11T23:51:15Z

I think these would be nice to have, thank you for the suggestion @abbradar! I couldn't think of an obvious way to do this without writing a for-loop, which I imagine would be alienating for users who prefer to use map instead of for-loop iteration. As stated above:

They would be trivial to implement, but also very useful

So, 👍

If you are still interested in opening a PR with these changes @abbradar, please do so!

nalimilan · 2017-09-12T11:26:41Z

I'm not sure this would be a good idea, as these functions would encourage users to write vectorized code when an in-place element-wise operation would be possible. Julia is much more powerful than R and Pandas (which require vectorized functions for performance) in that regard, so we don't need to implement the same APIs.

Can you give examples of cases where you would like to use these functions? That would be helpful to see what would be the best API to do this both conveniently and efficiently.

abbradar · 2017-09-12T11:55:35Z

@cjprybol I have very little time on my hands lately but I'm interested. Just don't expect anything in a month at least :D (if anybody else wants to implement this, I'm of course okay!)

@nalimilan I can't find an example right now but I bet it was something similar to:

    for (name, column) in eachcol(df)
        df[name] = df[name] / sum(df[name])
    end

except I don't like for-loops.

P.S. Notice that my Julia is a bit rusty now so this code might not work -- but I hope you got the idea.

nalimilan · 2017-09-12T20:26:42Z

Thanks for the example. For this kind of use case, a more efficient in-place version can currently be written like this (assuming the columns are already floating point):

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])
foreach(col -> scale!(col[2], 1/sum(col[2])), eachcol(df))
# Or:
foreach(col -> col[2] .= col[2] ./ sum(col[2]), eachcol(df))

Of course it's not ideal. I guess something like mapcol! would make this slightly more concise by not passing the name of the column as the first element of the tuple. It would also allow non-mutating operation where needed (e.g. if the input column is integer so that it can't hold the result). That could work like this:

# In-place
mapcol!(col -> scale!(col, 1/sum(col)), df)
mapcol!(col -> col .= col ./ sum(col), df)

# Copying
mapcol!(col -> col/sum(col), df)

So maybe that would be useful. It's annoying that the in-place version is longer than the copying version, which means people will probably use the latter by default even if it's less efficient.
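To make the in-place versus copying distinction concrete, here is a minimal sketch on a plain vector rather than a data frame column. It assumes Julia 1.0 or later, where `scale!` no longer exists and `LinearAlgebra.rmul!` does the same job; the variable names are only illustrative.

```julia
using LinearAlgebra  # standard library; provides rmul!

col = [1.0, 3.0]

# Copying: allocates a new vector and leaves `col` untouched.
normalized = col / sum(col)

# In-place, broadcast form: overwrites `a` without allocating a new vector.
a = copy(col)
a .= a ./ sum(a)

# In-place, rmul! form: the Julia 1.0+ replacement for `scale!(v, s)`.
b = copy(col)
rmul!(b, 1 / sum(b))
```

All three produce the same values; only the first leaves the input unchanged and allocates a result.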

Yet another approach would be to use the broadcast mechanism, considering data frames as matrix-like. The above example could be written like this (this would only work if colwise returned a row vector; currently we need transpose(colwise(mean, df))):

# In-place
broadcast!(/, df, df, colwise(sum, df))
df .= df ./ colwise(sum, df)

# Copying
df2 = df ./ colwise(sum, df)

Both approaches could be implemented at the same time, as each has its merits.

pdeffebach · 2018-07-19T00:02:27Z

It seems like a rowwise operator would be good here.

vars = [:x1, :x2, :x3]
df[:newvar] = rowwise(mean, df[vars])

This is also useful because it reduces the urge to treat a dataframe like a matrix, since mapslices doesn't, and shouldn't, work on dataframes.

I will try to put together a PR, though I wonder whether this would be most efficient when implemented with an eachrow for-loop from #1449.

nalimilan · 2018-07-19T14:25:37Z

Maybe define reduce(mean, df[vars], dims=2) for that? In terms of efficiency, the only way to make it fast is to internally dispatch this to a function taking a tuple of columns, which will be specialized on their number and types.
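The tuple-of-columns idea can be sketched roughly as follows. This is only an illustration under assumptions: `rowwise_sketch` is a hypothetical helper, not a DataFrames.jl function, and it works on plain vectors so that it needs nothing beyond the standard library.

```julia
using Statistics  # standard library; provides mean

# Hypothetical helper (not part of DataFrames.jl): applies `f` to each row
# formed from a tuple of equal-length column vectors. Because the columns
# arrive as a tuple, their number and element types are part of the argument
# type, which lets Julia compile a version specialized on them.
function rowwise_sketch(f, cols::Tuple{Vararg{AbstractVector}})
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(DimensionMismatch("columns differ in length"))
    # `map` over a tuple returns a tuple, so each row is built as a tuple
    # rather than as a heap-allocated array.
    return [f(map(c -> c[i], cols)) for i in 1:n]
end

x1 = [1.0, 2.0]
x2 = [3.0, 4.0]
x3 = [5.0, 9.0]
row_means = rowwise_sketch(mean, (x1, x2, x3))
```

Here `row_means` holds one mean per row, computed across the three columns.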

pdeffebach · 2018-07-19T15:40:11Z

Let's move this discussion to #1459. Are you giving the go-ahead for rowwise, or saying that this operation should instead be an overload of reduce? It's no big deal either way, but I thought I'd submit a concrete proposal for discussion.

nalimilan · 2018-07-21T22:30:42Z

I'm not sure. I think I'd prefer that we develop a comprehensive API proposal for this whole issue of functions operating on columns vs. on rows, and whether we want to use the same API as matrices or something completely different. That's very similar to the question of whether nrow/ncol should be provided instead of size (#1200).

nalimilan · 2018-12-26T13:51:28Z

We now have mapcols/mapcols!, and it could make sense to have map/map! to operate over rows. See #1514.

nalimilan · 2019-09-03T10:12:03Z

Now that mapcols and broadcast are implemented for data frames, all constructs mentioned at #956 (comment) are supported. We may consider adding map to operate over rows, but map(f, eachrow(df)) works so I'll close this issue for now.
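For readers arriving later, a short sketch of how the constructs named in this thread look in practice. It assumes a recent DataFrames.jl release (1.x) is installed, where `eachcol` yields the column vectors themselves rather than (name, column) pairs; the column names `a` and `b` are just the ones from the earlier example.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x is installed

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# Copying: returns a new data frame and leaves `df` untouched.
df2 = mapcols(col -> col ./ sum(col), df)

# Row-wise: `map` over `eachrow` returns one value per row.
row_sums = map(row -> row.a + row.b, eachrow(df))

# Mutating: replaces each column of `df3` with the value returned by the function.
df3 = copy(df)
mapcols!(col -> col ./ sum(col), df3)

# Fully in-place: updates the existing column vectors of `df4`.
df4 = copy(df)
foreach(col -> col ./= sum(col), eachcol(df4))
```

Note that `mapcols!` with a function that returns a new vector still allocates that vector per column; only the `foreach` form reuses the existing storage.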

abbradar changed the title ~~Function for column-wise processing~~ Functions for column- and row-wise processing May 13, 2016

quinnj mentioned this issue Sep 7, 2017

Apply functions to rows with NAs. #1114

Closed

JuliaData deleted a comment from skanskan Oct 18, 2018

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

nalimilan closed this as completed Sep 3, 2019
