Functions for column- and row-wise processing #956

Closed
abbradar opened this issue May 13, 2016 · 11 comments

@abbradar

abbradar commented May 13, 2016

Hi,

It would be nice to have a family of functions for row- and column-wise processing of a DataFrame. This can be useful e.g. for various normalizing operations. What I envision:

mapcol(f :: Function, df :: DataFrame) -> DataFrame
mapcol!(f :: Function, df :: DataFrame)
maprow(f :: Function, df :: DataFrame) -> DataFrame
maprow!(f :: Function, df :: DataFrame)

where f maps an Array to an Array. They would be trivial to implement, but also very useful. If we agree on the details (i.e. names and general interest in this), I can provide a PR. Similar functions exist in R and in Python's pandas, but a DataFrame is always two-dimensional, so I think two distinct functions would serve us better. Here is an example implementation of mapcol! to show my idea (credits to Ismael-VC from Julia's Gitter room, where I asked whether such a function already exists):

function mapcol!(f :: Function, df :: DataFrame)
    for (name, column) in eachcol(df)
        df[name] = f(column)
    end
end

Last but not least, I'm a newcomer and may have just missed some way that already exists to do this. If so, I apologize!

EDIT: A little bikeshedding: perhaps we want them to be named apply*, not map*.

@abbradar abbradar changed the title from "Function for column-wise processing" to "Functions for column- and row-wise processing" May 13, 2016
@quinnj
Member

quinnj commented Sep 7, 2017

@cjprybol, do you think we cover this pretty well now?

@cjprybol
Contributor

I think these would be nice to have, thank you for the suggestion @abbradar! I couldn't think of an obvious way to do this without writing a for-loop, which I imagine would be alienating for users who prefer to use map instead of for-loop iteration. As stated above:

They would be trivial to implement, but also very useful

So, 👍

If you are still interested in opening a PR with these changes @abbradar, please do so!

@nalimilan
Member

I'm not sure this would be a good idea, as these functions would encourage users to write vectorized code even when an in-place element-wise operation would be possible. Julia is much more powerful than R and pandas (which require vectorized functions for performance) in that regard, so we don't need to implement the same APIs.

Can you give examples of cases where you would like to use these functions? That would be helpful to see what would be the best API to do this both conveniently and efficiently.

@abbradar
Author

abbradar commented Sep 12, 2017

@cjprybol I have very little time on my hands lately, but I'm interested. Just don't expect anything for at least a month :D (if anybody else wants to implement this, that's of course okay with me!)

@nalimilan I can't find an example right now but I bet it was something similar to:

    for (name, column) in eachcol(df)
        df[name] = df[name] / sum(df[name])
    end

except I don't like for-loops.

P.S. Notice that my Julia is a bit rusty now so this code might not work -- but I hope you got the idea.

@nalimilan
Member

Thanks for the example. For this kind of use case, a more efficient in-place version can currently be written like this (assuming the columns are already floating point):

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])
foreach(col -> scale!(col[2], 1/sum(col[2])), eachcol(df))
# Or:
foreach(col -> col[2] .= col[2] ./ sum(col[2]), eachcol(df))

Of course it's not ideal. I guess something like mapcol! would make this slightly more concise by not passing the name of the column as the first element of the tuple. It would also allow non-mutating operations where needed (e.g. if the input column is integer, so that it can't hold the result). That could work like this:

# In-place
mapcol!(col -> scale!(col, 1/sum(col)), df)
mapcol!(col -> col .= col ./ sum(col), df)

# Copying
mapcol!(col -> col/sum(col), df)

So maybe that would be useful. It's annoying that the in-place version is longer than the copying version, which means people will probably use the latter by default even if it's less efficient.

Yet another approach would be to use the broadcast mechanism, considering data frames as matrix-like. The above example could be written like this (this would only work if colwise returned a row vector; currently we need transpose(colwise(sum, df))):

# In-place
broadcast!(/, df, df, colwise(sum, df))
df .= df ./ colwise(sum, df)

# Copying
df2 = df ./ colwise(sum, df)

Both approaches could be implemented at the same time, as each has its merits.

@pdeffebach
Contributor

It seems like a rowwise operator would be good here.

vars = [:x1, :x2, :x3]
df[:newvar] = rowwise(mean, df[vars])

This is also useful because it reduces the urge to treat a dataframe like a matrix, since mapslices doesn't, and shouldn't, work on dataframes.

I will try to put together a PR, though I wonder whether this implementation would be most efficient with the eachrow for-loop from #1449.

@nalimilan
Member

Maybe define reduce(mean, df[vars], dims=2) for that? In terms of efficiency, the only way to make it fast is to internally dispatch this to a function taking a tuple of columns, which will be specialized on their number and types.
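
A minimal sketch of that tuple-of-columns idea, using plain vectors rather than a data frame. The helper name rowwise_sketch is hypothetical and is not part of DataFrames; it only illustrates how a method taking a tuple gets specialized on the number and element types of the columns:

```julia
using Statistics: mean

# Hypothetical helper (not a DataFrames function): apply `f` to each row
# of a tuple of equal-length columns. Because `cols` is a Tuple, Julia
# compiles a method specialized on how many columns there are and on
# each column's type.
function rowwise_sketch(f, cols::Tuple)
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(DimensionMismatch("columns differ in length"))
    # Build the i-th row as a tuple of values, then reduce it with `f`.
    return [f(map(c -> c[i], cols)) for i in 1:n]
end

x1 = [1.0, 2.0]
x2 = [3.0, 4.0]
x3 = [5.0, 6.0]
result = rowwise_sketch(mean, (x1, x2, x3))
```

A data frame method could, under this assumption, extract its columns into a tuple once and forward to such a function, so the type instability is paid only at that single call.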

@pdeffebach
Contributor

Let's move this discussion to #1459. Are you giving the go-ahead for rowwise, or saying that this operation should instead be an overload of reduce? It's no big deal either way, but I thought I'd submit a concrete proposal for discussion.

@nalimilan
Member

I'm not sure. I think I'd prefer that we develop a comprehensive API proposal for this whole issue of functions operating on columns vs. on rows, and whether we want to use the same API as matrices or something completely different. That's very similar to the question of whether nrow/ncol should be provided instead of size (#1200).

@JuliaData JuliaData deleted a comment from skanskan Oct 18, 2018
@nalimilan
Member

We now have mapcols/mapcols!, and it could make sense to have map/map! to operate over rows. See #1514.

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@nalimilan
Member

Now that mapcols and broadcast are implemented for data frames, all constructs mentioned at #956 (comment) are supported. We may consider adding map to operate over rows, but map(f, eachrow(df)) works so I'll close this issue for now.
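
For reference, a short sketch of how the normalization example from earlier in this thread might look with mapcols and with map over eachrow. It assumes a recent DataFrames.jl release; the data and variable names are made up for illustration:

```julia
using DataFrames

df = DataFrame(a = [1.0, 3.0], b = [2.0, 6.0])

# Copying: mapcols applies the function to every column and returns a
# new data frame, leaving df untouched.
normalized = mapcols(col -> col ./ sum(col), df)

# Row-wise: map over eachrow(df); each row is passed as a DataFrameRow,
# and sum iterates over its values.
row_sums = map(sum, eachrow(df))
```

Here normalized holds each column divided by its own sum, and row_sums is an ordinary vector with one entry per row.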

5 participants