Added pivot() function (for updated pull request #1175) #1181

sylvaticus · 2017-04-14T10:26:07Z

While the implementation of pull request #1175 was done directly online on github.com, this is done in the more classic clone, commit & pull way (as I wasn't able to add files to the pull request online).

This commit implements the original pivot() function that was the subject of pull request #1175 and the subsequent comments by @ararslan:

- added test
- used lowercase
- used 4-space indents
- removed parentheses around if statements
- reformatted docstring (following those of unstack())

@ararslan


nalimilan · 2017-05-04T08:03:52Z

See Pandas' pivot_table function for prior art from which we could take inspiration (see the end of this page).

sylvaticus · 2017-05-04T08:47:56Z

Hi, thank you for your comment. Actually, the proposed pivot() function behaves much like the Pandas pivot_table() function.
Differences:

- it can take only one value column, as Julia DataFrames do not support multiindex;
- conversely, the aggregation function (parameter ops) supports multiple operations, e.g. you may want both sum and count, and it defaults to sum;
- I included optional keyword arguments for filtering and sorting, as this is also very common and it's provided in equivalent spreadsheet software.

I don't think two separate pivot/pivot_table functions would be needed, as one can use ops=count to see if there are multiple rows with the same column indexes.

nalimilan · 2017-06-24T14:44:53Z

Sorry for the delay. I'm fine with adding pivot, though here's a list of remarks/questions to address:

- Are we sure we don't need an equivalent of Pandas' pivot? Does our stack fill all the use cases of pivot? Else, we could follow Pandas' pivot behavior (i.e. fill cells with the value from the corresponding row, without aggregation) by default, and only perform an aggregation if an aggregation function is provided. (AFAICT passing ops=count is very different from Pandas' pivot.)
- The default to sum isn't consistent with Pandas, which uses mean. Unless we have strong reasons to diverge, better use the same default I would say. We could also have no default at all, as in colwise; at least that would be explicit (and we can always change it later).
- I don't like the presence of filtering and sorting arguments: if we start adding them here, we could also add them elsewhere and it makes the API more complex. These operations can easily be applied separately in a series of chained operations.
- On the other hand, it would be nice to support the margins argument, though that's not a strict requirement as it can be added later.
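
To make the first remark concrete, here is a minimal sketch in plain Python (standard library only, not the DataFrames.jl or Pandas API) of a pivot that fills cells without aggregation by default and aggregates only when a function is supplied. The function name `pivot`, its parameters, and the `sales` data are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def pivot(rows, index, columns, values, aggfunc=None):
    """Reshape long-format rows (dicts) into {index_key: {column_key: cell}}.

    With aggfunc=None every (index, column) pair must occur exactly once and
    the cell holds the raw value; otherwise aggfunc reduces the list of values.
    """
    # Collect the values that fall into each (index, column) cell.
    groups = defaultdict(list)
    for row in rows:
        groups[(row[index], row[columns])].append(row[values])

    table = defaultdict(dict)
    for (i, c), vals in groups.items():
        if aggfunc is None:
            # Strict mode: refuse to pick a value silently when cells collide.
            if len(vals) > 1:
                raise ValueError(f"duplicate entries for {(i, c)!r}; pass aggfunc")
            table[i][c] = vals[0]
        else:
            table[i][c] = aggfunc(vals)
    return dict(table)


# Illustrative data: ("south", 2016) appears twice, so it needs an aggregator.
sales = [
    {"region": "north", "year": 2016, "amount": 10},
    {"region": "north", "year": 2017, "amount": 20},
    {"region": "south", "year": 2016, "amount": 5},
    {"region": "south", "year": 2016, "amount": 7},
]

print(pivot(sales, "region", "year", "amount", aggfunc=mean))
```

Calling `pivot(sales, "region", "year", "amount")` without `aggfunc` raises an error on this data because of the duplicated cell, which is the explicit behavior the remark argues for.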

Regarding the implementation:

- It would be more efficient to call groupby once, and then call aggregate on the returned object (the WIP PR "Modify aggregate for efficiency", DataTables.jl#65, should make it much faster and could be backported to DataFrames), rather than calling by repeatedly.
- Better make ops a positional argument, that will make it easier to handle the case of a single function and that of a vector of functions (see the example of colwise).
- Please try to follow more closely the conventions used in the codebase regarding in particular lowercase identifiers, names of arguments/vocabulary, maximum line width of 92 chars, empty lines, spacing, indentation, position of line breaks. And be consistent inside a function.
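
The first two implementation points can be sketched as follows, again in plain Python rather than Julia. The helper names `group_once` and `aggregate` and the sample rows are hypothetical. The sketch only shows the shape of the idea: build the groups in a single pass, then apply either one function or a list of functions to the same grouped object.

```python
from collections import defaultdict


def group_once(rows, keys, value):
    """Single pass over the data: map each key tuple to its list of values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return groups


def aggregate(groups, ops):
    """Apply one function, or a list of functions, to every pre-built group."""
    if callable(ops):
        # Normalise the single-function case so both forms share one code path.
        ops = [ops]
    return {
        key: {op.__name__: op(vals) for op in ops}
        for key, vals in groups.items()
    }


rows = [
    {"region": "north", "year": 2016, "amount": 10},
    {"region": "north", "year": 2017, "amount": 20},
    {"region": "south", "year": 2016, "amount": 5},
    {"region": "south", "year": 2016, "amount": 7},
]

groups = group_once(rows, ["region", "year"], "amount")  # grouping happens once
summary = aggregate(groups, [sum, len])                  # several operations
single = aggregate(groups, sum)                          # a single operation
print(summary)
```

Because the grouping is computed once and reused, adding another operation costs one more reduction per group rather than another full pass over the rows.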

EDIT: pivot should also be mentioned in the Documenter manual.

bkamins · 2019-12-02T08:23:14Z

@nalimilan - I think I would prefer to have pivot in FreqTables.jl and close the issue here. In this way we can have more than two dimensions of aggregation and more than one source => aggregator options. Any opinion on this?

nalimilan · 2019-12-02T15:05:21Z

Well FreqTables is for frequency tables, but no frequencies are involved here.

bkamins · 2019-12-02T15:13:04Z

I know, but "frequency" is just length, and it could be substituted by any aggregation function and all else could stay essentially the same (so we could "stretch" FreqTables.jl to cover more general pivoting, but I agree that the name is not ideal).
Otherwise maybe we should have a PivotTables.jl package that would be built on top of Tables.jl so that it would process any tabular data (not necessarily DataFrames.jl).

It all depends on how much you want to invest into FreqTables.jl.

nalimilan · 2019-12-02T17:53:32Z

Yeah I guess we could have a function which makes sense for frequencies, but also happens to be more generally useful. Though it could also make sense to have it in DataFrames. It's hard to write it using the Tables.jl API anyway since it relies on many DataFrames-specific functions (AFAICT).

bkamins · 2019-12-02T19:50:17Z

We could add pivot to DataFrames.jl, but what I think is that it should not produce a DataFrame but rather an array with named axes. And the point is that I would prefer to avoid adding such a dependency to DataFrames.jl.

Alternatively, I can implement limited functionality similar to what is proposed here in DataFrames.jl (i.e. producing a DataFrame). But then I would close this PR and open a new one, to make it consistent with the current design of the package. Would you prefer this? (It would be a relatively simple PR.)

bkamins · 2019-12-02T20:13:06Z

Also, the question is whether we need it at all, as it is essentially an unstack of the result of by.
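
A rough illustration of that composition, written in plain Python with hypothetical stand-ins named `by` and `unstack` (they mimic the idea only and are not the DataFrames.jl functions): aggregate per key combination in long format first, then spread one key across the columns.

```python
from collections import defaultdict


def by(rows, keys, value, func):
    """Split-apply-combine: return one long-format row per key combination."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return [
        {**dict(zip(keys, key)), value: func(vals)}
        for key, vals in groups.items()
    ]


def unstack(rows, rowkey, colkey, value):
    """Long to wide: one output row per rowkey, one field per colkey level."""
    wide = {}
    for row in rows:
        out = wide.setdefault(row[rowkey], {rowkey: row[rowkey]})
        out[row[colkey]] = row[value]
    return list(wide.values())


sales = [
    {"region": "north", "year": 2016, "amount": 10},
    {"region": "north", "year": 2017, "amount": 20},
    {"region": "south", "year": 2016, "amount": 5},
    {"region": "south", "year": 2016, "amount": 7},
]

long_rows = by(sales, ["region", "year"], "amount", sum)    # aggregated, long
wide_rows = unstack(long_rows, "region", "year", "amount")  # pivoted, wide
print(wide_rows)
```

The two steps stay independently useful, which is the argument for not adding a dedicated pivot function.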

nalimilan · 2019-12-02T20:37:34Z

I don't have a strong opinion. I guess we can wait until somebody cares enough to make a PR.

bkamins · 2019-12-02T20:42:15Z

I will add a description of the by-unstack combo in my tutorial.

bkamins · 2022-01-31T10:41:32Z

Closed in favor of #2998.

sylvaticus mentioned this pull request Apr 14, 2017

Add spreadsheet-like pivot() function #1175

Closed

cjprybol mentioned this pull request Sep 19, 2017

More intuitive functions #1234

Closed

bkamins mentioned this pull request Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins added the non-breaking label Feb 12, 2020

bkamins added the reshaping label Apr 25, 2021

bkamins mentioned this pull request Apr 25, 2021

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

Closed

sl-solution mentioned this pull request May 1, 2021

Transposing DataFrame #2743

Closed

This was referenced Jan 24, 2022

allow no rowkey in unstack #2995

Merged

allow function in allowduplicates in unstack #2998

Merged

bkamins closed this Jan 31, 2022
