Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added pivot() funtion (for updated pull request #1175) #1181

Closed
wants to merge 1 commit into from

Conversation

sylvaticus
Copy link
Contributor

While the implementation of pull request #1175 was done directly online on github.com, this is done in the more classic clone, commit & pull way (as I wasn't able to add files to the pull request online).

This commit implements the original pivot() function that was the subject of pull request #1175 and the subsequent comments by @ararslan:

  • added test
  • used lowercase
  • used 4 spaces idents
  • removed parenthesis around if statements
  • reformatted docstring (following those of unstack())

While the original pull request JuliaData#1175 was done directly online on github, this is done in the more classic clone, commit & pull way.
This commit implements the original pivot() function and the comments by @ararslan:
- added test
- used lowercase
- 4 spaces idents
- removed parenthesis around if statements
- reformatted docstring (following those of unstack())
@nalimilan
Copy link
Member

See Panda's pivot_table function for previous art from which we could take inspiration (see the end of this page).

@sylvaticus
Copy link
Contributor Author

Hi.. thank you for your comment.. actually the proposed pivot() function does behave much like the pivot_table() panda function:
Differences:

  • it can take only one value column, as julia DataFrames do not support multiindex;
  • conversely, the aggregation function (parameter ops) support multiple operations, e.g. you may want both sum and count, and it defaults to sum;
  • I included optional keyword arguments for filtering and sorting, as this is also very common and it's provided in equivalent spreadsheet software.

I don't think two separate pivot/pivot_table functions would be needed, as one can use ops=count to see if there are multiple rows with the same column indexes..

@nalimilan
Copy link
Member

nalimilan commented Jun 24, 2017

Sorry for the delay. I'm fine with adding pivot, though here's list of remarks/questions to address:

  • Are we sure we don't need an equivalent of Pandas' pivot? Does our stack fill all the use cases of pivot? Else, we could follow Pandas' pivot behavior (i.e. fill cells with the value from the corresponding row, without aggregation) by default, and only perform an aggregation if an aggregation function is provided. (AFAICT passing ops=count is very different from Pandas' pivot.)
  • The default to sum isn't consistent with Pandas, which uses mean. Unless we have strong reasons to diverge, better use the same default I would say. We could also have no default at all, as in colwise; at least that would be explicit (and we can always change it later).
  • I don't like the presence of filtering and sorting arguments: if we start adding them here, we could also add them elsewhere and it makes the API more complex. These operations can easily be applied separately in a series of chained operations.
  • OTC it would be nice to support the margins argument, though that's not a strict requirement as it can be added later.

Regarding the implementation:

  • It would be more efficient to call groupby once, and then call aggregate on the returned object (WIP: Modify aggregate for efficiency DataTables.jl#65 should make it much faster, could be backported to DataFrames), rather than calling by repeatedly.
  • Better make ops a positional argument, that will make it easier to handle the case of a single function and that of a vector of functions (see the example of colwise).
  • Please try to follow more closely the conventions used in the codebase regarding in particular lowercase identifiers, names of arguments/vocabulary, maximum line with of 92 chars, empty lines, spacing, indentation, position of line breaks. And be consistent inside a function.

EDIT: pivot should also be mentioned in the Documenter manual.

@bkamins bkamins mentioned this pull request Jan 15, 2019
31 tasks
@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

@nalimilan - I think I would prefer to have pivot in FreqTables.jl and close the issue here. In this way we can have more than two dimensions of aggregation and more than one source => aggregator options. Any opinion on this?

@nalimilan
Copy link
Member

nalimilan commented Dec 2, 2019

Well FreqTables is for frequency tables, but no frequencies are involved here.

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

I know, but "frequency" is just length, and it could be substituted by any aggregation function and all else could stay essentially the same (so we could "stretch" FreqTables.jl to cover more general pivoting, but I agree that the name is not ideal).
Otherwise maybe we should have a PivotTables.jl package that would be built on top of Tables.jl so that it would process any tabular data (not necessairly DataFrames.jl).

It all depends how much you want to invest into FreqTables.jl.

@nalimilan
Copy link
Member

Yeah I guess we could have a function which makes sense for frequencies, but also happens to be more generally useful. Though it could also make sense to have it in DataFrames. It's hard to write it using the Tables.jl API anyway since it relies on many DataFrames-specific functions (AFAICT).

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

We could add pivot to DataFrames.jl, but what I think is that it should not produce a DataFrame but rather an array with named axes. And the point is that I would prefer to avoid adding such a dependency to DataFrames.jl.

Alternatively I can implement a limited functionality similar to proposed here in DataFrames.jl (i.e. producing a DataFrame). But then I would close this PR and open a new one - to make it consistent with the current design of the package. Would you prefer this (this will be a relatively simple PR).

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

Also the question is if we need it as it is essentially an unstack of by?

@nalimilan
Copy link
Member

I don't have a strong opinion. I guess we can wait until somebody cares enough to make a PR.

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

I will add a description of by-unstack combo in my tutorial.

@bkamins
Copy link
Member

bkamins commented Jan 31, 2022

Closed in favor of #2998.

@bkamins bkamins closed this Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking reshaping
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants