From cd432c0298523f85e4f5507a078841e38bab1d5b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Wed, 16 Nov 2022 19:51:53 +0100 Subject: [PATCH 01/13] explain context dependent expressions --- docs/src/man/split_apply_combine.md | 337 +++++++++++++++++++++++++++- src/abstractdataframe/selection.jl | 4 +- 2 files changed, 338 insertions(+), 3 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 5e34a2e175..56217f23de 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -67,7 +67,7 @@ each subset of the `DataFrame`. This specification can be of the following forms except `AsTable` are allowed). 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which must be single name (as a `Symbol` or a string), a vector of names or `AsTable`. -5. special convenience forms `function => target_cols` or just `function` +5. context dependent expressions `function => target_cols` or just `function` for specific `function`s where the input columns are omitted; without `target_cols` the new column has the same name as `function`, otherwise it must be single name (as a `Symbol` or a string). Supported `function`s are: @@ -778,3 +778,338 @@ julia> df 5 │ 2 missing 5 6 │ 3 missing 6 ``` + +# Context dependent expressions + +Operation specification language supports the following context dependent +operations: + +* getting the number of rows (`nrow`); +* getting the proportion of rows (`proprow`); +* getting the group number (`groupindices`); +* getting a vector of group indices (`eachindex`). + +These operations are context dependent, because they do not require input column +name in the operation specification syntax. + +These four exceptions to the standard operation specification syntax were +introduced for user convenience as these operations are often needed in +practice. + +Below each of them is explained by example. 
+ +First create a data frame we will work with: + +```jldoctest sac +julia> df = DataFrame(customer_id=["a", "b", "b", "b", "c", "c"], + transaction_id=[12, 15, 19, 17, 13, 11], + volume=[2, 3, 1, 4, 5, 9]) +6×3 DataFrame + Row │ customer_id transaction_id volume + │ String Int64 Int64 +─────┼───────────────────────────────────── + 1 │ a 12 2 + 2 │ b 15 3 + 3 │ b 19 1 + 4 │ b 17 4 + 5 │ c 13 5 + 6 │ c 11 9 + +julia> gdf = groupby(df, :customer_id, sort=true); + +julia> show(gdf, allgroups=true) +GroupedDataFrame with 3 groups based on key: customer_id +Group 1 (1 row): customer_id = "a" + Row │ customer_id transaction_id volume + │ String Int64 Int64 +─────┼───────────────────────────────────── + 1 │ a 12 2 +Group 2 (3 rows): customer_id = "b" + Row │ customer_id transaction_id volume + │ String Int64 Int64 +─────┼───────────────────────────────────── + 1 │ b 15 3 + 2 │ b 19 1 + 3 │ b 17 4 +Group 3 (2 rows): customer_id = "c" + Row │ customer_id transaction_id volume + │ String Int64 Int64 +─────┼───────────────────────────────────── + 1 │ c 13 5 + 2 │ c 11 9 +``` + +## Getting the number of rows + +You can get the number of rows per group in a `GroupedDataFrame` by just +writing `nrow`, in which case the generated column name with the number of rows +is `:nrow`: + +```jldoctest sac +julia> combine(gdf, nrow) +3×2 DataFrame + Row │ customer_id nrow + │ String Int64 +─────┼──────────────────── + 1 │ a 1 + 2 │ b 3 + 3 │ c 2 +``` + +Additionally you are allowed to pass target column name: + +```jldoctest sac +julia> combine(gdf, nrow => "transaction_count") +3×2 DataFrame + Row │ customer_id transaction_count + │ String Int64 +─────┼──────────────────────────────── + 1 │ a 1 + 2 │ b 3 + 3 │ c 2 +``` + +Note that in both cases we did not pass source column name as it is not needed +to determine the number of rows per group. This is the reason why context +dependent expressions are exceptions to standard operation specification syntax. 
+ +Additionally the `nrow` expression also works in operation specification syntax +applied to a data frame. Here is an example: + +```jldoctest sac +julia> combine(df, nrow => "transaction_count") +1×1 DataFrame + Row │ transaction_count + │ Int64 +─────┼─────────────────── + 1 │ 6 +``` + +Finally, recall that [`nrow`](@ref) is also a regular function that returns a +number of rows in a data frame: + + +```jldoctest sac +julia> nrow(df) +6 +``` + +This dual-use of `nrow` does not lead to ambiguities, and is meant to make it +easier to remember this exception. + +## Getting the proportion of rows + +If you want to get a proportion of rows per group in a `GroupedDataFrame` +you can use the `proprow` and `proprow => [target column name]` context +dependent expressions. Here are some examples: + +```jldoctest sac +julia> combine(gdf, proprow) +3×2 DataFrame + Row │ customer_id proprow + │ String Float64 +─────┼─────────────────────── + 1 │ a 0.166667 + 2 │ b 0.5 + 3 │ c 0.333333 + +julia> combine(gdf, proprow => "transaction_fraction") +3×2 DataFrame + Row │ customer_id transaction_fraction + │ String Float64 +─────┼─────────────────────────────────── + 1 │ a 0.166667 + 2 │ b 0.5 + 3 │ c 0.333333 +``` + +As opposed to `nrow`, `proprow` cannot be used outside of operation +specification syntax and is only allowed when processing `GroupedDataFrame`. + +## Getting the group number + +Another common operation is getting group number. 
Use the `groupindices` and +`groupindices => [target column name]` context dependent expressions to get it: + + +```jldoctest sac +julia> combine(gdf, groupindices) +3×2 DataFrame + Row │ customer_id groupindices + │ String Int64 +─────┼─────────────────────────── + 1 │ a 1 + 2 │ b 2 + 3 │ c 3 + +julia> combine(gdf, groupindices => "group_number") +3×2 DataFrame + Row │ customer_id group_number + │ String Int64 +─────┼─────────────────────────── + 1 │ a 1 + 2 │ b 2 + 3 │ c 3 +``` + +The `groupindices` name was chosen, because there exists the +[`groupindices`](@ref) function that applied to `GroupedDataFrame` returns +group indices for each row in the parent data frame of the passed +`GroupedDataFrame`: + +```jldoctest sac +julia> groupindices(gdf) +6-element Vector{Union{Missing, Int64}}: + 1 + 2 + 2 + 2 + 3 + 3 +``` + +So as for `nrow` we see that the result is similar, but just in a different +context (normal function call vs. operation specification syntax). + +## Getting a vector of group indices + +The last context dependent expression supported by operation is getting group +indices. 
Use the `eachindex` and `eachindex => [target column name]` expressions +to get it: + + +```jldoctest sac +julia> combine(gdf, eachindex) +6×2 DataFrame + Row │ customer_id eachindex + │ String Int64 +─────┼──────────────────────── + 1 │ a 1 + 2 │ b 1 + 3 │ b 2 + 4 │ b 3 + 5 │ c 1 + 6 │ c 2 + +julia> combine(gdf, eachindex => "transaction_number") +6×2 DataFrame + Row │ customer_id transaction_number + │ String Int64 +─────┼───────────────────────────────── + 1 │ a 1 + 2 │ b 1 + 3 │ b 2 + 4 │ b 3 + 5 │ c 1 + 6 │ c 2 +``` + +Note that this operation also makes sense in a data frame context so it is +also supported: + +```jldoctest sac +julia> transform(df, eachindex) +6×4 DataFrame + Row │ customer_id transaction_id volume eachindex + │ String Int64 Int64 Int64 +─────┼──────────────────────────────────────────────── + 1 │ a 12 2 1 + 2 │ b 15 3 2 + 3 │ b 19 1 3 + 4 │ b 17 4 4 + 5 │ c 13 5 5 + 6 │ c 11 9 6 +``` + +Finally recall that `eachindex` is a standard function for getting all indices +in an array. This similarity of functionality was the reason why this name was +picked: + +```jldoctest sac +julia> collect(eachindex(df.customer_id)) +6-element Vector{Int64}: + 1 + 2 + 3 + 4 + 5 + 6 +``` + +This, for example, means that in the following example the two created columns +have the same contents: + +```jldoctest sac +julia> combine(gdf, eachindex, :customer_id => eachindex) +6×3 DataFrame + Row │ customer_id eachindex customer_id_eachindex + │ String Int64 Int64 +─────┼─────────────────────────────────────────────── + 1 │ a 1 1 + 2 │ b 1 1 + 3 │ b 2 2 + 4 │ b 3 3 + 5 │ c 1 1 + 6 │ c 2 2 +``` + + +## Passing a function in operation specification syntax + +When discussing context dependent expressions it is important to remember +that operation specification syntax allows you to pass a function (without +source and target column names), in which case such a function get a +`SubDataFrame` that represents a group in a `GroupedDataFrame`. 
Here is an +example: + +```jldoctest sac +julia> combine(gdf, nrow, x -> nrow(x)) +3×3 DataFrame + Row │ customer_id nrow x1 + │ String Int64 Int64 +─────┼─────────────────────────── + 1 │ a 1 1 + 2 │ b 3 3 + 3 │ c 2 2 +``` + +Notice that columns `:nrow` and `:x1` have identical contents. This is +expected. We already know that `nrow` is a context dependent expression +generating the `:nrow` column with number of rows per group. However, the +`x -> nrow(x)` anonymous function does exactly the same as it gets a +`SubDataFrame` as its argument and returns its number of rows (the `:x1` column +name is a default auto-generated column name in this case). + +To show you another example of passing a function consider the following case: + +```jldoctest sac +julia> combine(gdf, :volume => sum, x -> sum(x.volume)) +3×3 DataFrame + Row │ customer_id volume_sum x1 + │ String Int64 Int64 +─────┼──────────────────────────────── + 1 │ a 2 2 + 2 │ b 8 8 + 3 │ c 14 14 +``` + +Again, both `:volume_sum` and `:x1` columns hold the same data. The reason +is that in `:volume => sum` we just apply the `sum` function to the `:volume` +column, while in `x -> sum(x.volume)` the `x` variable is a `SubDataFrame` +representing the whole group. + +Passing a function taking a `SubDataFrame` is a flexible functionality allowing +you to perform complex operations on your data. However, you should bear in mind +two aspects: + +* Using full operation specification syntax (where source and target column + names are passe) will lead to faster execution of your code (as Julia + compiler is able to better optimize execution of such operations) in + comparison to just passing a function taking a `SubDataFrame`. +* Although writing `nrow`, `proprow`, `groupindices`, and `eachindex` looks like + just passing a function they **do not** take a `SubDataFrame` as their + argument.
As we explained in this section, they are special context dependent + expressions that are exceptions to the standard operation specification syntax + rules. They were added for user convenience (and at the same time they are + optimized to be fast). + diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl index 9e32989c68..68bf4cd313 100644 --- a/src/abstractdataframe/selection.jl +++ b/src/abstractdataframe/selection.jl @@ -75,7 +75,7 @@ const TRANSFORMATION_COMMON_RULES = except `AsTable` are allowed). 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which must be single name (as a `Symbol` or a string), a vector of names or `AsTable`. - 5. special convenience forms `function => target_cols` or just `function` + 5. context dependent expressions `function => target_cols` or just `function` for specific `function`s where the input columns are omitted; without `target_cols` the new column has the same name as `function`, otherwise it must be single name (as a `Symbol` or a string). 
Supported `function`s are: @@ -1267,7 +1267,7 @@ julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false) 8 │ 2 1 8 9 ``` -# special convenience transformations +# context dependent expressions ```jldoctest julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2], b=repeat([2, 1], outer=[4]), From 264ec2d55bd2ceeccd64201d901484892ff5eaab Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Thu, 17 Nov 2022 14:32:29 +0100 Subject: [PATCH 02/13] explain that GroupedDataFrame is indexable and iterable --- docs/src/man/split_apply_combine.md | 331 ++++++++++++++++++---------- 1 file changed, 218 insertions(+), 113 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 56217f23de..6dc5d238b4 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -1,5 +1,7 @@ # The Split-Apply-Combine Strategy +## Design of the split-apply-combine support + Many data analysis tasks involve three steps: 1. splitting a data set into groups, 2. 
applying some functions to each of the groups, @@ -186,6 +188,8 @@ for details): - `threads` : whether transformations may be run in separate tasks which can execute in parallel +## Examples of the split-apply-combine operations + We show several examples of these functions applied to the `iris` dataset below: ```jldoctest sac @@ -385,7 +389,134 @@ julia> combine(gdf) do df 3 │ Iris-virginica 5.552 0.304588 ``` -If you only want to split the data set into subsets, use the [`groupby`](@ref) function: +To apply a function to each non-grouping column of a `GroupedDataFrame` you can write: + +```jldoctest sac +julia> gd = groupby(iris, :Species) +GroupedDataFrame with 3 groups based on key: Species +First Group (50 rows): Species = "Iris-setosa" + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼─────────────────────────────────────────────────────────────── + 1 │ 5.1 3.5 1.4 0.2 Iris-setosa + 2 │ 4.9 3.0 1.4 0.2 Iris-setosa + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 49 │ 5.3 3.7 1.5 0.2 Iris-setosa + 50 │ 5.0 3.3 1.4 0.2 Iris-setosa + 46 rows omitted +⋮ +Last Group (50 rows): Species = "Iris-virginica" + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼────────────────────────────────────────────────────────────────── + 1 │ 6.3 3.3 6.0 2.5 Iris-virginica + 2 │ 5.8 2.7 5.1 1.9 Iris-virginica + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 50 │ 5.9 3.0 5.1 1.8 Iris-virginica + 47 rows omitted + +julia> combine(gd, valuecols(gd) .=> mean) +3×5 DataFrame + Row │ Species SepalLength_mean SepalWidth_mean PetalLength_mean P ⋯ + │ String15 Float64 Float64 Float64 F ⋯ +─────┼────────────────────────────────────────────────────────────────────────── + 1 │ Iris-setosa 5.006 3.418 1.464 ⋯ + 2 │ Iris-versicolor 5.936 2.77 4.26 + 3 │ Iris-virginica 6.588 2.974 5.552 + 1 column omitted +``` + +Note that `GroupedDataFrame` is a view: therefore +grouping columns of its parent data frame must not be mutated, and 
+rows must not be added nor removed from it. If the number or rows +of the parent changes then an error is thrown when a child `GroupedDataFrame` +is used: +```jldoctest sac +julia> df = DataFrame(id=1:2) +2×1 DataFrame + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + +julia> gd = groupby(df, :id) +GroupedDataFrame with 2 groups based on key: id +First Group (1 row): id = 1 + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 +⋮ +Last Group (1 row): id = 2 + Row │ id + │ Int64 +─────┼─────── + 1 │ 2 + +julia> push!(df, [3]) +3×1 DataFrame + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> gd[1] +ERROR: AssertionError: The current number of rows in the parent data frame is 3 and it does not match the number of rows it contained when GroupedDataFrame was created which was 2. The number of rows in the parent data frame has likely been changed unintentionally (e.g. using subset!, filter!, deleteat!, push!, or append! functions). +``` + +Sometimes it is useful to append rows to the source data frame of a +`GroupedDataFrame`, without affecting the rows used for grouping. +In such a scenario you can create the grouped data frame using a `view` +of the parent data frame to avoid the error: + +```jldoctest sac +julia> df = DataFrame(id=1:2) +2×1 DataFrame + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + +julia> gd = groupby(view(df, :, :), :id) +GroupedDataFrame with 2 groups based on key: id +First Group (1 row): id = 1 + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 +⋮ +Last Group (1 row): id = 2 + Row │ id + │ Int64 +─────┼─────── + 1 │ 2 + +julia> push!(df, [3]) +3×1 DataFrame + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> gd[1] +1×1 SubDataFrame + Row │ id + │ Int64 +─────┼─────── + 1 │ 1 +``` + +## Using `GroupedDataFrame` as an itrable and indexable object + +If you only want to split the data set into subsets, use the [`groupby`](@ref) +function. 
You can then iterate `SubDataFrame`s that constitute the identified +groups: ```jldoctest sac julia> for subdf in groupby(iris, :Species) @@ -494,129 +625,103 @@ Last Group (5 rows): g = 501 5 │ 501 2505 ``` -In order to apply a function to each non-grouping column of a `GroupedDataFrame` you can write: +Note that although `GroupedDataFrame` is iterable and indexable it is not an +`AbstractVector`. For this reason currently it was designed that it does not +support `map` nor broadcasting (to allow for making a decision in the future +what result type they should produce). To apply a function to all groups of a +data frame and get a vector of results either use a comprehension or `collect` +`GroupedDataFrame` into a vector first. Here are examples of both approaches: + ```jldoctest sac -julia> gd = groupby(iris, :Species) -GroupedDataFrame with 3 groups based on key: Species -First Group (50 rows): Species = "Iris-setosa" - Row │ SepalLength SepalWidth PetalLength PetalWidth Species - │ Float64 Float64 Float64 Float64 String15 +julia> [nrow(sdf) for sdf in gd] +3-element Vector{Int64}: + 50 + 50 + 50 + +julia> sdf_vec = collect(gd) +3-element Vector{Any}: + 50×5 SubDataFrame + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 ─────┼─────────────────────────────────────────────────────────────── 1 │ 5.1 3.5 1.4 0.2 Iris-setosa 2 │ 4.9 3.0 1.4 0.2 Iris-setosa + 3 │ 4.7 3.2 1.3 0.2 Iris-setosa + 4 │ 4.6 3.1 1.5 0.2 Iris-setosa + 5 │ 5.0 3.6 1.4 0.2 Iris-setosa + 6 │ 5.4 3.9 1.7 0.4 Iris-setosa + 7 │ 4.6 3.4 1.4 0.3 Iris-setosa + 8 │ 5.0 3.4 1.5 0.2 Iris-setosa ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 44 │ 5.0 3.5 1.6 0.6 Iris-setosa + 45 │ 5.1 3.8 1.9 0.4 Iris-setosa + 46 │ 4.8 3.0 1.4 0.3 Iris-setosa + 47 │ 5.1 3.8 1.6 0.2 Iris-setosa + 48 │ 4.6 3.2 1.4 0.2 Iris-setosa 49 │ 5.3 3.7 1.5 0.2 Iris-setosa 50 │ 5.0 3.3 1.4 0.2 Iris-setosa - 46 rows omitted -⋮ -Last Group (50 rows): Species = "Iris-virginica" + 35 rows omitted + 50×5 
SubDataFrame Row │ SepalLength SepalWidth PetalLength PetalWidth Species - │ Float64 Float64 Float64 Float64 String15 + │ Float64 Float64 Float64 Float64 String15 +─────┼─────────────────────────────────────────────────────────────────── + 1 │ 7.0 3.2 4.7 1.4 Iris-versicolor + 2 │ 6.4 3.2 4.5 1.5 Iris-versicolor + 3 │ 6.9 3.1 4.9 1.5 Iris-versicolor + 4 │ 5.5 2.3 4.0 1.3 Iris-versicolor + 5 │ 6.5 2.8 4.6 1.5 Iris-versicolor + 6 │ 5.7 2.8 4.5 1.3 Iris-versicolor + 7 │ 6.3 3.3 4.7 1.6 Iris-versicolor + 8 │ 4.9 2.4 3.3 1.0 Iris-versicolor + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 44 │ 5.0 2.3 3.3 1.0 Iris-versicolor + 45 │ 5.6 2.7 4.2 1.3 Iris-versicolor + 46 │ 5.7 3.0 4.2 1.2 Iris-versicolor + 47 │ 5.7 2.9 4.2 1.3 Iris-versicolor + 48 │ 6.2 2.9 4.3 1.3 Iris-versicolor + 49 │ 5.1 2.5 3.0 1.1 Iris-versicolor + 50 │ 5.7 2.8 4.1 1.3 Iris-versicolor + 35 rows omitted + 50×5 SubDataFrame + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 ─────┼────────────────────────────────────────────────────────────────── 1 │ 6.3 3.3 6.0 2.5 Iris-virginica 2 │ 5.8 2.7 5.1 1.9 Iris-virginica + 3 │ 7.1 3.0 5.9 2.1 Iris-virginica + 4 │ 6.3 2.9 5.6 1.8 Iris-virginica + 5 │ 6.5 3.0 5.8 2.2 Iris-virginica + 6 │ 7.6 3.0 6.6 2.1 Iris-virginica + 7 │ 4.9 2.5 4.5 1.7 Iris-virginica + 8 │ 7.3 2.9 6.3 1.8 Iris-virginica ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 44 │ 6.8 3.2 5.9 2.3 Iris-virginica + 45 │ 6.7 3.3 5.7 2.5 Iris-virginica + 46 │ 6.7 3.0 5.2 2.3 Iris-virginica + 47 │ 6.3 2.5 5.0 1.9 Iris-virginica + 48 │ 6.5 3.0 5.2 2.0 Iris-virginica + 49 │ 6.2 3.4 5.4 2.3 Iris-virginica 50 │ 5.9 3.0 5.1 1.8 Iris-virginica - 47 rows omitted - -julia> combine(gd, valuecols(gd) .=> mean) -3×5 DataFrame - Row │ Species SepalLength_mean SepalWidth_mean PetalLength_mean P ⋯ - │ String15 Float64 Float64 Float64 F ⋯ -─────┼────────────────────────────────────────────────────────────────────────── - 1 │ Iris-setosa 5.006 3.418 1.464 ⋯ - 2 │ Iris-versicolor 5.936 2.77 4.26 - 3 │ Iris-virginica 6.588 2.974 
5.552 - 1 column omitted -``` - -Note that `GroupedDataFrame` is a view: therefore -grouping columns of its parent data frame must not be mutated, and -rows must not be added nor removed from it. If the number or rows -of the parent changes then an error is thrown when a child `GroupedDataFrame` -is used: -```jldoctest sac -julia> df = DataFrame(id=1:2) -2×1 DataFrame - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - -julia> gd = groupby(df, :id) -GroupedDataFrame with 2 groups based on key: id -First Group (1 row): id = 1 - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 -⋮ -Last Group (1 row): id = 2 - Row │ id - │ Int64 -─────┼─────── - 1 │ 2 - -julia> push!(df, [3]) -3×1 DataFrame - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - -julia> gd[1] -ERROR: AssertionError: The current number of rows in the parent data frame is 3 and it does not match the number of rows it contained when GroupedDataFrame was created which was 2. The number of rows in the parent data frame has likely been changed unintentionally (e.g. using subset!, filter!, deleteat!, push!, or append! functions). + 35 rows omitted + +julia> map(nrow, sdf_vec) +3-element Vector{Int64}: + 50 + 50 + 50 + +julia> nrow.(sdf_vec) +3-element Vector{Int64}: + 50 + 50 + 50 ``` -Sometimes it is useful to append rows to the source data frame of a -`GroupedDataFrame`, without affecting the rows used for grouping. 
-In such a scenario you can create the grouped data frame using a `view` -of the parent data frame to avoid the error: - -```jldoctest sac -julia> df = DataFrame(id=1:2) -2×1 DataFrame - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - -julia> gd = groupby(view(df, :, :), :id) -GroupedDataFrame with 2 groups based on key: id -First Group (1 row): id = 1 - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 -⋮ -Last Group (1 row): id = 2 - Row │ id - │ Int64 -─────┼─────── - 1 │ 2 - -julia> push!(df, [3]) -3×1 DataFrame - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - -julia> gd[1] -1×1 SubDataFrame - Row │ id - │ Int64 -─────┼─────── - 1 │ 1 -``` +Note, that using split-apply-combine strategy with operation specification +syntax usually will be faster than iterating a `GroupedDataFrame`. -# Simulating the SQL `where` clause +## Simulating the SQL `where` clause You can conveniently work on subsets of a data frame by using `SubDataFrame`s. Operations performed on such objects can either create a new data frame or be @@ -779,7 +884,7 @@ julia> df 6 │ 3 missing 6 ``` -# Context dependent expressions +## Context dependent expressions Operation specification language supports the following context dependent operations: @@ -897,7 +1002,7 @@ julia> nrow(df) This dual-use of `nrow` does not lead to ambiguities, and is meant to make it easier to remember this exception. -## Getting the proportion of rows +### Getting the proportion of rows If you want to get a proportion of rows per group in a `GroupedDataFrame` you can use the `proprow` and `proprow => [target column name]` context @@ -926,7 +1031,7 @@ julia> combine(gdf, proprow => "transaction_fraction") As opposed to `nrow`, `proprow` cannot be used outside of operation specification syntax and is only allowed when processing `GroupedDataFrame`. -## Getting the group number +### Getting the group number Another common operation is getting group number. 
Use the `groupindices` and `groupindices => [target column name]` context dependent expressions to get it: @@ -971,7 +1076,7 @@ julia> groupindices(gdf) So as for `nrow` we see that the result is similar, but just in a different context (normal function call vs. operation specification syntax). -## Getting a vector of group indices +### Getting a vector of group indices The last context dependent expression supported by operation is getting group indices. Use the `eachindex` and `eachindex => [target column name]` expressions @@ -1054,7 +1159,7 @@ julia> combine(gdf, eachindex, :customer_id => eachindex) ``` -## Passing a function in operation specification syntax +### Passing a function in operation specification syntax When discussing context dependent expressions it is important to remember that operation specification syntax allows you to pass a function (without From b6f48e084f78160b31c407de72d7c97cd8fa447a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Thu, 17 Nov 2022 14:35:27 +0100 Subject: [PATCH 03/13] fix typo --- docs/src/man/split_apply_combine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 6dc5d238b4..684e6cc260 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -626,7 +626,7 @@ Last Group (5 rows): g = 501 ``` Note that although `GroupedDataFrame` is iterable and indexable it is not an -`AbstractVector`. For this reason currently it was designed that it does not +`AbstractVector`. For this reason currently it was decided that it does not support `map` nor broadcasting (to allow for making a decision in the future what result type they should produce). 
To apply a function to all groups of a data frame and get a vector of results either use a comprehension or `collect` From b4560f3c171048ff41107b7e78492c82a8be163b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Thu, 17 Nov 2022 21:20:57 +0100 Subject: [PATCH 04/13] define gd properly --- docs/src/man/split_apply_combine.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 684e6cc260..1991839413 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -633,6 +633,8 @@ data frame and get a vector of results either use a comprehension or `collect` `GroupedDataFrame` into a vector first. Here are examples of both approaches: ```jldoctest sac +julia> gd = groupby(iris, :Species); + julia> [nrow(sdf) for sdf in gd] 3-element Vector{Int64}: 50 From 7d8e5e880efbfac78470528cd00a405c93037a11 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Mon, 21 Nov 2022 19:56:30 +0100 Subject: [PATCH 05/13] Apply suggestions from code review Co-authored-by: Milan Bouchet-Valat --- docs/src/man/split_apply_combine.md | 70 ++++++++++++++--------------- src/abstractdataframe/selection.jl | 4 +- 2 files changed, 35 insertions(+), 39 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 1991839413..0fd278b9bd 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -69,7 +69,7 @@ each subset of the `DataFrame`. This specification can be of the following forms except `AsTable` are allowed). 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which must be single name (as a `Symbol` or a string), a vector of names or `AsTable`. -5. context dependent expressions `function => target_cols` or just `function` +5. 
context-dependent expressions `function => target_cols` or just `function` for specific `function`s where the input columns are omitted; without `target_cols` the new column has the same name as `function`, otherwise it must be single name (as a `Symbol` or a string). Supported `function`s are: @@ -512,7 +512,7 @@ julia> gd[1] 1 │ 1 ``` -## Using `GroupedDataFrame` as an itrable and indexable object +## Using `GroupedDataFrame` as an iterable and indexable object If you only want to split the data set into subsets, use the [`groupby`](@ref) function. You can then iterate `SubDataFrame`s that constitute the identified @@ -720,8 +720,9 @@ julia> nrow.(sdf_vec) 50 ``` -Note, that using split-apply-combine strategy with operation specification -syntax usually will be faster than iterating a `GroupedDataFrame`. +Note that using the split-apply-combine strategy with operation specification +syntax in `combine`, `select` or `transform` will usually be faster than iterating +a `GroupedDataFrame`. ## Simulating the SQL `where` clause @@ -886,17 +887,17 @@ julia> df 6 │ 3 missing 6 ``` -## Context dependent expressions +## Context-dependent expressions -Operation specification language supports the following context dependent -operations: +The operation specification language used with `combine`, `select` and `transform` +supports the following context-dependent operations: -* getting the number of rows (`nrow`); -* getting the proportion of rows (`proprow`); +* getting the number of rows in a group (`nrow`); +* getting the proportion of rows in a group (`proprow`); * getting the group number (`groupindices`); * getting a vector of group indices (`eachindex`). -These operations are context dependent, because they do not require input column +These operations are context-dependent, because they do not require specifying the input column name in the operation specification syntax. 
These four exceptions to the standard operation specification syntax were @@ -977,10 +978,10 @@ julia> combine(gdf, nrow => "transaction_count") ``` Note that in both cases we did not pass source column name as it is not needed -to determine the number of rows per group. This is the reason why context -dependent expressions are exceptions to standard operation specification syntax. +to determine the number of rows per group. This is the reason why context-dependent +expressions are exceptions to standard operation specification syntax. -Additionally the `nrow` expression also works in operation specification syntax +The `nrow` expression also works in the operation specification syntax applied to a data frame. Here is an example: ```jldoctest sac @@ -1001,14 +1002,14 @@ julia> nrow(df) 6 ``` -This dual-use of `nrow` does not lead to ambiguities, and is meant to make it +This dual use of `nrow` does not lead to ambiguities, and is meant to make it easier to remember this exception. ### Getting the proportion of rows If you want to get a proportion of rows per group in a `GroupedDataFrame` -you can use the `proprow` and `proprow => [target column name]` context -dependent expressions. Here are some examples: +you can use the `proprow` and `proprow => [target column name]` context-dependent +expressions. Here are some examples: ```jldoctest sac julia> combine(gdf, proprow) @@ -1030,13 +1031,13 @@ julia> combine(gdf, proprow => "transaction_fraction") 3 │ c 0.333333 ``` -As opposed to `nrow`, `proprow` cannot be used outside of operation -specification syntax and is only allowed when processing `GroupedDataFrame`. +As opposed to `nrow`, `proprow` cannot be used outside of the operation +specification syntax and is only allowed when processing a `GroupedDataFrame`. ### Getting the group number Another common operation is getting group number. 
Use the `groupindices` and -`groupindices => [target column name]` context dependent expressions to get it: +`groupindices => [target column name]` context-dependent expressions to get it: ```jldoctest sac @@ -1059,10 +1060,9 @@ julia> combine(gdf, groupindices => "group_number") 3 │ c 3 ``` -The `groupindices` name was chosen, because there exists the -[`groupindices`](@ref) function that applied to `GroupedDataFrame` returns -group indices for each row in the parent data frame of the passed -`GroupedDataFrame`: +Outside of the operation specification syntax, [`groupindices`](@ref) +is also a regular function which returns group indices for each row +in the parent data frame of the passed `GroupedDataFrame`: ```jldoctest sac julia> groupindices(gdf) @@ -1075,14 +1075,10 @@ julia> groupindices(gdf) 3 ``` -So as for `nrow` we see that the result is similar, but just in a different -context (normal function call vs. operation specification syntax). +### Getting a vector of indices within groups -### Getting a vector of group indices - -The last context dependent expression supported by operation is getting group -indices. 
Use the `eachindex` and `eachindex => [target column name]` expressions -to get it: +The last context-dependent expression supported by the operation +specification syntax is getting the index of each row within each group: ```jldoctest sac @@ -1111,8 +1107,8 @@ julia> combine(gdf, eachindex => "transaction_number") 6 │ c 2 ``` -Note that this operation also makes sense in a data frame context so it is -also supported: +Note that this operation also makes sense in a data frame context, +where all rows are considered to be in the same group: ```jldoctest sac julia> transform(df, eachindex) @@ -1161,11 +1157,11 @@ julia> combine(gdf, eachindex, :customer_id => eachindex) ``` -### Passing a function in operation specification syntax +## Context-dependent expressions versus functions When discussing context dependent expressions it is important to remember that operation specification syntax allows you to pass a function (without -source and target column names), in which case such a function get a +source and target column names), in which case such a function gets passed a `SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an example: @@ -1209,13 +1205,13 @@ Passing a function taking a `SubDataFrame` is a flexible functionality allowing you to perform complex operations on your data. However, you should bear in mind two aspects: -* Using full operation specification syntax (where source and target column - names are passe) will lead to faster execution of your code (as Julia +* Using the full operation specification syntax (where source and target column + names are passed) will lead to faster execution of your code (as the Julia compiler is able to better optimize execution of such operations) in comparison to just passing a function taking a `SubDataFrame`. * Although writing `row`, `proprow`, `groupindices`, and `eachindex` looks like just passing a function they **do not** take a `SubDataFrame` as their - argument. 
As we explained in this section, they are special context dependent + argument. As we explained in this section, they are special context-dependent expressions that are exceptions to the standard operation specification syntax rules. They were added for user convenience (and at the same time they are optimized to be fast). diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl index 68bf4cd313..6cd4d4787d 100644 --- a/src/abstractdataframe/selection.jl +++ b/src/abstractdataframe/selection.jl @@ -75,7 +75,7 @@ const TRANSFORMATION_COMMON_RULES = except `AsTable` are allowed). 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which must be single name (as a `Symbol` or a string), a vector of names or `AsTable`. - 5. context dependent expressions `function => target_cols` or just `function` + 5. context-dependent expressions `function => target_cols` or just `function` for specific `function`s where the input columns are omitted; without `target_cols` the new column has the same name as `function`, otherwise it must be single name (as a `Symbol` or a string). 
Supported `function`s are: @@ -1267,7 +1267,7 @@ julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false) 8 │ 2 1 8 9 ``` -# context dependent expressions +# context-dependent expressions ```jldoctest julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2], b=repeat([2, 1], outer=[4]), From 9acd38e88959e91307699e39baf563f1a5328909 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Mon, 21 Nov 2022 20:29:43 +0100 Subject: [PATCH 06/13] updates after code review --- docs/src/man/split_apply_combine.md | 311 +++++++++++++++------------- 1 file changed, 162 insertions(+), 149 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 0fd278b9bd..0d9dcf95eb 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -220,7 +220,7 @@ julia> iris = CSV.read((joinpath(dirname(pathof(DataFrames)), 150 │ 5.9 3.0 5.1 1.8 Iris-virginica 135 rows omitted -julia> gdf = groupby(iris, :Species) +julia> iris_gdf = groupby(iris, :Species) GroupedDataFrame with 3 groups based on key: Species First Group (50 rows): Species = "Iris-setosa" Row │ SepalLength SepalWidth PetalLength PetalWidth Species @@ -243,7 +243,7 @@ Last Group (50 rows): Species = "Iris-virginica" 50 │ 5.9 3.0 5.1 1.8 Iris-virginica 47 rows omitted -julia> combine(gdf, :PetalLength => mean) +julia> combine(iris_gdf, :PetalLength => mean) 3×2 DataFrame Row │ Species PetalLength_mean │ String15 Float64 @@ -252,7 +252,7 @@ julia> combine(gdf, :PetalLength => mean) 2 │ Iris-versicolor 4.26 3 │ Iris-virginica 5.552 -julia> combine(gdf, nrow, proprow, groupindices) +julia> combine(iris_gdf, nrow, proprow, groupindices) 3×4 DataFrame Row │ Species nrow proprow groupindices │ String15 Int64 Float64 Int64 @@ -261,7 +261,7 @@ julia> combine(gdf, nrow, proprow, groupindices) 2 │ Iris-versicolor 50 0.333333 2 3 │ Iris-virginica 50 0.333333 3 -julia> combine(gdf, nrow, :PetalLength => mean => :mean) +julia> 
combine(iris_gdf, nrow, :PetalLength => mean => :mean) 3×3 DataFrame Row │ Species nrow mean │ String15 Int64 Float64 @@ -270,7 +270,9 @@ julia> combine(gdf, nrow, :PetalLength => mean => :mean) 2 │ Iris-versicolor 50 4.26 3 │ Iris-virginica 50 5.552 -julia> combine(gdf, [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s), b=sum(p))) => +julia> combine(iris_gdf, + [:PetalLength, :SepalLength] => + ((p, s) -> (a=mean(p)/mean(s), b=sum(p))) => AsTable) # multiple columns are passed as arguments 3×3 DataFrame Row │ Species a b @@ -280,7 +282,7 @@ julia> combine(gdf, [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s 2 │ Iris-versicolor 0.717655 213.0 3 │ Iris-virginica 0.842744 277.6 -julia> combine(gdf, +julia> combine(iris_gdf, AsTable([:PetalLength, :SepalLength]) => x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple 3×2 DataFrame @@ -291,7 +293,7 @@ julia> combine(gdf, 2 │ Iris-versicolor 0.910378 3 │ Iris-virginica 0.867923 -julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), gdf) # passing a SubDataFrame +julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), iris_gdf) # passing a SubDataFrame 3×2 DataFrame Row │ Species x1 │ String15 Float64 @@ -300,7 +302,7 @@ julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), gdf) # passing a Su 2 │ Iris-versicolor 0.910378 3 │ Iris-virginica 0.867923 -julia> combine(gdf, 1:2 => cor, nrow) +julia> combine(iris_gdf, 1:2 => cor, nrow) 3×3 DataFrame Row │ Species SepalLength_SepalWidth_cor nrow │ String15 Float64 Int64 @@ -309,7 +311,7 @@ julia> combine(gdf, 1:2 => cor, nrow) 2 │ Iris-versicolor 0.525911 50 3 │ Iris-virginica 0.457228 50 -julia> combine(gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max]) +julia> combine(iris_gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max]) 3×3 DataFrame Row │ Species min max │ String15 Float64 Float64 @@ -321,7 +323,7 @@ julia> combine(gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max]) To get row number for each 
observation within each group use the `eachindex` function: ``` -julia> combine(gdf, eachindex) +julia> combine(iris_gdf, eachindex) 150×2 DataFrame Row │ Species eachindex │ String15 Int64 @@ -342,7 +344,7 @@ In the example below the return values in columns `:SepalLength_SepalWidth_cor` and `:nrow` are broadcasted to match the number of elements in each group: ``` -julia> select(gdf, 1:2 => cor) +julia> select(iris_gdf, 1:2 => cor) 150×2 DataFrame Row │ Species SepalLength_SepalWidth_cor │ String Float64 @@ -357,7 +359,7 @@ julia> select(gdf, 1:2 => cor) 150 │ Iris-virginica 0.457228 143 rows omitted -julia> transform(gdf, :Species => x -> chop.(x, head=5, tail=0)) +julia> transform(iris_gdf, :Species => x -> chop.(x, head=5, tail=0)) 150×6 DataFrame Row │ SepalLength SepalWidth PetalLength PetalWidth Species Species_function │ Float64 Float64 Float64 Float64 String SubString… @@ -377,7 +379,7 @@ All functions also support the `do` block form. However, as noted above, this form is slow and should therefore be avoided when performance matters. 
```jldoctest sac -julia> combine(gdf) do df +julia> combine(iris_gdf) do df (m = mean(df.PetalLength), s² = var(df.PetalLength)) end 3×3 DataFrame @@ -392,30 +394,7 @@ julia> combine(gdf) do df To apply a function to each non-grouping column of a `GroupedDataFrame` you can write: ```jldoctest sac -julia> gd = groupby(iris, :Species) -GroupedDataFrame with 3 groups based on key: Species -First Group (50 rows): Species = "Iris-setosa" - Row │ SepalLength SepalWidth PetalLength PetalWidth Species - │ Float64 Float64 Float64 Float64 String15 -─────┼─────────────────────────────────────────────────────────────── - 1 │ 5.1 3.5 1.4 0.2 Iris-setosa - 2 │ 4.9 3.0 1.4 0.2 Iris-setosa - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ - 49 │ 5.3 3.7 1.5 0.2 Iris-setosa - 50 │ 5.0 3.3 1.4 0.2 Iris-setosa - 46 rows omitted -⋮ -Last Group (50 rows): Species = "Iris-virginica" - Row │ SepalLength SepalWidth PetalLength PetalWidth Species - │ Float64 Float64 Float64 Float64 String15 -─────┼────────────────────────────────────────────────────────────────── - 1 │ 6.3 3.3 6.0 2.5 Iris-virginica - 2 │ 5.8 2.7 5.1 1.9 Iris-virginica - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ - 50 │ 5.9 3.0 5.1 1.8 Iris-virginica - 47 rows omitted - -julia> combine(gd, valuecols(gd) .=> mean) +julia> combine(iris_gdf, valuecols(iris_gdf) .=> mean) 3×5 DataFrame Row │ Species SepalLength_mean SepalWidth_mean PetalLength_mean P ⋯ │ String15 Float64 Float64 Float64 F ⋯ @@ -431,6 +410,7 @@ grouping columns of its parent data frame must not be mutated, and rows must not be added nor removed from it. If the number or rows of the parent changes then an error is thrown when a child `GroupedDataFrame` is used: + ```jldoctest sac julia> df = DataFrame(id=1:2) 2×1 DataFrame @@ -519,7 +499,7 @@ function. 
You can then iterate `SubDataFrame`s that constitute the identified groups: ```jldoctest sac -julia> for subdf in groupby(iris, :Species) +julia> for subdf in iris_gdf println(size(subdf, 1)) end 50 @@ -531,7 +511,7 @@ To also get the values of the grouping columns along with each group, use the `pairs` function: ```jldoctest sac -julia> for (key, subdf) in pairs(groupby(iris, :Species)) +julia> for (key, subdf) in pairs(iris_gdf) println("Number of data points for $(key.Species): $(nrow(subdf))") end Number of data points for Iris-setosa: 50 @@ -539,92 +519,6 @@ Number of data points for Iris-versicolor: 50 Number of data points for Iris-virginica: 50 ``` -The value of `key` in the previous example is a [`DataFrames.GroupKey`](@ref) object, -which can be used in a similar fashion to a `NamedTuple`. - -Grouping a data frame using the `groupby` function can be seen as adding a lookup key -to it. Such lookups can be performed efficiently by indexing the resulting -`GroupedDataFrame` with a `Tuple` or `NamedTuple`: -```jldoctest sac -julia> df = DataFrame(g=repeat(1:1000, inner=5), x=1:5000) -5000×2 DataFrame - Row │ g x - │ Int64 Int64 -──────┼────────────── - 1 │ 1 1 - 2 │ 1 2 - 3 │ 1 3 - 4 │ 1 4 - 5 │ 1 5 - 6 │ 2 6 - 7 │ 2 7 - 8 │ 2 8 - ⋮ │ ⋮ ⋮ - 4994 │ 999 4994 - 4995 │ 999 4995 - 4996 │ 1000 4996 - 4997 │ 1000 4997 - 4998 │ 1000 4998 - 4999 │ 1000 4999 - 5000 │ 1000 5000 - 4985 rows omitted - -julia> gdf = groupby(df, :g) -GroupedDataFrame with 1000 groups based on key: g -First Group (5 rows): g = 1 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 1 1 - 2 │ 1 2 - 3 │ 1 3 - 4 │ 1 4 - 5 │ 1 5 -⋮ -Last Group (5 rows): g = 1000 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 1000 4996 - 2 │ 1000 4997 - 3 │ 1000 4998 - 4 │ 1000 4999 - 5 │ 1000 5000 - -julia> gdf[(g=500,)] -5×2 SubDataFrame - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 500 2496 - 2 │ 500 2497 - 3 │ 500 2498 - 4 │ 500 2499 - 5 │ 500 2500 - -julia> gdf[[(500,), (501,)]] 
-GroupedDataFrame with 2 groups based on key: g -First Group (5 rows): g = 500 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 500 2496 - 2 │ 500 2497 - 3 │ 500 2498 - 4 │ 500 2499 - 5 │ 500 2500 -⋮ -Last Group (5 rows): g = 501 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 501 2501 - 2 │ 501 2502 - 3 │ 501 2503 - 4 │ 501 2504 - 5 │ 501 2505 -``` - Note that although `GroupedDataFrame` is iterable and indexable it is not an `AbstractVector`. For this reason currently it was decided that it does not support `map` nor broadcasting (to allow for making a decision in the future @@ -633,15 +527,13 @@ data frame and get a vector of results either use a comprehension or `collect` `GroupedDataFrame` into a vector first. Here are examples of both approaches: ```jldoctest sac -julia> gd = groupby(iris, :Species); - -julia> [nrow(sdf) for sdf in gd] +julia> [nrow(sdf) for sdf in iris_gdf] 3-element Vector{Int64}: 50 50 50 -julia> sdf_vec = collect(gd) +julia> sdf_vec = collect(iris_gdf) 3-element Vector{Any}: 50×5 SubDataFrame Row │ SepalLength SepalWidth PetalLength PetalWidth Species @@ -724,6 +616,121 @@ Note that using the split-apply-combine strategy with operation specification syntax in `combine`, `select` or `transform` will usually be faster than iterating a `GroupedDataFrame`. +The value of `key` in the example above where we iterated `pairs(iris_gdf)` +is a [`DataFrames.GroupKey`](@ref) object, +which can be used in a similar fashion to a `NamedTuple`. + +Grouping a data frame using the `groupby` function can be seen as adding a +lookup key to it. Such lookups can be performed efficiently by indexing the +resulting `GroupedDataFrame` with [`DataFrames.GroupKey`](@ref) (as it was +presented above), a `Tuple`, a `NamedTuple`, or a dictionary. Here are some +more examples of such indexing. 
+ +```jldoctest sac +julia> df = DataFrame(g=repeat(1:1000, inner=5), x=1:5000) +5000×2 DataFrame + Row │ g x + │ Int64 Int64 +──────┼────────────── + 1 │ 1 1 + 2 │ 1 2 + 3 │ 1 3 + 4 │ 1 4 + 5 │ 1 5 + 6 │ 2 6 + 7 │ 2 7 + 8 │ 2 8 + ⋮ │ ⋮ ⋮ + 4994 │ 999 4994 + 4995 │ 999 4995 + 4996 │ 1000 4996 + 4997 │ 1000 4997 + 4998 │ 1000 4998 + 4999 │ 1000 4999 + 5000 │ 1000 5000 + 4985 rows omitted + +julia> gd = groupby(df, :g) +GroupedDataFrame with 1000 groups based on key: g +First Group (5 rows): g = 1 + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 1 1 + 2 │ 1 2 + 3 │ 1 3 + 4 │ 1 4 + 5 │ 1 5 +⋮ +Last Group (5 rows): g = 1000 + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 1000 4996 + 2 │ 1000 4997 + 3 │ 1000 4998 + 4 │ 1000 4999 + 5 │ 1000 5000 + +julia> gd[(g=500,)] # a NamedTuple +5×2 SubDataFrame + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 500 2496 + 2 │ 500 2497 + 3 │ 500 2498 + 4 │ 500 2499 + 5 │ 500 2500 + +julia> gd[[(500,), (501,)]] # a vector of Tuples +GroupedDataFrame with 2 groups based on key: g +First Group (5 rows): g = 500 + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 500 2496 + 2 │ 500 2497 + 3 │ 500 2498 + 4 │ 500 2499 + 5 │ 500 2500 +⋮ +Last Group (5 rows): g = 501 + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 501 2501 + 2 │ 501 2502 + 3 │ 501 2503 + 4 │ 501 2504 + 5 │ 501 2505 + +julia> key = keys(gd) |> last # last key in gd +GroupKey: (g = 1000,) + +julia> gd[key] +5×2 SubDataFrame + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 1000 4996 + 2 │ 1000 4997 + 3 │ 1000 4998 + 4 │ 1000 4999 + 5 │ 1000 5000 + +julia> gd[Dict("g" => 1000)] # a dictionary +5×2 SubDataFrame + Row │ g x + │ Int64 Int64 +─────┼────────────── + 1 │ 1000 4996 + 2 │ 1000 4997 + 3 │ 1000 4998 + 4 │ 1000 4999 + 5 │ 1000 5000 +``` + ## Simulating the SQL `where` clause You can conveniently work on subsets of a data frame by using `SubDataFrame`s. 
@@ -895,7 +902,7 @@ supports the following context-dependent operations: * getting the number of rows in a group (`nrow`); * getting the proportion of rows in a group (`proprow`); * getting the group number (`groupindices`); -* getting a vector of group indices (`eachindex`). +* getting a vector of indices within groups (`eachindex`). These operations are context-dependent, because they do not require specifying the input column name in the operation specification syntax. @@ -947,7 +954,7 @@ Group 3 (2 rows): customer_id = "c" 2 │ c 11 9 ``` -## Getting the number of rows +### Getting the number of rows You can get the number of rows per group in a `GroupedDataFrame` by just writing `nrow`, in which case the generated column name with the number of rows @@ -1050,6 +1057,18 @@ julia> combine(gdf, groupindices) 2 │ b 2 3 │ c 3 +julia> transform(gdf, groupindices) +6×4 DataFrame + Row │ customer_id transaction_id volume groupindices + │ String Int64 Int64 Int64 +─────┼─────────────────────────────────────────────────── + 1 │ a 12 2 1 + 2 │ b 15 3 2 + 3 │ b 19 1 2 + 4 │ b 17 4 2 + 5 │ c 13 5 3 + 6 │ c 11 9 3 + julia> combine(gdf, groupindices => "group_number") 3×2 DataFrame Row │ customer_id group_number @@ -1094,6 +1113,18 @@ julia> combine(gdf, eachindex) 5 │ c 1 6 │ c 2 +julia> select(gdf, eachindex, groupindices) +6×3 DataFrame + Row │ customer_id eachindex groupindices + │ String Int64 Int64 +─────┼────────────────────────────────────── + 1 │ a 1 1 + 2 │ b 1 2 + 3 │ b 2 2 + 4 │ b 3 2 + 5 │ c 1 3 + 6 │ c 2 3 + julia> combine(gdf, eachindex => "transaction_number") 6×2 DataFrame Row │ customer_id transaction_number @@ -1183,24 +1214,6 @@ generating the `:nrow` column with number of rows per group. However, the `SubDataFrame` as its argument and returns its number of rows (the `:x1` column name is a default auto-generated column name in this case). 
-To show you another example of passing a function consider the following case: - -```jldoctest sac -julia> combine(gdf, :volume => sum, x -> sum(x.volume)) -3×3 DataFrame - Row │ customer_id volume_sum x1 - │ String Int64 Int64 -─────┼──────────────────────────────── - 1 │ a 2 2 - 2 │ b 8 8 - 3 │ c 14 14 -``` - -Again, both `:volume_sum` and `:x1` columns hold the same data. The reason -is that in `:volume => sum` we just apply the `sum` function to the `:volume` -column, while in `x -> sum(x.volume`, `x` variable is a `SubDataFrame` -representing the whole group. - Passing a function taking a `SubDataFrame` is a flexible functionality allowing you to perform complex operations on your data. However, you should bear in mind two aspects: From 1ace7439dc1a555a7f840eeb7a06fed5dde3a6cf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Wed, 23 Nov 2022 11:24:45 +0100 Subject: [PATCH 07/13] switch to 'column-independent operations' --- docs/src/man/split_apply_combine.md | 32 ++++++++++++++--------------- src/abstractdataframe/selection.jl | 4 ++-- 2 files changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 0d9dcf95eb..8d0467ce18 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -69,7 +69,7 @@ each subset of the `DataFrame`. This specification can be of the following forms except `AsTable` are allowed). 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which must be single name (as a `Symbol` or a string), a vector of names or `AsTable`. -5. context-dependent expressions `function => target_cols` or just `function` +5. column-independent operations `function => target_cols` or just `function` for specific `function`s where the input columns are omitted; without `target_cols` the new column has the same name as `function`, otherwise it must be single name (as a `Symbol` or a string). 
Supported `function`s are: @@ -894,17 +894,17 @@ julia> df 6 │ 3 missing 6 ``` -## Context-dependent expressions +## Column-independent operations The operation specification language used with `combine`, `select` and `transform` -supports the following context-dependent operations: +supports the following column-independent operations: * getting the number of rows in a group (`nrow`); * getting the proportion of rows in a group (`proprow`); * getting the group number (`groupindices`); * getting a vector of indices within groups (`eachindex`). -These operations are context-dependent, because they do not require specifying the input column +These operations are column-independent, because they do not require specifying the input column name in the operation specification syntax. These four exceptions to the standard operation specification syntax were @@ -985,8 +985,8 @@ julia> combine(gdf, nrow => "transaction_count") ``` Note that in both cases we did not pass source column name as it is not needed -to determine the number of rows per group. This is the reason why context-dependent -expressions are exceptions to standard operation specification syntax. +to determine the number of rows per group. This is the reason why column-independent +operations are exceptions to standard operation specification syntax. The `nrow` expression also works in the operation specification syntax applied to a data frame. Here is an example: @@ -1015,8 +1015,8 @@ easier to remember this exception. ### Getting the proportion of rows If you want to get a proportion of rows per group in a `GroupedDataFrame` -you can use the `proprow` and `proprow => [target column name]` context-dependent -expressions. Here are some examples: +you can use the `proprow` and `proprow => [target column name]` column-independent +operations. Here are some examples: ```jldoctest sac julia> combine(gdf, proprow) @@ -1044,7 +1044,7 @@ specification syntax and is only allowed when processing a `GroupedDataFrame`. 
### Getting the group number Another common operation is getting group number. Use the `groupindices` and -`groupindices => [target column name]` context-dependent expressions to get it: +`groupindices => [target column name]` column-independent operations to get it: ```jldoctest sac @@ -1096,7 +1096,7 @@ julia> groupindices(gdf) ### Getting a vector of indices within groups -The last context-dependent expression supported by the operation +The last column-independent operation supported by the operation specification syntax is getting the index of each row within each group: @@ -1188,13 +1188,13 @@ julia> combine(gdf, eachindex, :customer_id => eachindex) ``` -## Context-dependent expressions versus functions +## Column-independent operations versus functions -When discussing context dependent expressions it is important to remember +When discussing column-independent operations it is important to remember that operation specification syntax allows you to pass a function (without source and target column names), in which case such a function gets passed a `SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an -example: +example comparing column-independent operation and a function: ```jldoctest sac julia> combine(gdf, nrow, x -> nrow(x)) @@ -1208,7 +1208,7 @@ julia> combine(gdf, nrow, x -> nrow(x)) ``` Notice that columns `:nrow` and `:x1` have an identical contents. This is -expected. We already know that `nrow` is a context dependent expression +expected. We already know that `nrow` is a column-independent operation generating the `:nrow` column with number of rows per group. However, the `x -> nrow(x)` anonymous function does exactly the same as it gets a `SubDataFrame` as its argument and returns its number of rows (the `:x1` column @@ -1224,8 +1224,8 @@ two aspects: comparison to just passing a function taking a `SubDataFrame`. 
* Although writing `row`, `proprow`, `groupindices`, and `eachindex` looks like just passing a function they **do not** take a `SubDataFrame` as their - argument. As we explained in this section, they are special context-dependent - expressions that are exceptions to the standard operation specification syntax + argument. As we explained in this section, they are special column-independent + operations that are exceptions to the standard operation specification syntax rules. They were added for user convenience (and at the same time they are optimized to be fast). diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl index 6cd4d4787d..bb1f8a070a 100644 --- a/src/abstractdataframe/selection.jl +++ b/src/abstractdataframe/selection.jl @@ -75,7 +75,7 @@ const TRANSFORMATION_COMMON_RULES = except `AsTable` are allowed). 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which must be single name (as a `Symbol` or a string), a vector of names or `AsTable`. - 5. context-dependent expressions `function => target_cols` or just `function` + 5. column-independent operations `function => target_cols` or just `function` for specific `function`s where the input columns are omitted; without `target_cols` the new column has the same name as `function`, otherwise it must be single name (as a `Symbol` or a string). 
Supported `function`s are: @@ -1267,7 +1267,7 @@ julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false) 8 │ 2 1 8 9 ``` -# context-dependent expressions +# column-independent operations ```jldoctest julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2], b=repeat([2, 1], outer=[4]), From 010f68ef3fc9d40d083e9d36537144b422db30f0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Tue, 29 Nov 2022 17:21:39 +0100 Subject: [PATCH 08/13] Apply suggestions from code review Co-authored-by: Milan Bouchet-Valat --- docs/src/man/split_apply_combine.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 8d0467ce18..ed8b605f98 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -1194,7 +1194,7 @@ When discussing column-independent operations it is important to remember that operation specification syntax allows you to pass a function (without source and target column names), in which case such a function gets passed a `SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an -example comparing column-independent operation and a function: +example comparing a column-independent operation and a function: ```jldoctest sac julia> combine(gdf, nrow, x -> nrow(x)) @@ -1207,7 +1207,7 @@ julia> combine(gdf, nrow, x -> nrow(x)) 3 │ c 2 2 ``` -Notice that columns `:nrow` and `:x1` have an identical contents. This is +Notice that columns `:nrow` and `:x1` have identical contents. This is expected. We already know that `nrow` is a column-independent operation generating the `:nrow` column with number of rows per group. 
However, the `x -> nrow(x)` anonymous function does exactly the same as it gets a From 20a0f90bebf178caefe56862ef14e91bb552e15d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Tue, 29 Nov 2022 17:49:40 +0100 Subject: [PATCH 09/13] improve explanations --- docs/src/man/split_apply_combine.md | 63 +++++++++++++++++++++++------ 1 file changed, 50 insertions(+), 13 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index ed8b605f98..7c3e1a05e0 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -1197,7 +1197,41 @@ source and target column names), in which case such a function gets passed a example comparing a column-independent operation and a function: ```jldoctest sac -julia> combine(gdf, nrow, x -> nrow(x)) +julia> combine(gdf, eachindex, sdf -> axes(sdf, 1)) +6×3 DataFrame + Row │ customer_id eachindex x1 + │ String Int64 Int64 +─────┼─────────────────────────────── + 1 │ a 1 1 + 2 │ b 1 1 + 3 │ b 2 2 + 4 │ b 3 3 + 5 │ c 1 1 + 6 │ c 2 2 +``` + +Notice that column independent operation `eachindex` produces the same result +as using anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame` +as its first argument and returns indices along its first axes. +Importantly without special definition of column-independent operation +the `eachindex` function would fail when being passed as you can see here: + +```jldoctest sac +julia> combine(gdf, eachindex, sdf -> eachindex(sdf)) +ERROR: MethodError: no method matching keys(::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}) +``` + +The reason for this error is that `eachindex` function does not allow passing a +`SubDataFrame` as its argument. + +The same situation is with `proprow` and `groupindices`. They would not work +with a `SubDataFrame` as stand-alone functions. + +A bit different case is with `nrow` column-independent operation. 
In this case +the `nrow` function accepts `SubDataFrame` as an argument: + +```jldoctest sac +julia> combine(gdf, nrow, sdf -> nrow(sdf)) 3×3 DataFrame Row │ customer_id nrow x1 │ String Int64 Int64 @@ -1207,12 +1241,13 @@ julia> combine(gdf, nrow, x -> nrow(x)) 3 │ c 2 2 ``` -Notice that columns `:nrow` and `:x1` have identical contents. This is -expected. We already know that `nrow` is a column-independent operation -generating the `:nrow` column with number of rows per group. However, the -`x -> nrow(x)` anonymous function does exactly the same as it gets a -`SubDataFrame` as its argument and returns its number of rows (the `:x1` column -name is a default auto-generated column name in this case). +Notice that columns `:nrow` and `:x1` have identical contents, but the +difference is that they do not have the same names. `nrow` is a +column-independent operation generating the `:nrow` column name by default with +number of rows per group. On the other hand, the `sdf -> nrow(sdf)` anonymous +function does gets a `SubDataFrame` as its argument and returns its number of +rows. The `:x1` column name is a default auto-generated column name when +processing anonymous functions. Passing a function taking a `SubDataFrame` is a flexible functionality allowing you to perform complex operations on your data. However, you should bear in mind @@ -1222,10 +1257,12 @@ two aspects: names are passed) will lead to faster execution of your code (as the Julia compiler is able to better optimize execution of such operations) in comparison to just passing a function taking a `SubDataFrame`. -* Although writing `row`, `proprow`, `groupindices`, and `eachindex` looks like - just passing a function they **do not** take a `SubDataFrame` as their - argument. As we explained in this section, they are special column-independent - operations that are exceptions to the standard operation specification syntax - rules. 
They were added for user convenience (and at the same time they are - optimized to be fast). +* Although writing `nrow`, `proprow`, `groupindices`, and `eachindex` looks + like just passing a function they internally **do not** take a `SubDataFrame` + as their argument. As we explained in this section, `proprow`, + `groupindices`, and `eachindex` would not work with `SubDataFrame` as their + argument, and `nrow` would work, but would prouce a different column name. + Instead, these four operations are special column-independent operations that + are exceptions to the standard operation specification syntax rules. They + were added for user convenience. From 70e385ab94374a20ade3bd0e46ebf65e2cd12809 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Wed, 30 Nov 2022 08:42:17 +0100 Subject: [PATCH 10/13] Apply suggestions from code review Co-authored-by: Milan Bouchet-Valat --- docs/src/man/split_apply_combine.md | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 7c3e1a05e0..71508f3413 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -1210,24 +1210,24 @@ julia> combine(gdf, eachindex, sdf -> axes(sdf, 1)) 6 │ c 2 2 ``` -Notice that column independent operation `eachindex` produces the same result -as using anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame` +Notice that the column-independent operation `eachindex` produces the same result +as using the anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame` as its first argument and returns indices along its first axes. 
-Importantly without special definition of column-independent operation +Importantly if it wasn't defined as a column-independent operation the `eachindex` function would fail when being passed as you can see here: ```jldoctest sac -julia> combine(gdf, eachindex, sdf -> eachindex(sdf)) +julia> combine(gdf, sdf -> eachindex(sdf)) ERROR: MethodError: no method matching keys(::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}) ``` -The reason for this error is that `eachindex` function does not allow passing a +The reason for this error is that the `eachindex` function does not allow passing a `SubDataFrame` as its argument. -The same situation is with `proprow` and `groupindices`. They would not work +The same applies to `proprow` and `groupindices`: they would not work with a `SubDataFrame` as stand-alone functions. -A bit different case is with `nrow` column-independent operation. In this case +The `nrow` column-independent operation is a different case, as the `nrow` function accepts `SubDataFrame` as an argument: ```jldoctest sac @@ -1246,7 +1246,7 @@ difference is that they do not have the same names. `nrow` is a column-independent operation generating the `:nrow` column name by default with number of rows per group. On the other hand, the `sdf -> nrow(sdf)` anonymous function does gets a `SubDataFrame` as its argument and returns its number of -rows. The `:x1` column name is a default auto-generated column name when +rows. The `:x1` column name is the default auto-generated column name when processing anonymous functions. Passing a function taking a `SubDataFrame` is a flexible functionality allowing @@ -1254,14 +1254,15 @@ you to perform complex operations on your data. 
However, you should bear in mind two aspects: * Using the full operation specification syntax (where source and target column - names are passed) will lead to faster execution of your code (as the Julia - compiler is able to better optimize execution of such operations) in - comparison to just passing a function taking a `SubDataFrame`. + names are passed) or column-independent operations will lead to faster + execution of your code (as the Julia compiler is able to better optimize + execution of such operations) in comparison to passing a function + taking a `SubDataFrame`. * Although writing `nrow`, `proprow`, `groupindices`, and `eachindex` looks like just passing a function they internally **do not** take a `SubDataFrame` as their argument. As we explained in this section, `proprow`, `groupindices`, and `eachindex` would not work with `SubDataFrame` as their - argument, and `nrow` would work, but would prouce a different column name. + argument, and `nrow` would work, but would produce a different column name. Instead, these four operations are special column-independent operations that are exceptions to the standard operation specification syntax rules. They were added for user convenience. 
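A minimal sketch of the naming difference described in the hunk above, assuming DataFrames.jl is installed and reusing the `df`/`gdf` example from the manual section being patched. The variable names `by_op`, `by_fun`, and `named_op` are illustrative only and do not appear in the patch.

```julia
using DataFrames

# Same toy data as in the manual section this patch edits.
df = DataFrame(customer_id=["a", "b", "b", "b", "c", "c"],
               transaction_id=[12, 15, 19, 17, 13, 11],
               volume=[2, 3, 1, 4, 5, 9])
gdf = groupby(df, :customer_id, sort=true)

# Column-independent operation: the output column is named :nrow by default.
by_op = combine(gdf, nrow)

# Anonymous function taking a SubDataFrame: same values,
# but the column gets the auto-generated name :x1.
by_fun = combine(gdf, sdf -> nrow(sdf))

# The column-independent form also accepts an explicit target name.
named_op = combine(gdf, nrow => :transaction_count)

println(names(by_op))     # column names of the first result
println(names(by_fun))    # column names of the second result
println(names(named_op))  # column names of the third result
```

Both `by_op` and `by_fun` hold the same per-group counts; only the generated column name differs, which is the distinction the reworded bullet draws.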
From 8eab2ed6eca1fe6f4f5c093546d1cbf871861fb4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Wed, 30 Nov 2022 09:19:14 +0100 Subject: [PATCH 11/13] update iteration and indexing examples --- docs/src/man/split_apply_combine.md | 231 +++++++++++++--------------- 1 file changed, 111 insertions(+), 120 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 71508f3413..50a98841b6 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -519,6 +519,94 @@ Number of data points for Iris-versicolor: 50 Number of data points for Iris-virginica: 50 ``` +The value of `key` in the example above where we iterated `pairs(iris_gdf)` is +a [`DataFrames.GroupKey`](@ref) object, which can be used in a similar fashion +to a `NamedTuple`. + +Grouping a data frame using the `groupby` function can be seen as adding a +lookup key to it. Such lookups can be performed efficiently by indexing the +resulting `GroupedDataFrame` with [`DataFrames.GroupKey`](@ref) (as it was +presented above) a `Tuple`, a `NamedTuple`, or a dictionary. Here are some +more examples of such indexing. 
+ +```jldoctest sac +julia> iris_gdf[(Species="Iris-virginica",)] # a NamedTuple +50×5 SubDataFrame + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼────────────────────────────────────────────────────────────────── + 1 │ 6.3 3.3 6.0 2.5 Iris-virginica + 2 │ 5.8 2.7 5.1 1.9 Iris-virginica + 3 │ 7.1 3.0 5.9 2.1 Iris-virginica + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 48 │ 6.5 3.0 5.2 2.0 Iris-virginica + 49 │ 6.2 3.4 5.4 2.3 Iris-virginica + 50 │ 5.9 3.0 5.1 1.8 Iris-virginica + 44 rows omitted + +julia> iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]] # a vector of Tuples +GroupedDataFrame with 2 groups based on key: Species +First Group (50 rows): Species = "Iris-virginica" + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼────────────────────────────────────────────────────────────────── + 1 │ 6.3 3.3 6.0 2.5 Iris-virginica + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 50 │ 5.9 3.0 5.1 1.8 Iris-virginica + 48 rows omitted +⋮ +Last Group (50 rows): Species = "Iris-setosa" + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼─────────────────────────────────────────────────────────────── + 1 │ 5.1 3.5 1.4 0.2 Iris-setosa + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 50 │ 5.0 3.3 1.4 0.2 Iris-setosa + 48 rows omitted + +julia> key = keys(iris_gdf) |> last # last key in iris_gdf +GroupKey: (Species = String15("Iris-virginica"),) + +julia> iris_gdf[key] +50×5 SubDataFrame + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼────────────────────────────────────────────────────────────────── + 1 │ 6.3 3.3 6.0 2.5 Iris-virginica + 2 │ 5.8 2.7 5.1 1.9 Iris-virginica + 3 │ 7.1 3.0 5.9 2.1 Iris-virginica + 4 │ 6.3 2.9 5.6 1.8 Iris-virginica + 5 │ 6.5 3.0 5.8 2.2 Iris-virginica + 6 │ 7.6 3.0 6.6 2.1 Iris-virginica + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 45 │ 6.7 3.3 5.7 2.5 Iris-virginica + 46 │ 6.7 3.0 5.2 2.3 Iris-virginica + 
47 │ 6.3 2.5 5.0 1.9 Iris-virginica + 48 │ 6.5 3.0 5.2 2.0 Iris-virginica + 49 │ 6.2 3.4 5.4 2.3 Iris-virginica + 50 │ 5.9 3.0 5.1 1.8 Iris-virginica + 38 rows omitted +julia> iris_gdf[Dict("Species" => "Iris-setosa")] # a dictionary +50×5 SubDataFrame + Row │ SepalLength SepalWidth PetalLength PetalWidth Species + │ Float64 Float64 Float64 Float64 String15 +─────┼─────────────────────────────────────────────────────────────── + 1 │ 5.1 3.5 1.4 0.2 Iris-setosa + 2 │ 4.9 3.0 1.4 0.2 Iris-setosa + 3 │ 4.7 3.2 1.3 0.2 Iris-setosa + 4 │ 4.6 3.1 1.5 0.2 Iris-setosa + 5 │ 5.0 3.6 1.4 0.2 Iris-setosa + 6 │ 5.4 3.9 1.7 0.4 Iris-setosa + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 45 │ 5.1 3.8 1.9 0.4 Iris-setosa + 46 │ 4.8 3.0 1.4 0.3 Iris-setosa + 47 │ 5.1 3.8 1.6 0.2 Iris-setosa + 48 │ 4.6 3.2 1.4 0.2 Iris-setosa + 49 │ 5.3 3.7 1.5 0.2 Iris-setosa + 50 │ 5.0 3.3 1.4 0.2 Iris-setosa + 38 rows omitted +``` + Note that although `GroupedDataFrame` is iterable and indexable it is not an `AbstractVector`. For this reason currently it was decided that it does not support `map` nor broadcasting (to allow for making a decision in the future @@ -527,12 +615,6 @@ data frame and get a vector of results either use a comprehension or `collect` `GroupedDataFrame` into a vector first. Here are examples of both approaches: ```jldoctest sac -julia> [nrow(sdf) for sdf in iris_gdf] -3-element Vector{Int64}: - 50 - 50 - 50 - julia> sdf_vec = collect(iris_gdf) 3-element Vector{Any}: 50×5 SubDataFrame @@ -612,123 +694,32 @@ julia> nrow.(sdf_vec) 50 ``` -Note that using the split-apply-combine strategy with operation specification -syntax in `combine`, `select` or `transform` will usually be faster than iterating -a `GroupedDataFrame`. +Since `GroupedDataFrame` is iterable, you can achieve the same result with a +comprehension: -The value of `key` in the example above where we iterated `pairs(iris_gdf)` -is a [`DataFrames.GroupKey`](@ref) object, -which can be used in a similar fashion to a `NamedTuple`. 
+```jldoctest sac +julia> [nrow(sdf) for sdf in iris_gdf] +3-element Vector{Int64}: + 50 + 50 + 50 +``` -Grouping a data frame using the `groupby` function can be seen as adding a -lookup key to it. Such lookups can be performed efficiently by indexing the -resulting `GroupedDataFrame` with [`DataFrames.GroupKey`](@ref) (as it was -presented aboce) a `Tuple`, a `NamedTuple`, or a dictionary. Here are some -more examples of such indexing. +Note that using the split-apply-combine strategy with operation specification +syntax in `combine`, `select` or `transform` will usually be faster for large +`GroupedDataFrame` object than iterating it, with the difference that they +produce a data frame. For the above examples an operation corresponding +to the examples above is: -```jldoctest sac -julia> df = DataFrame(g=repeat(1:1000, inner=5), x=1:5000) -5000×2 DataFrame - Row │ g x - │ Int64 Int64 -──────┼────────────── - 1 │ 1 1 - 2 │ 1 2 - 3 │ 1 3 - 4 │ 1 4 - 5 │ 1 5 - 6 │ 2 6 - 7 │ 2 7 - 8 │ 2 8 - ⋮ │ ⋮ ⋮ - 4994 │ 999 4994 - 4995 │ 999 4995 - 4996 │ 1000 4996 - 4997 │ 1000 4997 - 4998 │ 1000 4998 - 4999 │ 1000 4999 - 5000 │ 1000 5000 - 4985 rows omitted - -julia> gd = groupby(df, :g) -GroupedDataFrame with 1000 groups based on key: g -First Group (5 rows): g = 1 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 1 1 - 2 │ 1 2 - 3 │ 1 3 - 4 │ 1 4 - 5 │ 1 5 -⋮ -Last Group (5 rows): g = 1000 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 1000 4996 - 2 │ 1000 4997 - 3 │ 1000 4998 - 4 │ 1000 4999 - 5 │ 1000 5000 - -julia> gd[(g=500,)] # a NamedTuple -5×2 SubDataFrame - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 500 2496 - 2 │ 500 2497 - 3 │ 500 2498 - 4 │ 500 2499 - 5 │ 500 2500 - -julia> gd[[(500,), (501,)]] # a vector of Tuples -GroupedDataFrame with 2 groups based on key: g -First Group (5 rows): g = 500 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 500 2496 - 2 │ 500 2497 - 3 │ 500 2498 - 4 │ 500 2499 - 5 │ 500 2500 -⋮ -Last Group (5 
rows): g = 501 - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 501 2501 - 2 │ 501 2502 - 3 │ 501 2503 - 4 │ 501 2504 - 5 │ 501 2505 - -julia> key = keys(gd) |> last # first key in gd -GroupKey: (g = 1000,) - -julia> gd[key] -5×2 SubDataFrame - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 1000 4996 - 2 │ 1000 4997 - 3 │ 1000 4998 - 4 │ 1000 4999 - 5 │ 1000 5000 - -julia> gd[Dict("g" => 1000)] # a dictionary -5×2 SubDataFrame - Row │ g x - │ Int64 Int64 -─────┼────────────── - 1 │ 1000 4996 - 2 │ 1000 4997 - 3 │ 1000 4998 - 4 │ 1000 4999 - 5 │ 1000 5000 +``` +julia> combine(iris_gdf, nrow) +3×2 DataFrame + Row │ Species nrow + │ String15 Int64 +─────┼──────────────────────── + 1 │ Iris-setosa 50 + 2 │ Iris-versicolor 50 + 3 │ Iris-virginica 50 ``` ## Simulating the SQL `where` clause From 2a1009dbb4b94b04eadb3e69cabc34d0546660ab Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Wed, 30 Nov 2022 10:59:41 +0100 Subject: [PATCH 12/13] fix docs output --- docs/src/man/split_apply_combine.md | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 50a98841b6..50b7af151f 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -538,11 +538,20 @@ julia> iris_gdf[(Species="Iris-virginica",)] # a NamedTuple 1 │ 6.3 3.3 6.0 2.5 Iris-virginica 2 │ 5.8 2.7 5.1 1.9 Iris-virginica 3 │ 7.1 3.0 5.9 2.1 Iris-virginica + 4 │ 6.3 2.9 5.6 1.8 Iris-virginica + 5 │ 6.5 3.0 5.8 2.2 Iris-virginica + 6 │ 7.6 3.0 6.6 2.1 Iris-virginica + 7 │ 4.9 2.5 4.5 1.7 Iris-virginica + 8 │ 7.3 2.9 6.3 1.8 Iris-virginica ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 44 │ 6.8 3.2 5.9 2.3 Iris-virginica + 45 │ 6.7 3.3 5.7 2.5 Iris-virginica + 46 │ 6.7 3.0 5.2 2.3 Iris-virginica + 47 │ 6.3 2.5 5.0 1.9 Iris-virginica 48 │ 6.5 3.0 5.2 2.0 Iris-virginica 49 │ 6.2 3.4 5.4 2.3 Iris-virginica 50 │ 5.9 3.0 5.1 1.8 Iris-virginica - 44 rows omitted 
+ 35 rows omitted julia> iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]] # a vector of Tuples GroupedDataFrame with 2 groups based on key: Species @@ -551,18 +560,21 @@ First Group (50 rows): Species = "Iris-virginica" │ Float64 Float64 Float64 Float64 String15 ─────┼────────────────────────────────────────────────────────────────── 1 │ 6.3 3.3 6.0 2.5 Iris-virginica + 2 │ 5.8 2.7 5.1 1.9 Iris-virginica ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 49 │ 6.2 3.4 5.4 2.3 Iris-virginica 50 │ 5.9 3.0 5.1 1.8 Iris-virginica - 48 rows omitted + 46 rows omitted ⋮ Last Group (50 rows): Species = "Iris-setosa" Row │ SepalLength SepalWidth PetalLength PetalWidth Species │ Float64 Float64 Float64 Float64 String15 ─────┼─────────────────────────────────────────────────────────────── 1 │ 5.1 3.5 1.4 0.2 Iris-setosa + 2 │ 4.9 3.0 1.4 0.2 Iris-setosa ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ 50 │ 5.0 3.3 1.4 0.2 Iris-setosa - 48 rows omitted + 47 rows omitted julia> key = keys(iris_gdf) |> last # last key in iris_gdf GroupKey: (Species = String15("Iris-virginica"),) @@ -578,14 +590,18 @@ julia> iris_gdf[key] 4 │ 6.3 2.9 5.6 1.8 Iris-virginica 5 │ 6.5 3.0 5.8 2.2 Iris-virginica 6 │ 7.6 3.0 6.6 2.1 Iris-virginica + 7 │ 4.9 2.5 4.5 1.7 Iris-virginica + 8 │ 7.3 2.9 6.3 1.8 Iris-virginica ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 44 │ 6.8 3.2 5.9 2.3 Iris-virginica 45 │ 6.7 3.3 5.7 2.5 Iris-virginica 46 │ 6.7 3.0 5.2 2.3 Iris-virginica 47 │ 6.3 2.5 5.0 1.9 Iris-virginica 48 │ 6.5 3.0 5.2 2.0 Iris-virginica 49 │ 6.2 3.4 5.4 2.3 Iris-virginica 50 │ 5.9 3.0 5.1 1.8 Iris-virginica - 38 rows omitted + 35 rows omitted + julia> iris_gdf[Dict("Species" => "Iris-setosa")] # a dictionary 50×5 SubDataFrame Row │ SepalLength SepalWidth PetalLength PetalWidth Species @@ -597,14 +613,17 @@ julia> iris_gdf[Dict("Species" => "Iris-setosa")] # a dictionary 4 │ 4.6 3.1 1.5 0.2 Iris-setosa 5 │ 5.0 3.6 1.4 0.2 Iris-setosa 6 │ 5.4 3.9 1.7 0.4 Iris-setosa + 7 │ 4.6 3.4 1.4 0.3 Iris-setosa + 8 │ 5.0 3.4 1.5 0.2 Iris-setosa ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ + 44 │ 5.0 3.5 1.6 0.6 Iris-setosa 45 │ 5.1 3.8 1.9 
0.4 Iris-setosa 46 │ 4.8 3.0 1.4 0.3 Iris-setosa 47 │ 5.1 3.8 1.6 0.2 Iris-setosa 48 │ 4.6 3.2 1.4 0.2 Iris-setosa 49 │ 5.3 3.7 1.5 0.2 Iris-setosa 50 │ 5.0 3.3 1.4 0.2 Iris-setosa - 38 rows omitted + 35 rows omitted ``` Note that although `GroupedDataFrame` is iterable and indexable it is not an From 3fc1ddd9941db3fc43cc499136b0fdf3e48e1c60 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Thu, 1 Dec 2022 08:40:20 +0100 Subject: [PATCH 13/13] Apply suggestions from code review Co-authored-by: Milan Bouchet-Valat --- docs/src/man/split_apply_combine.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 50b7af151f..8961744a89 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -724,11 +724,10 @@ julia> [nrow(sdf) for sdf in iris_gdf] 50 ``` -Note that using the split-apply-combine strategy with operation specification +Note that using the split-apply-combine strategy with the operation specification syntax in `combine`, `select` or `transform` will usually be faster for large -`GroupedDataFrame` object than iterating it, with the difference that they -produce a data frame. For the above examples an operation corresponding -to the examples above is: +`GroupedDataFrame` objects than iterating them, with the difference that they +produce a data frame. An operation corresponding to the example above is: ``` julia> combine(iris_gdf, nrow) @@ -1220,7 +1219,7 @@ julia> combine(gdf, eachindex, sdf -> axes(sdf, 1)) 6 │ c 2 2 ``` -Notice that the column independent operation `eachindex` produces the same result +Notice that the column-independent operation `eachindex` produces the same result as using the anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame` as its first argument and returns indices along its first axes. Importantly if it wasn't defined as a column-independent operation