From cd432c0298523f85e4f5507a078841e38bab1d5b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Wed, 16 Nov 2022 19:51:53 +0100
Subject: [PATCH 01/13] explain context dependent expressions

---
 docs/src/man/split_apply_combine.md | 337 +++++++++++++++++++++++++++-
 src/abstractdataframe/selection.jl  |   4 +-
 2 files changed, 338 insertions(+), 3 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 5e34a2e175..56217f23de 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -67,7 +67,7 @@ each subset of the `DataFrame`. This specification can be of the following forms
    except `AsTable` are allowed).
 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
    must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
-5. special convenience forms `function => target_cols` or just `function`
+5. context dependent expressions `function => target_cols` or just `function`
    for specific `function`s where the input columns are omitted;
    without `target_cols` the new column has the same name as `function`, otherwise
    it must be single name (as a `Symbol` or a string). Supported `function`s are:
@@ -778,3 +778,338 @@ julia> df
    5 │     2  missing      5
    6 │     3  missing      6
 ```
+
+# Context dependent expressions
+
+Operation specification language supports the following context dependent
+operations:
+
+* getting the number of rows (`nrow`);
+* getting the proportion of rows (`proprow`);
+* getting the group number (`groupindices`);
+* getting a vector of group indices (`eachindex`).
+
+These operations are context dependent, because they do not require input column
+name in the operation specification syntax.
+
+These four exceptions to the standard operation specification syntax were
+introduced for user convenience as these operations are often needed in
+practice.
+
+Below each of them is explained by example.
+
+First create a data frame we will work with:
+
+```jldoctest sac
+julia> df = DataFrame(customer_id=["a", "b", "b", "b", "c", "c"],
+                      transaction_id=[12, 15, 19, 17, 13, 11],
+                      volume=[2, 3, 1, 4, 5, 9])
+6×3 DataFrame
+ Row │ customer_id  transaction_id  volume
+     │ String       Int64           Int64
+─────┼─────────────────────────────────────
+   1 │ a                        12       2
+   2 │ b                        15       3
+   3 │ b                        19       1
+   4 │ b                        17       4
+   5 │ c                        13       5
+   6 │ c                        11       9
+
+julia> gdf = groupby(df, :customer_id, sort=true);
+
+julia> show(gdf, allgroups=true)
+GroupedDataFrame with 3 groups based on key: customer_id
+Group 1 (1 row): customer_id = "a"
+ Row │ customer_id  transaction_id  volume
+     │ String       Int64           Int64
+─────┼─────────────────────────────────────
+   1 │ a                        12       2
+Group 2 (3 rows): customer_id = "b"
+ Row │ customer_id  transaction_id  volume
+     │ String       Int64           Int64
+─────┼─────────────────────────────────────
+   1 │ b                        15       3
+   2 │ b                        19       1
+   3 │ b                        17       4
+Group 3 (2 rows): customer_id = "c"
+ Row │ customer_id  transaction_id  volume
+     │ String       Int64           Int64
+─────┼─────────────────────────────────────
+   1 │ c                        13       5
+   2 │ c                        11       9
+```
+
+## Getting the number of rows
+
+You can get the number of rows per group in a `GroupedDataFrame` by just
+writing `nrow`, in which case the generated column name with the number of rows
+is `:nrow`:
+
+```jldoctest sac
+julia> combine(gdf, nrow)
+3×2 DataFrame
+ Row │ customer_id  nrow
+     │ String       Int64
+─────┼────────────────────
+   1 │ a                1
+   2 │ b                3
+   3 │ c                2
+```
+
+Additionally, you are allowed to pass a target column name:
+
+```jldoctest sac
+julia> combine(gdf, nrow => "transaction_count")
+3×2 DataFrame
+ Row │ customer_id  transaction_count
+     │ String       Int64
+─────┼────────────────────────────────
+   1 │ a                            1
+   2 │ b                            3
+   3 │ c                            2
+```
+
+Note that in both cases we did not pass source column name as it is not needed
+to determine the number of rows per group. This is the reason why context
+dependent expressions are exceptions to standard operation specification syntax.
+
+Additionally the `nrow` expression also works in operation specification syntax
+applied to a data frame. Here is an example:
+
+```jldoctest sac
+julia> combine(df, nrow => "transaction_count")
+1×1 DataFrame
+ Row │ transaction_count
+     │ Int64
+─────┼───────────────────
+   1 │                 6
+```
+
+Finally, recall that [`nrow`](@ref) is also a regular function that returns the
+number of rows in a data frame:
+
+
+```jldoctest sac
+julia> nrow(df)
+6
+```
+
+This dual-use of `nrow` does not lead to ambiguities, and is meant to make it
+easier to remember this exception.
+
+## Getting the proportion of rows
+
+If you want to get a proportion of rows per group in a `GroupedDataFrame`
+you can use the `proprow` and `proprow => [target column name]` context
+dependent expressions. Here are some examples:
+
+```jldoctest sac
+julia> combine(gdf, proprow)
+3×2 DataFrame
+ Row │ customer_id  proprow
+     │ String       Float64
+─────┼───────────────────────
+   1 │ a            0.166667
+   2 │ b            0.5
+   3 │ c            0.333333
+
+julia> combine(gdf, proprow => "transaction_fraction")
+3×2 DataFrame
+ Row │ customer_id  transaction_fraction
+     │ String       Float64
+─────┼───────────────────────────────────
+   1 │ a                        0.166667
+   2 │ b                        0.5
+   3 │ c                        0.333333
+```
+
+As opposed to `nrow`, `proprow` cannot be used outside of operation
+specification syntax and is only allowed when processing `GroupedDataFrame`.
+
+## Getting the group number
+
+Another common operation is getting group number. Use the `groupindices` and
+`groupindices => [target column name]` context dependent expressions to get it:
+
+
+```jldoctest sac
+julia> combine(gdf, groupindices)
+3×2 DataFrame
+ Row │ customer_id  groupindices
+     │ String       Int64
+─────┼───────────────────────────
+   1 │ a                       1
+   2 │ b                       2
+   3 │ c                       3
+
+julia> combine(gdf, groupindices => "group_number")
+3×2 DataFrame
+ Row │ customer_id  group_number
+     │ String       Int64
+─────┼───────────────────────────
+   1 │ a                       1
+   2 │ b                       2
+   3 │ c                       3
+```
+
+The `groupindices` name was chosen, because there exists the
+[`groupindices`](@ref) function that applied to `GroupedDataFrame` returns
+group indices for each row in the parent data frame of the passed
+`GroupedDataFrame`:
+
+```jldoctest sac
+julia> groupindices(gdf)
+6-element Vector{Union{Missing, Int64}}:
+ 1
+ 2
+ 2
+ 2
+ 3
+ 3
+```
+
+So as for `nrow` we see that the result is similar, but just in a different
+context (normal function call vs. operation specification syntax).
+
+## Getting a vector of group indices
+
+The last context dependent expression supported by operation is getting group
+indices. Use the `eachindex` and `eachindex => [target column name]` expressions
+to get it:
+
+
+```jldoctest sac
+julia> combine(gdf, eachindex)
+6×2 DataFrame
+ Row │ customer_id  eachindex
+     │ String       Int64
+─────┼────────────────────────
+   1 │ a                    1
+   2 │ b                    1
+   3 │ b                    2
+   4 │ b                    3
+   5 │ c                    1
+   6 │ c                    2
+
+julia> combine(gdf, eachindex => "transaction_number")
+6×2 DataFrame
+ Row │ customer_id  transaction_number
+     │ String       Int64
+─────┼─────────────────────────────────
+   1 │ a                             1
+   2 │ b                             1
+   3 │ b                             2
+   4 │ b                             3
+   5 │ c                             1
+   6 │ c                             2
+```
+
+Note that this operation also makes sense in a data frame context so it is
+also supported:
+
+```jldoctest sac
+julia> transform(df, eachindex)
+6×4 DataFrame
+ Row │ customer_id  transaction_id  volume  eachindex
+     │ String       Int64           Int64   Int64
+─────┼────────────────────────────────────────────────
+   1 │ a                        12       2          1
+   2 │ b                        15       3          2
+   3 │ b                        19       1          3
+   4 │ b                        17       4          4
+   5 │ c                        13       5          5
+   6 │ c                        11       9          6
+```
+
+Finally recall that `eachindex` is a standard function for getting all indices
+in an array. This similarity of functionality was the reason why this name was
+picked:
+
+```jldoctest sac
+julia> collect(eachindex(df.customer_id))
+6-element Vector{Int64}:
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+```
+
+This, for example, means that in the following example the two created columns
+have the same contents:
+
+```jldoctest sac
+julia> combine(gdf, eachindex, :customer_id => eachindex)
+6×3 DataFrame
+ Row │ customer_id  eachindex  customer_id_eachindex
+     │ String       Int64      Int64
+─────┼───────────────────────────────────────────────
+   1 │ a                    1                      1
+   2 │ b                    1                      1
+   3 │ b                    2                      2
+   4 │ b                    3                      3
+   5 │ c                    1                      1
+   6 │ c                    2                      2
+```
+
+
+## Passing a function in operation specification syntax
+
+When discussing context dependent expressions it is important to remember
+that operation specification syntax allows you to pass a function (without
+source and target column names), in which case such a function gets a
+`SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an
+example:
+
+```jldoctest sac
+julia> combine(gdf, nrow, x -> nrow(x))
+3×3 DataFrame
+ Row │ customer_id  nrow   x1
+     │ String       Int64  Int64
+─────┼───────────────────────────
+   1 │ a                1      1
+   2 │ b                3      3
+   3 │ c                2      2
+```
+
+Notice that columns `:nrow` and `:x1` have identical contents. This is
+expected. We already know that `nrow` is a context dependent expression
+generating the `:nrow` column with number of rows per group. However, the
+`x -> nrow(x)` anonymous function does exactly the same as it gets a
+`SubDataFrame` as its argument and returns its number of rows (the `:x1` column
+name is a default auto-generated column name in this case).
+
+To show you another example of passing a function consider the following case:
+
+```jldoctest sac
+julia> combine(gdf, :volume => sum, x -> sum(x.volume))
+3×3 DataFrame
+ Row │ customer_id  volume_sum  x1
+     │ String       Int64       Int64
+─────┼────────────────────────────────
+   1 │ a                     2      2
+   2 │ b                     8      8
+   3 │ c                    14     14
+```
+
+Again, both `:volume_sum` and `:x1` columns hold the same data. The reason
+is that in `:volume => sum` we just apply the `sum` function to the `:volume`
+column, while in `x -> sum(x.volume)` the `x` variable is a `SubDataFrame`
+representing the whole group.
+
+Passing a function taking a `SubDataFrame` is a flexible functionality allowing
+you to perform complex operations on your data. However, you should bear in mind
+two aspects:
+
+* Using full operation specification syntax (where source and target column
+  names are passed) will lead to faster execution of your code (as the Julia
+  compiler is able to better optimize execution of such operations) in
+  comparison to just passing a function taking a `SubDataFrame`.
+* Although writing `nrow`, `proprow`, `groupindices`, and `eachindex` looks like
+  just passing a function they **do not** take a `SubDataFrame` as their
+  argument. As we explained in this section, they are special context dependent
+  expressions that are exceptions to the standard operation specification syntax
+  rules. They were added for user convenience (and at the same time they are
+  optimized to be fast).
+
diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl
index 9e32989c68..68bf4cd313 100644
--- a/src/abstractdataframe/selection.jl
+++ b/src/abstractdataframe/selection.jl
@@ -75,7 +75,7 @@ const TRANSFORMATION_COMMON_RULES =
        except `AsTable` are allowed).
     4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
        must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
-    5. special convenience forms `function => target_cols` or just `function`
+    5. context dependent expressions `function => target_cols` or just `function`
        for specific `function`s where the input columns are omitted;
        without `target_cols` the new column has the same name as `function`, otherwise
        it must be single name (as a `Symbol` or a string). Supported `function`s are:
@@ -1267,7 +1267,7 @@ julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
    8 │     2      1      8      9
 ```
 
-# special convenience transformations
+# context dependent expressions
 ```jldoctest
 julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
                       b=repeat([2, 1], outer=[4]),

From 264ec2d55bd2ceeccd64201d901484892ff5eaab Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Thu, 17 Nov 2022 14:32:29 +0100
Subject: [PATCH 02/13] explain that GroupedDataFrame is indexable and iterable

---
 docs/src/man/split_apply_combine.md | 331 ++++++++++++++++++----------
 1 file changed, 218 insertions(+), 113 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 56217f23de..6dc5d238b4 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -1,5 +1,7 @@
 # The Split-Apply-Combine Strategy
 
+## Design of the split-apply-combine support
+
 Many data analysis tasks involve three steps:
 1. splitting a data set into groups,
 2. applying some functions to each of the groups,
@@ -186,6 +188,8 @@ for details):
 - `threads` : whether transformations may be run in separate tasks which can execute
   in parallel
 
+## Examples of the split-apply-combine operations
+
 We show several examples of these functions applied to the `iris` dataset below:
 
 ```jldoctest sac
@@ -385,7 +389,134 @@ julia> combine(gdf) do df
    3 │ Iris-virginica     5.552  0.304588
 ```
 
-If you only want to split the data set into subsets, use the [`groupby`](@ref) function:
+To apply a function to each non-grouping column of a `GroupedDataFrame` you can write:
+
+```jldoctest sac
+julia> gd = groupby(iris, :Species)
+GroupedDataFrame with 3 groups based on key: Species
+First Group (50 rows): Species = "Iris-setosa"
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼───────────────────────────────────────────────────────────────
+   1 │         5.1         3.5          1.4         0.2  Iris-setosa
+   2 │         4.9         3.0          1.4         0.2  Iris-setosa
+  ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
+  49 │         5.3         3.7          1.5         0.2  Iris-setosa
+  50 │         5.0         3.3          1.4         0.2  Iris-setosa
+                                                      46 rows omitted
+⋮
+Last Group (50 rows): Species = "Iris-virginica"
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼──────────────────────────────────────────────────────────────────
+   1 │         6.3         3.3          6.0         2.5  Iris-virginica
+   2 │         5.8         2.7          5.1         1.9  Iris-virginica
+  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  50 │         5.9         3.0          5.1         1.8  Iris-virginica
+                                                         47 rows omitted
+
+julia> combine(gd, valuecols(gd) .=> mean)
+3×5 DataFrame
+ Row │ Species          SepalLength_mean  SepalWidth_mean  PetalLength_mean  P ⋯
+     │ String15         Float64           Float64          Float64           F ⋯
+─────┼──────────────────────────────────────────────────────────────────────────
+   1 │ Iris-setosa                 5.006            3.418             1.464    ⋯
+   2 │ Iris-versicolor             5.936            2.77              4.26
+   3 │ Iris-virginica              6.588            2.974             5.552
+                                                                1 column omitted
+```
+
+Note that `GroupedDataFrame` is a view: therefore
+grouping columns of its parent data frame must not be mutated, and
+rows must not be added nor removed from it. If the number of rows
+of the parent changes then an error is thrown when a child `GroupedDataFrame`
+is used:
+```jldoctest sac
+julia> df = DataFrame(id=1:2)
+2×1 DataFrame
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │     2
+
+julia> gd = groupby(df, :id)
+GroupedDataFrame with 2 groups based on key: id
+First Group (1 row): id = 1
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+⋮
+Last Group (1 row): id = 2
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     2
+
+julia> push!(df, [3])
+3×1 DataFrame
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │     2
+   3 │     3
+
+julia> gd[1]
+ERROR: AssertionError: The current number of rows in the parent data frame is 3 and it does not match the number of rows it contained when GroupedDataFrame was created which was 2. The number of rows in the parent data frame has likely been changed unintentionally (e.g. using subset!, filter!, deleteat!, push!, or append! functions).
+```
+
+Sometimes it is useful to append rows to the source data frame of a
+`GroupedDataFrame`, without affecting the rows used for grouping.
+In such a scenario you can create the grouped data frame using a `view`
+of the parent data frame to avoid the error:
+
+```jldoctest sac
+julia> df = DataFrame(id=1:2)
+2×1 DataFrame
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │     2
+
+julia> gd = groupby(view(df, :, :), :id)
+GroupedDataFrame with 2 groups based on key: id
+First Group (1 row): id = 1
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+⋮
+Last Group (1 row): id = 2
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     2
+
+julia> push!(df, [3])
+3×1 DataFrame
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │     2
+   3 │     3
+
+julia> gd[1]
+1×1 SubDataFrame
+ Row │ id
+     │ Int64
+─────┼───────
+   1 │     1
+```
+
+## Using `GroupedDataFrame` as an itrable and indexable object
+
+If you only want to split the data set into subsets, use the [`groupby`](@ref)
+function. You can then iterate `SubDataFrame`s that constitute the identified
+groups:
 
 ```jldoctest sac
 julia> for subdf in groupby(iris, :Species)
@@ -494,129 +625,103 @@ Last Group (5 rows): g = 501
    5 │   501   2505
 ```
 
-In order to apply a function to each non-grouping column of a `GroupedDataFrame` you can write:
+Note that although `GroupedDataFrame` is iterable and indexable it is not an
+`AbstractVector`. For this reason currently it was designed that it does not
+support `map` nor broadcasting (to allow for making a decision in the future
+what result type they should produce). To apply a function to all groups of a
+data frame and get a vector of results either use a comprehension or `collect`
+`GroupedDataFrame` into a vector first. Here are examples of both approaches:
+
 ```jldoctest sac
-julia> gd = groupby(iris, :Species)
-GroupedDataFrame with 3 groups based on key: Species
-First Group (50 rows): Species = "Iris-setosa"
- Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
-     │ Float64      Float64     Float64      Float64     String15
+julia> [nrow(sdf) for sdf in gd]
+3-element Vector{Int64}:
+ 50
+ 50
+ 50
+
+julia> sdf_vec = collect(gd)
+3-element Vector{Any}:
+ 50×5 SubDataFrame
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species     
+     │ Float64      Float64     Float64      Float64     String15    
 ─────┼───────────────────────────────────────────────────────────────
    1 │         5.1         3.5          1.4         0.2  Iris-setosa
    2 │         4.9         3.0          1.4         0.2  Iris-setosa
+   3 │         4.7         3.2          1.3         0.2  Iris-setosa
+   4 │         4.6         3.1          1.5         0.2  Iris-setosa
+   5 │         5.0         3.6          1.4         0.2  Iris-setosa
+   6 │         5.4         3.9          1.7         0.4  Iris-setosa
+   7 │         4.6         3.4          1.4         0.3  Iris-setosa
+   8 │         5.0         3.4          1.5         0.2  Iris-setosa
   ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
+  44 │         5.0         3.5          1.6         0.6  Iris-setosa
+  45 │         5.1         3.8          1.9         0.4  Iris-setosa
+  46 │         4.8         3.0          1.4         0.3  Iris-setosa
+  47 │         5.1         3.8          1.6         0.2  Iris-setosa
+  48 │         4.6         3.2          1.4         0.2  Iris-setosa
   49 │         5.3         3.7          1.5         0.2  Iris-setosa
   50 │         5.0         3.3          1.4         0.2  Iris-setosa
-                                                      46 rows omitted
-⋮
-Last Group (50 rows): Species = "Iris-virginica"
+                                                      35 rows omitted
+ 50×5 SubDataFrame
  Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
-     │ Float64      Float64     Float64      Float64     String15
+     │ Float64      Float64     Float64      Float64     String15        
+─────┼───────────────────────────────────────────────────────────────────
+   1 │         7.0         3.2          4.7         1.4  Iris-versicolor
+   2 │         6.4         3.2          4.5         1.5  Iris-versicolor
+   3 │         6.9         3.1          4.9         1.5  Iris-versicolor
+   4 │         5.5         2.3          4.0         1.3  Iris-versicolor
+   5 │         6.5         2.8          4.6         1.5  Iris-versicolor
+   6 │         5.7         2.8          4.5         1.3  Iris-versicolor
+   7 │         6.3         3.3          4.7         1.6  Iris-versicolor
+   8 │         4.9         2.4          3.3         1.0  Iris-versicolor
+  ⋮  │      ⋮           ⋮            ⋮           ⋮              ⋮
+  44 │         5.0         2.3          3.3         1.0  Iris-versicolor
+  45 │         5.6         2.7          4.2         1.3  Iris-versicolor
+  46 │         5.7         3.0          4.2         1.2  Iris-versicolor
+  47 │         5.7         2.9          4.2         1.3  Iris-versicolor
+  48 │         6.2         2.9          4.3         1.3  Iris-versicolor
+  49 │         5.1         2.5          3.0         1.1  Iris-versicolor
+  50 │         5.7         2.8          4.1         1.3  Iris-versicolor
+                                                          35 rows omitted
+ 50×5 SubDataFrame
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species        
+     │ Float64      Float64     Float64      Float64     String15       
 ─────┼──────────────────────────────────────────────────────────────────
    1 │         6.3         3.3          6.0         2.5  Iris-virginica
    2 │         5.8         2.7          5.1         1.9  Iris-virginica
+   3 │         7.1         3.0          5.9         2.1  Iris-virginica
+   4 │         6.3         2.9          5.6         1.8  Iris-virginica
+   5 │         6.5         3.0          5.8         2.2  Iris-virginica
+   6 │         7.6         3.0          6.6         2.1  Iris-virginica
+   7 │         4.9         2.5          4.5         1.7  Iris-virginica
+   8 │         7.3         2.9          6.3         1.8  Iris-virginica
   ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  44 │         6.8         3.2          5.9         2.3  Iris-virginica
+  45 │         6.7         3.3          5.7         2.5  Iris-virginica
+  46 │         6.7         3.0          5.2         2.3  Iris-virginica
+  47 │         6.3         2.5          5.0         1.9  Iris-virginica
+  48 │         6.5         3.0          5.2         2.0  Iris-virginica
+  49 │         6.2         3.4          5.4         2.3  Iris-virginica
   50 │         5.9         3.0          5.1         1.8  Iris-virginica
-                                                         47 rows omitted
-
-julia> combine(gd, valuecols(gd) .=> mean)
-3×5 DataFrame
- Row │ Species          SepalLength_mean  SepalWidth_mean  PetalLength_mean  P ⋯
-     │ String15         Float64           Float64          Float64           F ⋯
-─────┼──────────────────────────────────────────────────────────────────────────
-   1 │ Iris-setosa                 5.006            3.418             1.464    ⋯
-   2 │ Iris-versicolor             5.936            2.77              4.26
-   3 │ Iris-virginica              6.588            2.974             5.552
-                                                                1 column omitted
-```
-
-Note that `GroupedDataFrame` is a view: therefore
-grouping columns of its parent data frame must not be mutated, and
-rows must not be added nor removed from it. If the number or rows
-of the parent changes then an error is thrown when a child `GroupedDataFrame`
-is used:
-```jldoctest sac
-julia> df = DataFrame(id=1:2)
-2×1 DataFrame
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-   2 │     2
-
-julia> gd = groupby(df, :id)
-GroupedDataFrame with 2 groups based on key: id
-First Group (1 row): id = 1
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-⋮
-Last Group (1 row): id = 2
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     2
-
-julia> push!(df, [3])
-3×1 DataFrame
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-   2 │     2
-   3 │     3
-
-julia> gd[1]
-ERROR: AssertionError: The current number of rows in the parent data frame is 3 and it does not match the number of rows it contained when GroupedDataFrame was created which was 2. The number of rows in the parent data frame has likely been changed unintentionally (e.g. using subset!, filter!, deleteat!, push!, or append! functions).
+                                                         35 rows omitted
+
+julia> map(nrow, sdf_vec)
+3-element Vector{Int64}:
+ 50
+ 50
+ 50
+
+julia> nrow.(sdf_vec)
+3-element Vector{Int64}:
+ 50
+ 50
+ 50
 ```
 
-Sometimes it is useful to append rows to the source data frame of a
-`GroupedDataFrame`, without affecting the rows used for grouping.
-In such a scenario you can create the grouped data frame using a `view`
-of the parent data frame to avoid the error:
-
-```jldoctest sac
-julia> df = DataFrame(id=1:2)
-2×1 DataFrame
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-   2 │     2
-
-julia> gd = groupby(view(df, :, :), :id)
-GroupedDataFrame with 2 groups based on key: id
-First Group (1 row): id = 1
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-⋮
-Last Group (1 row): id = 2
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     2
-
-julia> push!(df, [3])
-3×1 DataFrame
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-   2 │     2
-   3 │     3
-
-julia> gd[1]
-1×1 SubDataFrame
- Row │ id
-     │ Int64
-─────┼───────
-   1 │     1
-```
+Note, that using split-apply-combine strategy with operation specification
+syntax usually will be faster than iterating a `GroupedDataFrame`.
 
-# Simulating the SQL `where` clause
+## Simulating the SQL `where` clause
 
 You can conveniently work on subsets of a data frame by using `SubDataFrame`s.
 Operations performed on such objects can either create a new data frame or be
@@ -779,7 +884,7 @@ julia> df
    6 │     3  missing      6
 ```
 
-# Context dependent expressions
+## Context dependent expressions
 
 Operation specification language supports the following context dependent
 operations:
@@ -897,7 +1002,7 @@ julia> nrow(df)
 This dual-use of `nrow` does not lead to ambiguities, and is meant to make it
 easier to remember this exception.
 
-## Getting the proportion of rows
+### Getting the proportion of rows
 
 If you want to get a proportion of rows per group in a `GroupedDataFrame`
 you can use the `proprow` and `proprow => [target column name]` context
@@ -926,7 +1031,7 @@ julia> combine(gdf, proprow => "transaction_fraction")
 As opposed to `nrow`, `proprow` cannot be used outside of operation
 specification syntax and is only allowed when processing `GroupedDataFrame`.
 
-## Getting the group number
+### Getting the group number
 
 Another common operation is getting group number. Use the `groupindices` and
 `groupindices => [target column name]` context dependent expressions to get it:
@@ -971,7 +1076,7 @@ julia> groupindices(gdf)
 So as for `nrow` we see that the result is similar, but just in a different
 context (normal function call vs. operation specification syntax).
 
-## Getting a vector of group indices
+### Getting a vector of group indices
 
 The last context dependent expression supported by operation is getting group
 indices. Use the `eachindex` and `eachindex => [target column name]` expressions
@@ -1054,7 +1159,7 @@ julia> combine(gdf, eachindex, :customer_id => eachindex)
 ```
 
 
-## Passing a function in operation specification syntax
+### Passing a function in operation specification syntax
 
 When discussing context dependent expressions it is important to remember
 that operation specification syntax allows you to pass a function (without

From b6f48e084f78160b31c407de72d7c97cd8fa447a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Thu, 17 Nov 2022 14:35:27 +0100
Subject: [PATCH 03/13] fix typo

---
 docs/src/man/split_apply_combine.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 6dc5d238b4..684e6cc260 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -626,7 +626,7 @@ Last Group (5 rows): g = 501
 ```
 
 Note that although `GroupedDataFrame` is iterable and indexable it is not an
-`AbstractVector`. For this reason currently it was designed that it does not
+`AbstractVector`. For this reason currently it was decided that it does not
 support `map` nor broadcasting (to allow for making a decision in the future
 what result type they should produce). To apply a function to all groups of a
 data frame and get a vector of results either use a comprehension or `collect`

From b4560f3c171048ff41107b7e78492c82a8be163b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Thu, 17 Nov 2022 21:20:57 +0100
Subject: [PATCH 04/13] define gd properly

---
 docs/src/man/split_apply_combine.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 684e6cc260..1991839413 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -633,6 +633,8 @@ data frame and get a vector of results either use a comprehension or `collect`
 `GroupedDataFrame` into a vector first. Here are examples of both approaches:
 
 ```jldoctest sac
+julia> gd = groupby(iris, :Species);
+
 julia> [nrow(sdf) for sdf in gd]
 3-element Vector{Int64}:
  50

From 7d8e5e880efbfac78470528cd00a405c93037a11 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Mon, 21 Nov 2022 19:56:30 +0100
Subject: [PATCH 05/13] Apply suggestions from code review

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
---
 docs/src/man/split_apply_combine.md | 70 ++++++++++++++---------------
 src/abstractdataframe/selection.jl  |  4 +-
 2 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 1991839413..0fd278b9bd 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -69,7 +69,7 @@ each subset of the `DataFrame`. This specification can be of the following forms
    except `AsTable` are allowed).
 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
    must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
-5. context dependent expressions `function => target_cols` or just `function`
+5. context-dependent expressions `function => target_cols` or just `function`
    for specific `function`s where the input columns are omitted;
    without `target_cols` the new column has the same name as `function`, otherwise
    it must be single name (as a `Symbol` or a string). Supported `function`s are:
@@ -512,7 +512,7 @@ julia> gd[1]
    1 │     1
 ```
 
-## Using `GroupedDataFrame` as an itrable and indexable object
+## Using `GroupedDataFrame` as an iterable and indexable object
 
 If you only want to split the data set into subsets, use the [`groupby`](@ref)
 function. You can then iterate `SubDataFrame`s that constitute the identified
@@ -720,8 +720,9 @@ julia> nrow.(sdf_vec)
  50
 ```
 
-Note, that using split-apply-combine strategy with operation specification
-syntax usually will be faster than iterating a `GroupedDataFrame`.
+Note that using the split-apply-combine strategy with operation specification
+syntax in `combine`, `select` or `transform` will usually be faster than iterating
+a `GroupedDataFrame`.
 
 ## Simulating the SQL `where` clause
 
@@ -886,17 +887,17 @@ julia> df
    6 │     3  missing      6
 ```
 
-## Context dependent expressions
+## Context-dependent expressions
 
-Operation specification language supports the following context dependent
-operations:
+The operation specification language used with `combine`, `select` and `transform`
+supports the following context-dependent operations:
 
-* getting the number of rows (`nrow`);
-* getting the proportion of rows (`proprow`);
+* getting the number of rows in a group (`nrow`);
+* getting the proportion of rows in a group (`proprow`);
 * getting the group number (`groupindices`);
 * getting a vector of group indices (`eachindex`).
 
-These operations are context dependent, because they do not require input column
+These operations are context-dependent, because they do not require specifying the input column
 name in the operation specification syntax.
 
 These four exceptions to the standard operation specification syntax were
@@ -977,10 +978,10 @@ julia> combine(gdf, nrow => "transaction_count")
 ```
 
 Note that in both cases we did not pass source column name as it is not needed
-to determine the number of rows per group. This is the reason why context
-dependent expressions are exceptions to standard operation specification syntax.
+to determine the number of rows per group. This is the reason why context-dependent
+expressions are exceptions to standard operation specification syntax.
 
-Additionally the `nrow` expression also works in operation specification syntax
+The `nrow` expression also works in the operation specification syntax
 applied to a data frame. Here is an example:
 
 ```jldoctest sac
@@ -1001,14 +1002,14 @@ julia> nrow(df)
 6
 ```
 
-This dual-use of `nrow` does not lead to ambiguities, and is meant to make it
+This dual use of `nrow` does not lead to ambiguities, and is meant to make it
 easier to remember this exception.
 
 ### Getting the proportion of rows
 
 If you want to get a proportion of rows per group in a `GroupedDataFrame`
-you can use the `proprow` and `proprow => [target column name]` context
-dependent expressions. Here are some examples:
+you can use the `proprow` and `proprow => [target column name]` context-dependent
+expressions. Here are some examples:
 
 ```jldoctest sac
 julia> combine(gdf, proprow)
@@ -1030,13 +1031,13 @@ julia> combine(gdf, proprow => "transaction_fraction")
    3 │ c                        0.333333
 ```
 
-As opposed to `nrow`, `proprow` cannot be used outside of operation
-specification syntax and is only allowed when processing `GroupedDataFrame`.
+As opposed to `nrow`, `proprow` cannot be used outside of the operation
+specification syntax and is only allowed when processing a `GroupedDataFrame`.
 
 ### Getting the group number
 
 Another common operation is getting group number. Use the `groupindices` and
-`groupindices => [target column name]` context dependent expressions to get it:
+`groupindices => [target column name]` context-dependent expressions to get it:
 
 
 ```jldoctest sac
@@ -1059,10 +1060,9 @@ julia> combine(gdf, groupindices => "group_number")
    3 │ c                       3
 ```
 
-The `groupindices` name was chosen, because there exists the
-[`groupindices`](@ref) function that applied to `GroupedDataFrame` returns
-group indices for each row in the parent data frame of the passed
-`GroupedDataFrame`:
+Outside of the operation specification syntax, [`groupindices`](@ref)
+is also a regular function which returns group indices for each row
+in the parent data frame of the passed `GroupedDataFrame`:
 
 ```jldoctest sac
 julia> groupindices(gdf)
@@ -1075,14 +1075,10 @@ julia> groupindices(gdf)
  3
 ```
 
-So as for `nrow` we see that the result is similar, but just in a different
-context (normal function call vs. operation specification syntax).
+### Getting a vector of indices within groups
 
-### Getting a vector of group indices
-
-The last context dependent expression supported by operation is getting group
-indices. Use the `eachindex` and `eachindex => [target column name]` expressions
-to get it:
+The last context-dependent expression supported by the operation
+specification syntax is getting the index of each row within each group:
 
 
 ```jldoctest sac
@@ -1111,8 +1107,8 @@ julia> combine(gdf, eachindex => "transaction_number")
    6 │ c                             2
 ```
 
-Note that this operation also makes sense in a data frame context so it is
-also supported:
+Note that this operation also makes sense in a data frame context,
+where all rows are considered to be in the same group:
 
 ```jldoctest sac
 julia> transform(df, eachindex)
@@ -1161,11 +1157,11 @@ julia> combine(gdf, eachindex, :customer_id => eachindex)
 ```
 
 
-### Passing a function in operation specification syntax
+## Context-dependent expressions versus functions
 
 When discussing context dependent expressions it is important to remember
 that operation specification syntax allows you to pass a function (without
-source and target column names), in which case such a function get a
+source and target column names), in which case such a function gets passed a
 `SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an
 example:
 
@@ -1209,13 +1205,13 @@ Passing a function taking a `SubDataFrame` is a flexible functionality allowing
 you to perform complex operations on your data. However, you should bear in mind
 two aspects:
 
-* Using full operation specification syntax (where source and target column
-  names are passe) will lead to faster execution of your code (as Julia
+* Using the full operation specification syntax (where source and target column
+  names are passed) will lead to faster execution of your code (as the Julia
   compiler is able to better optimize execution of such operations) in
   comparison to just passing a function taking a `SubDataFrame`.
 * Although writing `row`, `proprow`, `groupindices`, and `eachindex` looks like
   just passing a function they **do not** take a `SubDataFrame` as their
-  argument. As we explained in this section, they are special context dependent
+  argument. As we explained in this section, they are special context-dependent
   expressions that are exceptions to the standard operation specification syntax
   rules. They were added for user convenience (and at the same time they are
   optimized to be fast).
diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl
index 68bf4cd313..6cd4d4787d 100644
--- a/src/abstractdataframe/selection.jl
+++ b/src/abstractdataframe/selection.jl
@@ -75,7 +75,7 @@ const TRANSFORMATION_COMMON_RULES =
        except `AsTable` are allowed).
     4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
        must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
-    5. context dependent expressions `function => target_cols` or just `function`
+    5. context-dependent expressions `function => target_cols` or just `function`
        for specific `function`s where the input columns are omitted;
        without `target_cols` the new column has the same name as `function`, otherwise
        it must be single name (as a `Symbol` or a string). Supported `function`s are:
@@ -1267,7 +1267,7 @@ julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
    8 │     2      1      8      9
 ```
 
-# context dependent expressions
+# context-dependent expressions
 ```jldoctest
 julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
                       b=repeat([2, 1], outer=[4]),

From 9acd38e88959e91307699e39baf563f1a5328909 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Mon, 21 Nov 2022 20:29:43 +0100
Subject: [PATCH 06/13] updates after code review

---
 docs/src/man/split_apply_combine.md | 311 +++++++++++++++-------------
 1 file changed, 162 insertions(+), 149 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 0fd278b9bd..0d9dcf95eb 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -220,7 +220,7 @@ julia> iris = CSV.read((joinpath(dirname(pathof(DataFrames)),
  150 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                         135 rows omitted
 
-julia> gdf = groupby(iris, :Species)
+julia> iris_gdf = groupby(iris, :Species)
 GroupedDataFrame with 3 groups based on key: Species
 First Group (50 rows): Species = "Iris-setosa"
  Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
@@ -243,7 +243,7 @@ Last Group (50 rows): Species = "Iris-virginica"
   50 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                          47 rows omitted
 
-julia> combine(gdf, :PetalLength => mean)
+julia> combine(iris_gdf, :PetalLength => mean)
 3×2 DataFrame
  Row │ Species          PetalLength_mean
      │ String15         Float64
@@ -252,7 +252,7 @@ julia> combine(gdf, :PetalLength => mean)
    2 │ Iris-versicolor             4.26
    3 │ Iris-virginica              5.552
 
-julia> combine(gdf, nrow, proprow, groupindices)
+julia> combine(iris_gdf, nrow, proprow, groupindices)
 3×4 DataFrame
  Row │ Species          nrow   proprow   groupindices
      │ String15         Int64  Float64   Int64
@@ -261,7 +261,7 @@ julia> combine(gdf, nrow, proprow, groupindices)
    2 │ Iris-versicolor     50  0.333333             2
    3 │ Iris-virginica      50  0.333333             3
 
-julia> combine(gdf, nrow, :PetalLength => mean => :mean)
+julia> combine(iris_gdf, nrow, :PetalLength => mean => :mean)
 3×3 DataFrame
  Row │ Species          nrow   mean
      │ String15         Int64  Float64
@@ -270,7 +270,9 @@ julia> combine(gdf, nrow, :PetalLength => mean => :mean)
    2 │ Iris-versicolor     50    4.26
    3 │ Iris-virginica      50    5.552
 
-julia> combine(gdf, [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s), b=sum(p))) =>
+julia> combine(iris_gdf,
+               [:PetalLength, :SepalLength] =>
+               ((p, s) -> (a=mean(p)/mean(s), b=sum(p))) =>
                AsTable) # multiple columns are passed as arguments
 3×3 DataFrame
  Row │ Species          a         b
@@ -280,7 +282,7 @@ julia> combine(gdf, [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s
    2 │ Iris-versicolor  0.717655    213.0
    3 │ Iris-virginica   0.842744    277.6
 
-julia> combine(gdf,
+julia> combine(iris_gdf,
                AsTable([:PetalLength, :SepalLength]) =>
                x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
 3×2 DataFrame
@@ -291,7 +293,7 @@ julia> combine(gdf,
    2 │ Iris-versicolor                          0.910378
    3 │ Iris-virginica                           0.867923
 
-julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), gdf) # passing a SubDataFrame
+julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), iris_gdf) # passing a SubDataFrame
 3×2 DataFrame
  Row │ Species          x1
      │ String15         Float64
@@ -300,7 +302,7 @@ julia> combine(x -> std(x.PetalLength) / std(x.SepalLength), gdf) # passing a Su
    2 │ Iris-versicolor  0.910378
    3 │ Iris-virginica   0.867923
 
-julia> combine(gdf, 1:2 => cor, nrow)
+julia> combine(iris_gdf, 1:2 => cor, nrow)
 3×3 DataFrame
  Row │ Species          SepalLength_SepalWidth_cor  nrow
      │ String15         Float64                     Int64
@@ -309,7 +311,7 @@ julia> combine(gdf, 1:2 => cor, nrow)
    2 │ Iris-versicolor                    0.525911     50
    3 │ Iris-virginica                     0.457228     50
 
-julia> combine(gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max])
+julia> combine(iris_gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max])
 3×3 DataFrame
  Row │ Species          min      max
      │ String15         Float64  Float64
@@ -321,7 +323,7 @@ julia> combine(gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max])
 
 To get row number for each observation within each group use the `eachindex` function:
 ```
-julia> combine(gdf, eachindex)
+julia> combine(iris_gdf, eachindex)
 150×2 DataFrame
  Row │ Species         eachindex
      │ String15        Int64
@@ -342,7 +344,7 @@ In the example below
 the return values in columns `:SepalLength_SepalWidth_cor` and `:nrow` are
 broadcasted to match the number of elements in each group:
 ```
-julia> select(gdf, 1:2 => cor)
+julia> select(iris_gdf, 1:2 => cor)
 150×2 DataFrame
  Row │ Species         SepalLength_SepalWidth_cor
      │ String          Float64
@@ -357,7 +359,7 @@ julia> select(gdf, 1:2 => cor)
  150 │ Iris-virginica                    0.457228
                                   143 rows omitted
 
-julia> transform(gdf, :Species => x -> chop.(x, head=5, tail=0))
+julia> transform(iris_gdf, :Species => x -> chop.(x, head=5, tail=0))
 150×6 DataFrame
  Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species         Species_function
      │ Float64      Float64     Float64      Float64     String          SubString…
@@ -377,7 +379,7 @@ All functions also support the `do` block form. However, as noted above,
 this form is slow and should therefore be avoided when performance matters.
 
 ```jldoctest sac
-julia> combine(gdf) do df
+julia> combine(iris_gdf) do df
            (m = mean(df.PetalLength), s² = var(df.PetalLength))
        end
 3×3 DataFrame
@@ -392,30 +394,7 @@ julia> combine(gdf) do df
 To apply a function to each non-grouping column of a `GroupedDataFrame` you can write:
 
 ```jldoctest sac
-julia> gd = groupby(iris, :Species)
-GroupedDataFrame with 3 groups based on key: Species
-First Group (50 rows): Species = "Iris-setosa"
- Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
-     │ Float64      Float64     Float64      Float64     String15
-─────┼───────────────────────────────────────────────────────────────
-   1 │         5.1         3.5          1.4         0.2  Iris-setosa
-   2 │         4.9         3.0          1.4         0.2  Iris-setosa
-  ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
-  49 │         5.3         3.7          1.5         0.2  Iris-setosa
-  50 │         5.0         3.3          1.4         0.2  Iris-setosa
-                                                      46 rows omitted
-⋮
-Last Group (50 rows): Species = "Iris-virginica"
- Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
-     │ Float64      Float64     Float64      Float64     String15
-─────┼──────────────────────────────────────────────────────────────────
-   1 │         6.3         3.3          6.0         2.5  Iris-virginica
-   2 │         5.8         2.7          5.1         1.9  Iris-virginica
-  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
-  50 │         5.9         3.0          5.1         1.8  Iris-virginica
-                                                         47 rows omitted
-
-julia> combine(gd, valuecols(gd) .=> mean)
+julia> combine(iris_gdf, valuecols(iris_gdf) .=> mean)
 3×5 DataFrame
  Row │ Species          SepalLength_mean  SepalWidth_mean  PetalLength_mean  P ⋯
      │ String15         Float64           Float64          Float64           F ⋯
@@ -431,6 +410,7 @@ grouping columns of its parent data frame must not be mutated, and
 rows must not be added nor removed from it. If the number or rows
 of the parent changes then an error is thrown when a child `GroupedDataFrame`
 is used:
+
 ```jldoctest sac
 julia> df = DataFrame(id=1:2)
 2×1 DataFrame
@@ -519,7 +499,7 @@ function. You can then iterate `SubDataFrame`s that constitute the identified
 groups:
 
 ```jldoctest sac
-julia> for subdf in groupby(iris, :Species)
+julia> for subdf in iris_gdf
            println(size(subdf, 1))
        end
 50
@@ -531,7 +511,7 @@ To also get the values of the grouping columns along with each group, use the
 `pairs` function:
 
 ```jldoctest sac
-julia> for (key, subdf) in pairs(groupby(iris, :Species))
+julia> for (key, subdf) in pairs(iris_gdf)
            println("Number of data points for $(key.Species): $(nrow(subdf))")
        end
 Number of data points for Iris-setosa: 50
@@ -539,92 +519,6 @@ Number of data points for Iris-versicolor: 50
 Number of data points for Iris-virginica: 50
 ```
 
-The value of `key` in the previous example is a [`DataFrames.GroupKey`](@ref) object,
-which can be used in a similar fashion to a `NamedTuple`.
-
-Grouping a data frame using the `groupby` function can be seen as adding a lookup key
-to it. Such lookups can be performed efficiently by indexing the resulting
-`GroupedDataFrame` with a `Tuple` or `NamedTuple`:
-```jldoctest sac
-julia> df = DataFrame(g=repeat(1:1000, inner=5), x=1:5000)
-5000×2 DataFrame
-  Row │ g      x
-      │ Int64  Int64
-──────┼──────────────
-    1 │     1      1
-    2 │     1      2
-    3 │     1      3
-    4 │     1      4
-    5 │     1      5
-    6 │     2      6
-    7 │     2      7
-    8 │     2      8
-  ⋮   │   ⋮      ⋮
- 4994 │   999   4994
- 4995 │   999   4995
- 4996 │  1000   4996
- 4997 │  1000   4997
- 4998 │  1000   4998
- 4999 │  1000   4999
- 5000 │  1000   5000
-    4985 rows omitted
-
-julia> gdf = groupby(df, :g)
-GroupedDataFrame with 1000 groups based on key: g
-First Group (5 rows): g = 1
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │     1      1
-   2 │     1      2
-   3 │     1      3
-   4 │     1      4
-   5 │     1      5
-⋮
-Last Group (5 rows): g = 1000
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │  1000   4996
-   2 │  1000   4997
-   3 │  1000   4998
-   4 │  1000   4999
-   5 │  1000   5000
-
-julia> gdf[(g=500,)]
-5×2 SubDataFrame
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │   500   2496
-   2 │   500   2497
-   3 │   500   2498
-   4 │   500   2499
-   5 │   500   2500
-
-julia> gdf[[(500,), (501,)]]
-GroupedDataFrame with 2 groups based on key: g
-First Group (5 rows): g = 500
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │   500   2496
-   2 │   500   2497
-   3 │   500   2498
-   4 │   500   2499
-   5 │   500   2500
-⋮
-Last Group (5 rows): g = 501
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │   501   2501
-   2 │   501   2502
-   3 │   501   2503
-   4 │   501   2504
-   5 │   501   2505
-```
-
 Note that although `GroupedDataFrame` is iterable and indexable it is not an
 `AbstractVector`. For this reason currently it was decided that it does not
 support `map` nor broadcasting (to allow for making a decision in the future
@@ -633,15 +527,13 @@ data frame and get a vector of results either use a comprehension or `collect`
 `GroupedDataFrame` into a vector first. Here are examples of both approaches:
 
 ```jldoctest sac
-julia> gd = groupby(iris, :Species);
-
-julia> [nrow(sdf) for sdf in gd]
+julia> [nrow(sdf) for sdf in iris_gdf]
 3-element Vector{Int64}:
  50
  50
  50
 
-julia> sdf_vec = collect(gd)
+julia> sdf_vec = collect(iris_gdf)
 3-element Vector{Any}:
  50×5 SubDataFrame
  Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species     
@@ -724,6 +616,121 @@ Note that using the split-apply-combine strategy with operation specification
 syntax in `combine`, `select` or `transform` will usually be faster than iterating
 a `GroupedDataFrame`.
 
+The value of `key` in the example above where we iterated `pairs(iris_gdf)`
+is a [`DataFrames.GroupKey`](@ref) object,
+which can be used in a similar fashion to a `NamedTuple`.
+
+Grouping a data frame using the `groupby` function can be seen as adding a
+lookup key to it. Such lookups can be performed efficiently by indexing the
+resulting `GroupedDataFrame` with a [`DataFrames.GroupKey`](@ref) (as
+presented above), a `Tuple`, a `NamedTuple`, or a dictionary. Here are some
+more examples of such indexing:
+
+```jldoctest sac
+julia> df = DataFrame(g=repeat(1:1000, inner=5), x=1:5000)
+5000×2 DataFrame
+  Row │ g      x
+      │ Int64  Int64
+──────┼──────────────
+    1 │     1      1
+    2 │     1      2
+    3 │     1      3
+    4 │     1      4
+    5 │     1      5
+    6 │     2      6
+    7 │     2      7
+    8 │     2      8
+  ⋮   │   ⋮      ⋮
+ 4994 │   999   4994
+ 4995 │   999   4995
+ 4996 │  1000   4996
+ 4997 │  1000   4997
+ 4998 │  1000   4998
+ 4999 │  1000   4999
+ 5000 │  1000   5000
+    4985 rows omitted
+
+julia> gd = groupby(df, :g)
+GroupedDataFrame with 1000 groups based on key: g
+First Group (5 rows): g = 1
+ Row │ g      x
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      1
+   2 │     1      2
+   3 │     1      3
+   4 │     1      4
+   5 │     1      5
+⋮
+Last Group (5 rows): g = 1000
+ Row │ g      x
+     │ Int64  Int64
+─────┼──────────────
+   1 │  1000   4996
+   2 │  1000   4997
+   3 │  1000   4998
+   4 │  1000   4999
+   5 │  1000   5000
+
+julia> gd[(g=500,)] # a NamedTuple
+5×2 SubDataFrame
+ Row │ g      x
+     │ Int64  Int64
+─────┼──────────────
+   1 │   500   2496
+   2 │   500   2497
+   3 │   500   2498
+   4 │   500   2499
+   5 │   500   2500
+
+julia> gd[[(500,), (501,)]] # a vector of Tuples
+GroupedDataFrame with 2 groups based on key: g
+First Group (5 rows): g = 500
+ Row │ g      x
+     │ Int64  Int64
+─────┼──────────────
+   1 │   500   2496
+   2 │   500   2497
+   3 │   500   2498
+   4 │   500   2499
+   5 │   500   2500
+⋮
+Last Group (5 rows): g = 501
+ Row │ g      x
+     │ Int64  Int64
+─────┼──────────────
+   1 │   501   2501
+   2 │   501   2502
+   3 │   501   2503
+   4 │   501   2504
+   5 │   501   2505
+
+julia> key = keys(gd) |> last # last key in gd
+GroupKey: (g = 1000,)
+
+julia> gd[key]
+5×2 SubDataFrame
+ Row │ g      x     
+     │ Int64  Int64
+─────┼──────────────
+   1 │  1000   4996
+   2 │  1000   4997
+   3 │  1000   4998
+   4 │  1000   4999
+   5 │  1000   5000
+
+julia> gd[Dict("g" => 1000)] # a dictionary
+5×2 SubDataFrame
+ Row │ g      x     
+     │ Int64  Int64
+─────┼──────────────
+   1 │  1000   4996
+   2 │  1000   4997
+   3 │  1000   4998
+   4 │  1000   4999
+   5 │  1000   5000
+```
+
 ## Simulating the SQL `where` clause
 
 You can conveniently work on subsets of a data frame by using `SubDataFrame`s.
@@ -895,7 +902,7 @@ supports the following context-dependent operations:
 * getting the number of rows in a group (`nrow`);
 * getting the proportion of rows in a group (`proprow`);
 * getting the group number (`groupindices`);
-* getting a vector of group indices (`eachindex`).
+* getting a vector of indices within groups (`eachindex`).
 
 These operations are context-dependent, because they do not require specifying the input column
 name in the operation specification syntax.
@@ -947,7 +954,7 @@ Group 3 (2 rows): customer_id = "c"
    2 │ c                        11       9
 ```
 
-## Getting the number of rows
+### Getting the number of rows
 
 You can get the number of rows per group in a `GroupedDataFrame` by just
 writing `nrow`, in which case the generated column name with the number of rows
@@ -1050,6 +1057,18 @@ julia> combine(gdf, groupindices)
    2 │ b                       2
    3 │ c                       3
 
+julia> transform(gdf, groupindices)
+6×4 DataFrame
+ Row │ customer_id  transaction_id  volume  groupindices 
+     │ String       Int64           Int64   Int64
+─────┼───────────────────────────────────────────────────
+   1 │ a                        12       2             1
+   2 │ b                        15       3             2
+   3 │ b                        19       1             2
+   4 │ b                        17       4             2
+   5 │ c                        13       5             3
+   6 │ c                        11       9             3
+
 julia> combine(gdf, groupindices => "group_number")
 3×2 DataFrame
  Row │ customer_id  group_number
@@ -1094,6 +1113,18 @@ julia> combine(gdf, eachindex)
    5 │ c                    1
    6 │ c                    2
 
+julia> select(gdf, eachindex, groupindices)
+6×3 DataFrame
+ Row │ customer_id  eachindex  groupindices 
+     │ String       Int64      Int64
+─────┼──────────────────────────────────────
+   1 │ a                    1             1
+   2 │ b                    1             2
+   3 │ b                    2             2
+   4 │ b                    3             2
+   5 │ c                    1             3
+   6 │ c                    2             3
+
 julia> combine(gdf, eachindex => "transaction_number")
 6×2 DataFrame
  Row │ customer_id  transaction_number
@@ -1183,24 +1214,6 @@ generating the `:nrow` column with number of rows per group. However, the
 `SubDataFrame` as its argument and returns its number of rows (the `:x1` column
 name is a default auto-generated column name in this case).
 
-To show you another example of passing a function consider the following case:
-
-```jldoctest sac
-julia> combine(gdf, :volume => sum, x -> sum(x.volume))
-3×3 DataFrame
- Row │ customer_id  volume_sum  x1
-     │ String       Int64       Int64
-─────┼────────────────────────────────
-   1 │ a                     2      2
-   2 │ b                     8      8
-   3 │ c                    14     14
-```
-
-Again, both `:volume_sum` and `:x1` columns hold the same data. The reason
-is that in `:volume => sum` we just apply the `sum` function to the `:volume`
-column, while in `x -> sum(x.volume`, `x` variable is a `SubDataFrame`
-representing the whole group.
-
 Passing a function taking a `SubDataFrame` is a flexible functionality allowing
 you to perform complex operations on your data. However, you should bear in mind
 two aspects:

From 1ace7439dc1a555a7f840eeb7a06fed5dde3a6cf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Wed, 23 Nov 2022 11:24:45 +0100
Subject: [PATCH 07/13] switch to 'column-independent operations'

---
 docs/src/man/split_apply_combine.md | 32 ++++++++++++++---------------
 src/abstractdataframe/selection.jl  |  4 ++--
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 0d9dcf95eb..8d0467ce18 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -69,7 +69,7 @@ each subset of the `DataFrame`. This specification can be of the following forms
    except `AsTable` are allowed).
 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
    must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
-5. context-dependent expressions `function => target_cols` or just `function`
+5. column-independent operations `function => target_cols` or just `function`
    for specific `function`s where the input columns are omitted;
    without `target_cols` the new column has the same name as `function`, otherwise
    it must be single name (as a `Symbol` or a string). Supported `function`s are:
@@ -894,17 +894,17 @@ julia> df
    6 │     3  missing      6
 ```
 
-## Context-dependent expressions
+## Column-independent operations
 
 The operation specification language used with `combine`, `select` and `transform`
-supports the following context-dependent operations:
+supports the following column-independent operations:
 
 * getting the number of rows in a group (`nrow`);
 * getting the proportion of rows in a group (`proprow`);
 * getting the group number (`groupindices`);
 * getting a vector of indices within groups (`eachindex`).
 
-These operations are context-dependent, because they do not require specifying the input column
+These operations are column-independent, because they do not require specifying the input column
 name in the operation specification syntax.
 
 These four exceptions to the standard operation specification syntax were
@@ -985,8 +985,8 @@ julia> combine(gdf, nrow => "transaction_count")
 ```
 
 Note that in both cases we did not pass source column name as it is not needed
-to determine the number of rows per group. This is the reason why context-dependent
-expressions are exceptions to standard operation specification syntax.
+to determine the number of rows per group. This is the reason why column-independent
+operations are exceptions to standard operation specification syntax.
 
 The `nrow` expression also works in the operation specification syntax
 applied to a data frame. Here is an example:
@@ -1015,8 +1015,8 @@ easier to remember this exception.
 ### Getting the proportion of rows
 
 If you want to get a proportion of rows per group in a `GroupedDataFrame`
-you can use the `proprow` and `proprow => [target column name]` context-dependent
-expressions. Here are some examples:
+you can use the `proprow` and `proprow => [target column name]` column-independent
+operations. Here are some examples:
 
 ```jldoctest sac
 julia> combine(gdf, proprow)
@@ -1044,7 +1044,7 @@ specification syntax and is only allowed when processing a `GroupedDataFrame`.
 ### Getting the group number
 
 Another common operation is getting group number. Use the `groupindices` and
-`groupindices => [target column name]` context-dependent expressions to get it:
+`groupindices => [target column name]` column-independent operations to get it:
 
 
 ```jldoctest sac
@@ -1096,7 +1096,7 @@ julia> groupindices(gdf)
 
 ### Getting a vector of indices within groups
 
-The last context-dependent expression supported by the operation
+The last column-independent operation supported by the operation
 specification syntax is getting the index of each row within each group:
 
 
@@ -1188,13 +1188,13 @@ julia> combine(gdf, eachindex, :customer_id => eachindex)
 ```
 
 
-## Context-dependent expressions versus functions
+## Column-independent operations versus functions
 
-When discussing context dependent expressions it is important to remember
+When discussing column-independent operations it is important to remember
 that operation specification syntax allows you to pass a function (without
 source and target column names), in which case such a function gets passed a
 `SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an
-example:
+example comparing column-independent operation and a function:
 
 ```jldoctest sac
 julia> combine(gdf, nrow, x -> nrow(x))
@@ -1208,7 +1208,7 @@ julia> combine(gdf, nrow, x -> nrow(x))
 ```
 
 Notice that columns `:nrow` and `:x1` have an identical contents. This is
-expected. We already know that `nrow` is a context dependent expression
+expected. We already know that `nrow` is a column-independent operation
 generating the `:nrow` column with number of rows per group. However, the
 `x -> nrow(x)` anonymous function does exactly the same as it gets a
 `SubDataFrame` as its argument and returns its number of rows (the `:x1` column
@@ -1224,8 +1224,8 @@ two aspects:
   comparison to just passing a function taking a `SubDataFrame`.
 * Although writing `row`, `proprow`, `groupindices`, and `eachindex` looks like
   just passing a function they **do not** take a `SubDataFrame` as their
-  argument. As we explained in this section, they are special context-dependent
-  expressions that are exceptions to the standard operation specification syntax
+  argument. As we explained in this section, they are special column-independent
+  operations that are exceptions to the standard operation specification syntax
   rules. They were added for user convenience (and at the same time they are
   optimized to be fast).
 
diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl
index 6cd4d4787d..bb1f8a070a 100644
--- a/src/abstractdataframe/selection.jl
+++ b/src/abstractdataframe/selection.jl
@@ -75,7 +75,7 @@ const TRANSFORMATION_COMMON_RULES =
        except `AsTable` are allowed).
     4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
        must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
-    5. context-dependent expressions `function => target_cols` or just `function`
+    5. column-independent operations `function => target_cols` or just `function`
        for specific `function`s where the input columns are omitted;
        without `target_cols` the new column has the same name as `function`, otherwise
        it must be single name (as a `Symbol` or a string). Supported `function`s are:
@@ -1267,7 +1267,7 @@ julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
    8 │     2      1      8      9
 ```
 
-# context-dependent expressions
+# column-independent operations
 ```jldoctest
 julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
                       b=repeat([2, 1], outer=[4]),

From 010f68ef3fc9d40d083e9d36537144b422db30f0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Tue, 29 Nov 2022 17:21:39 +0100
Subject: [PATCH 08/13] Apply suggestions from code review

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
---
 docs/src/man/split_apply_combine.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 8d0467ce18..ed8b605f98 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -1194,7 +1194,7 @@ When discussing column-independent operations it is important to remember
 that operation specification syntax allows you to pass a function (without
 source and target column names), in which case such a function gets passed a
 `SubDataFrame` that represents a group in a `GroupedDataFrame`. Here is an
-example comparing column-independent operation and a function:
+example comparing a column-independent operation and a function:
 
 ```jldoctest sac
 julia> combine(gdf, nrow, x -> nrow(x))
@@ -1207,7 +1207,7 @@ julia> combine(gdf, nrow, x -> nrow(x))
    3 │ c                2      2
 ```
 
-Notice that columns `:nrow` and `:x1` have an identical contents. This is
+Notice that columns `:nrow` and `:x1` have identical contents. This is
 expected. We already know that `nrow` is a column-independent operation
 generating the `:nrow` column with number of rows per group. However, the
 `x -> nrow(x)` anonymous function does exactly the same as it gets a

From 20a0f90bebf178caefe56862ef14e91bb552e15d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Tue, 29 Nov 2022 17:49:40 +0100
Subject: [PATCH 09/13] improve explanations

---
 docs/src/man/split_apply_combine.md | 63 +++++++++++++++++++++++------
 1 file changed, 50 insertions(+), 13 deletions(-)
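
Reviewer note (not part of the patch): the distinction this commit documents
can be sketched as follows. This is a minimal illustration, assuming a
DataFrames.jl version that supports column-independent operations; the toy
data and the variable names `counts` and `idx` are hypothetical.

```julia
using DataFrames

# Hypothetical toy data mirroring the customer_id grouping used in the manual.
df = DataFrame(customer_id=["a", "b", "b", "b", "c", "c"])
gdf = groupby(df, :customer_id)

# A column-independent operation, a function taking a SubDataFrame,
# and the same operation with an explicit target column name.
counts = combine(gdf, nrow, sdf -> nrow(sdf), nrow => :n)

# `eachindex` as a column-independent operation versus an equivalent function.
idx = combine(gdf, eachindex, sdf -> axes(sdf, 1))
```

The three count columns hold the same values and differ only in their names
(`nrow`, the auto-generated `x1`, and `n`), which is the point the revised
text makes about `nrow`.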

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index ed8b605f98..7c3e1a05e0 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -1197,7 +1197,41 @@ source and target column names), in which case such a function gets passed a
 example comparing a column-independent operation and a function:
 
 ```jldoctest sac
-julia> combine(gdf, nrow, x -> nrow(x))
+julia> combine(gdf, eachindex, sdf -> axes(sdf, 1))
+6×3 DataFrame
+ Row │ customer_id  eachindex  x1    
+     │ String       Int64      Int64 
+─────┼───────────────────────────────
+   1 │ a                    1      1
+   2 │ b                    1      1
+   3 │ b                    2      2
+   4 │ b                    3      3
+   5 │ c                    1      1
+   6 │ c                    2      2
+```
+
+Notice that column independent operation `eachindex` produces the same result
+as using anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame`
+as its first argument and returns indices along its first axes.
+Importantly without special definition of column-independent operation
+the `eachindex` function would fail when being passed as you can see here:
+
+```jldoctest sac
+julia> combine(gdf, eachindex, sdf -> eachindex(sdf))
+ERROR: MethodError: no method matching keys(::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}})
+```
+
+The reason for this error is that `eachindex` function does not allow passing a
+`SubDataFrame` as its argument.
+
+The same situation is with `proprow` and `groupindices`. They would not work
+with a `SubDataFrame` as stand-alone functions.
+
+A bit different case is with `nrow` column-independent operation. In this case
+the `nrow` function accepts `SubDataFrame` as an argument:
+
+```jldoctest sac
+julia> combine(gdf, nrow, sdf -> nrow(sdf))
 3×3 DataFrame
  Row │ customer_id  nrow   x1
      │ String       Int64  Int64
@@ -1207,12 +1241,13 @@ julia> combine(gdf, nrow, x -> nrow(x))
    3 │ c                2      2
 ```
 
-Notice that columns `:nrow` and `:x1` have identical contents. This is
-expected. We already know that `nrow` is a column-independent operation
-generating the `:nrow` column with number of rows per group. However, the
-`x -> nrow(x)` anonymous function does exactly the same as it gets a
-`SubDataFrame` as its argument and returns its number of rows (the `:x1` column
-name is a default auto-generated column name in this case).
+Notice that columns `:nrow` and `:x1` have identical contents, but the
+difference is that they do not have the same names. `nrow` is a
+column-independent operation generating the `:nrow` column name by default with
+number of rows per group. On the other hand, the `sdf -> nrow(sdf)` anonymous
+function does gets a `SubDataFrame` as its argument and returns its number of
+rows. The `:x1` column name is a default auto-generated column name when
+processing anonymous functions.
 
 Passing a function taking a `SubDataFrame` is a flexible functionality allowing
 you to perform complex operations on your data. However, you should bear in mind
@@ -1222,10 +1257,12 @@ two aspects:
   names are passed) will lead to faster execution of your code (as the Julia
   compiler is able to better optimize execution of such operations) in
   comparison to just passing a function taking a `SubDataFrame`.
-* Although writing `row`, `proprow`, `groupindices`, and `eachindex` looks like
-  just passing a function they **do not** take a `SubDataFrame` as their
-  argument. As we explained in this section, they are special column-independent
-  operations that are exceptions to the standard operation specification syntax
-  rules. They were added for user convenience (and at the same time they are
-  optimized to be fast).
+* Although writing `nrow`, `proprow`, `groupindices`, and `eachindex` looks
+  like just passing a function they internally **do not** take a `SubDataFrame`
+  as their argument. As we explained in this section, `proprow`,
+  `groupindices`, and `eachindex` would not work with `SubDataFrame` as their
+  argument, and `nrow` would work, but would prouce a different column name.
+  Instead, these four operations are special column-independent operations that
+  are exceptions to the standard operation specification syntax rules. They
+  were added for user convenience.
 

From 70e385ab94374a20ade3bd0e46ebf65e2cd12809 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Wed, 30 Nov 2022 08:42:17 +0100
Subject: [PATCH 10/13] Apply suggestions from code review

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
---
 docs/src/man/split_apply_combine.md | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 7c3e1a05e0..71508f3413 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -1210,24 +1210,24 @@ julia> combine(gdf, eachindex, sdf -> axes(sdf, 1))
    6 │ c                    2      2
 ```
 
-Notice that column independent operation `eachindex` produces the same result
-as using anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame`
+Notice that the column independent operation `eachindex` produces the same result
+as using the anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame`
 as its first argument and returns indices along its first axes.
-Importantly without special definition of column-independent operation
+Importantly if it wasn't defined as a column-independent operation
 the `eachindex` function would fail when being passed as you can see here:
 
 ```jldoctest sac
-julia> combine(gdf, eachindex, sdf -> eachindex(sdf))
+julia> combine(gdf, sdf -> eachindex(sdf))
 ERROR: MethodError: no method matching keys(::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}})
 ```
 
-The reason for this error is that `eachindex` function does not allow passing a
+The reason for this error is that the `eachindex` function does not allow passing a
 `SubDataFrame` as its argument.
 
-The same situation is with `proprow` and `groupindices`. They would not work
+The same applies to `proprow` and `groupindices`: they would not work
 with a `SubDataFrame` as stand-alone functions.
 
-A bit different case is with `nrow` column-independent operation. In this case
+The `nrow` column-independent operation is a different case, as
 the `nrow` function accepts `SubDataFrame` as an argument:
 
 ```jldoctest sac
@@ -1246,7 +1246,7 @@ difference is that they do not have the same names. `nrow` is a
 column-independent operation generating the `:nrow` column name by default with
 number of rows per group. On the other hand, the `sdf -> nrow(sdf)` anonymous
 function does gets a `SubDataFrame` as its argument and returns its number of
-rows. The `:x1` column name is a default auto-generated column name when
+rows. The `:x1` column name is the default auto-generated column name when
 processing anonymous functions.
 
 Passing a function taking a `SubDataFrame` is a flexible functionality allowing
@@ -1254,14 +1254,15 @@ you to perform complex operations on your data. However, you should bear in mind
 two aspects:
 
 * Using the full operation specification syntax (where source and target column
-  names are passed) will lead to faster execution of your code (as the Julia
-  compiler is able to better optimize execution of such operations) in
-  comparison to just passing a function taking a `SubDataFrame`.
+  names are passed) or column-independent operations will lead to faster
+  execution of your code (as the Julia compiler is able to better optimize
+  execution of such operations) in comparison to passing a function
+  taking a `SubDataFrame`.
 * Although writing `nrow`, `proprow`, `groupindices`, and `eachindex` looks
   like just passing a function they internally **do not** take a `SubDataFrame`
   as their argument. As we explained in this section, `proprow`,
   `groupindices`, and `eachindex` would not work with `SubDataFrame` as their
-  argument, and `nrow` would work, but would prouce a different column name.
+  argument, and `nrow` would work, but would produce a different column name.
   Instead, these four operations are special column-independent operations that
   are exceptions to the standard operation specification syntax rules. They
   were added for user convenience.

From 8eab2ed6eca1fe6f4f5c093546d1cbf871861fb4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Wed, 30 Nov 2022 09:19:14 +0100
Subject: [PATCH 11/13] update iteration and indexing examples

---
 docs/src/man/split_apply_combine.md | 231 +++++++++++++---------------
 1 file changed, 111 insertions(+), 120 deletions(-)
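
Reviewer note (not part of the patch): a compact sketch of the lookup and
iteration patterns that the moved examples cover, on a small data frame
instead of the iris data. The data and all variable names are hypothetical,
chosen only for illustration.

```julia
using DataFrames

# Hypothetical toy data: three groups of two rows each.
df = DataFrame(g=repeat(1:3, inner=2), x=1:6)
gd = groupby(df, :g)

# Four ways to look up the group with g == 2.
by_namedtuple = gd[(g=2,)]
by_tuple = gd[(2,)]
by_dict = gd[Dict("g" => 2)]
by_key = gd[keys(gd)[2]]  # a GroupKey taken from the grouped data frame

# Iterating gives a vector; the operation specification syntax gives a data frame.
counts_vec = [nrow(sdf) for sdf in gd]
counts_df = combine(gd, nrow)
```

All four lookups return the same `SubDataFrame`, and `counts_df.nrow` holds
the same values as `counts_vec`.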

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 71508f3413..50a98841b6 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -519,6 +519,94 @@ Number of data points for Iris-versicolor: 50
 Number of data points for Iris-virginica: 50
 ```
 
+The value of `key` in the example above where we iterated `pairs(iris_gdf)` is
+a [`DataFrames.GroupKey`](@ref) object, which can be used in a similar fashion
+to a `NamedTuple`.
+
+Grouping a data frame using the `groupby` function can be seen as adding a
+lookup key to it. Such lookups can be performed efficiently by indexing the
+resulting `GroupedDataFrame` with [`DataFrames.GroupKey`](@ref) (as it was
+presented above), a `Tuple`, a `NamedTuple`, or a dictionary. Here are some
+more examples of such indexing.
+
+```jldoctest sac
+julia> iris_gdf[(Species="Iris-virginica",)]  # a NamedTuple
+50×5 SubDataFrame
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼──────────────────────────────────────────────────────────────────
+   1 │         6.3         3.3          6.0         2.5  Iris-virginica
+   2 │         5.8         2.7          5.1         1.9  Iris-virginica
+   3 │         7.1         3.0          5.9         2.1  Iris-virginica
+  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  48 │         6.5         3.0          5.2         2.0  Iris-virginica
+  49 │         6.2         3.4          5.4         2.3  Iris-virginica
+  50 │         5.9         3.0          5.1         1.8  Iris-virginica
+                                                         44 rows omitted
+
+julia> iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]] # a vector of Tuples
+GroupedDataFrame with 2 groups based on key: Species
+First Group (50 rows): Species = "Iris-virginica"
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼──────────────────────────────────────────────────────────────────
+   1 │         6.3         3.3          6.0         2.5  Iris-virginica
+  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  50 │         5.9         3.0          5.1         1.8  Iris-virginica
+                                                         48 rows omitted
+⋮
+Last Group (50 rows): Species = "Iris-setosa"
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼───────────────────────────────────────────────────────────────
+   1 │         5.1         3.5          1.4         0.2  Iris-setosa
+  ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
+  50 │         5.0         3.3          1.4         0.2  Iris-setosa
+                                                      48 rows omitted
+
+julia> key = keys(iris_gdf) |> last # last key in iris_gdf
+GroupKey: (Species = String15("Iris-virginica"),)
+
+julia> iris_gdf[key]
+50×5 SubDataFrame
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼──────────────────────────────────────────────────────────────────
+   1 │         6.3         3.3          6.0         2.5  Iris-virginica
+   2 │         5.8         2.7          5.1         1.9  Iris-virginica
+   3 │         7.1         3.0          5.9         2.1  Iris-virginica
+   4 │         6.3         2.9          5.6         1.8  Iris-virginica
+   5 │         6.5         3.0          5.8         2.2  Iris-virginica
+   6 │         7.6         3.0          6.6         2.1  Iris-virginica
+  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  45 │         6.7         3.3          5.7         2.5  Iris-virginica
+  46 │         6.7         3.0          5.2         2.3  Iris-virginica
+  47 │         6.3         2.5          5.0         1.9  Iris-virginica
+  48 │         6.5         3.0          5.2         2.0  Iris-virginica
+  49 │         6.2         3.4          5.4         2.3  Iris-virginica
+  50 │         5.9         3.0          5.1         1.8  Iris-virginica
+                                                         38 rows omitted
+julia> iris_gdf[Dict("Species" => "Iris-setosa")] # a dictionary
+50×5 SubDataFrame
+ Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
+     │ Float64      Float64     Float64      Float64     String15
+─────┼───────────────────────────────────────────────────────────────
+   1 │         5.1         3.5          1.4         0.2  Iris-setosa
+   2 │         4.9         3.0          1.4         0.2  Iris-setosa
+   3 │         4.7         3.2          1.3         0.2  Iris-setosa
+   4 │         4.6         3.1          1.5         0.2  Iris-setosa
+   5 │         5.0         3.6          1.4         0.2  Iris-setosa
+   6 │         5.4         3.9          1.7         0.4  Iris-setosa
+  ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
+  45 │         5.1         3.8          1.9         0.4  Iris-setosa
+  46 │         4.8         3.0          1.4         0.3  Iris-setosa
+  47 │         5.1         3.8          1.6         0.2  Iris-setosa
+  48 │         4.6         3.2          1.4         0.2  Iris-setosa
+  49 │         5.3         3.7          1.5         0.2  Iris-setosa
+  50 │         5.0         3.3          1.4         0.2  Iris-setosa
+                                                      38 rows omitted
+```
+
 Note that although `GroupedDataFrame` is iterable and indexable it is not an
 `AbstractVector`. For this reason currently it was decided that it does not
 support `map` nor broadcasting (to allow for making a decision in the future
@@ -527,12 +615,6 @@ data frame and get a vector of results either use a comprehension or `collect`
 `GroupedDataFrame` into a vector first. Here are examples of both approaches:
 
 ```jldoctest sac
-julia> [nrow(sdf) for sdf in iris_gdf]
-3-element Vector{Int64}:
- 50
- 50
- 50
-
 julia> sdf_vec = collect(iris_gdf)
 3-element Vector{Any}:
  50×5 SubDataFrame
@@ -612,123 +694,32 @@ julia> nrow.(sdf_vec)
  50
 ```
 
-Note that using the split-apply-combine strategy with operation specification
-syntax in `combine`, `select` or `transform` will usually be faster than iterating
-a `GroupedDataFrame`.
+Since `GroupedDataFrame` is iterable, you can achieve the same result with a
+comprehension:
 
-The value of `key` in the example above where we iterated `pairs(iris_gdf)`
-is a [`DataFrames.GroupKey`](@ref) object,
-which can be used in a similar fashion to a `NamedTuple`.
+```jldoctest sac
+julia> [nrow(sdf) for sdf in iris_gdf]
+3-element Vector{Int64}:
+ 50
+ 50
+ 50
+```
 
-Grouping a data frame using the `groupby` function can be seen as adding a
-lookup key to it. Such lookups can be performed efficiently by indexing the
-resulting `GroupedDataFrame` with [`DataFrames.GroupKey`](@ref) (as it was
-presented aboce) a `Tuple`, a `NamedTuple`, or a dictionary. Here are some
-more examples of such indexing.
+Note that using the split-apply-combine strategy with operation specification
+syntax in `combine`, `select` or `transform` will usually be faster for large
+`GroupedDataFrame` object than iterating it, with the difference that they
+produce a data frame. For the above examples an operation corresponding
+to the examples above is:
 
-```jldoctest sac
-julia> df = DataFrame(g=repeat(1:1000, inner=5), x=1:5000)
-5000×2 DataFrame
-  Row │ g      x
-      │ Int64  Int64
-──────┼──────────────
-    1 │     1      1
-    2 │     1      2
-    3 │     1      3
-    4 │     1      4
-    5 │     1      5
-    6 │     2      6
-    7 │     2      7
-    8 │     2      8
-  ⋮   │   ⋮      ⋮
- 4994 │   999   4994
- 4995 │   999   4995
- 4996 │  1000   4996
- 4997 │  1000   4997
- 4998 │  1000   4998
- 4999 │  1000   4999
- 5000 │  1000   5000
-    4985 rows omitted
-
-julia> gd = groupby(df, :g)
-GroupedDataFrame with 1000 groups based on key: g
-First Group (5 rows): g = 1
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │     1      1
-   2 │     1      2
-   3 │     1      3
-   4 │     1      4
-   5 │     1      5
-⋮
-Last Group (5 rows): g = 1000
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │  1000   4996
-   2 │  1000   4997
-   3 │  1000   4998
-   4 │  1000   4999
-   5 │  1000   5000
-
-julia> gd[(g=500,)] # a NamedTuple
-5×2 SubDataFrame
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │   500   2496
-   2 │   500   2497
-   3 │   500   2498
-   4 │   500   2499
-   5 │   500   2500
-
-julia> gd[[(500,), (501,)]] # a vector of Tuples
-GroupedDataFrame with 2 groups based on key: g
-First Group (5 rows): g = 500
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │   500   2496
-   2 │   500   2497
-   3 │   500   2498
-   4 │   500   2499
-   5 │   500   2500
-⋮
-Last Group (5 rows): g = 501
- Row │ g      x
-     │ Int64  Int64
-─────┼──────────────
-   1 │   501   2501
-   2 │   501   2502
-   3 │   501   2503
-   4 │   501   2504
-   5 │   501   2505
-
-julia> key = keys(gd) |> last # first key in gd
-GroupKey: (g = 1000,)
-
-julia> gd[key]
-5×2 SubDataFrame
- Row │ g      x     
-     │ Int64  Int64
-─────┼──────────────
-   1 │  1000   4996
-   2 │  1000   4997
-   3 │  1000   4998
-   4 │  1000   4999
-   5 │  1000   5000
-
-julia> gd[Dict("g" => 1000)] # a dictionary
-5×2 SubDataFrame
- Row │ g      x     
-     │ Int64  Int64
-─────┼──────────────
-   1 │  1000   4996
-   2 │  1000   4997
-   3 │  1000   4998
-   4 │  1000   4999
-   5 │  1000   5000
+```
+julia> combine(iris_gdf, nrow)
+3×2 DataFrame
+ Row │ Species          nrow  
+     │ String15         Int64
+─────┼────────────────────────
+   1 │ Iris-setosa         50
+   2 │ Iris-versicolor     50
+   3 │ Iris-virginica      50
 ```
 
 ## Simulating the SQL `where` clause

From 2a1009dbb4b94b04eadb3e69cabc34d0546660ab Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Wed, 30 Nov 2022 10:59:41 +0100
Subject: [PATCH 12/13] fix docs output

---
 docs/src/man/split_apply_combine.md | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 50a98841b6..50b7af151f 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -538,11 +538,20 @@ julia> iris_gdf[(Species="Iris-virginica",)]  # a NamedTuple
    1 │         6.3         3.3          6.0         2.5  Iris-virginica
    2 │         5.8         2.7          5.1         1.9  Iris-virginica
    3 │         7.1         3.0          5.9         2.1  Iris-virginica
+   4 │         6.3         2.9          5.6         1.8  Iris-virginica
+   5 │         6.5         3.0          5.8         2.2  Iris-virginica
+   6 │         7.6         3.0          6.6         2.1  Iris-virginica
+   7 │         4.9         2.5          4.5         1.7  Iris-virginica
+   8 │         7.3         2.9          6.3         1.8  Iris-virginica
   ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  44 │         6.8         3.2          5.9         2.3  Iris-virginica
+  45 │         6.7         3.3          5.7         2.5  Iris-virginica
+  46 │         6.7         3.0          5.2         2.3  Iris-virginica
+  47 │         6.3         2.5          5.0         1.9  Iris-virginica
   48 │         6.5         3.0          5.2         2.0  Iris-virginica
   49 │         6.2         3.4          5.4         2.3  Iris-virginica
   50 │         5.9         3.0          5.1         1.8  Iris-virginica
-                                                         44 rows omitted
+                                                         35 rows omitted
 
 julia> iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]] # a vector of Tuples
 GroupedDataFrame with 2 groups based on key: Species
@@ -551,18 +560,21 @@ First Group (50 rows): Species = "Iris-virginica"
      │ Float64      Float64     Float64      Float64     String15
 ─────┼──────────────────────────────────────────────────────────────────
    1 │         6.3         3.3          6.0         2.5  Iris-virginica
+   2 │         5.8         2.7          5.1         1.9  Iris-virginica
   ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  49 │         6.2         3.4          5.4         2.3  Iris-virginica
   50 │         5.9         3.0          5.1         1.8  Iris-virginica
-                                                         48 rows omitted
+                                                         46 rows omitted
 ⋮
 Last Group (50 rows): Species = "Iris-setosa"
  Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
      │ Float64      Float64     Float64      Float64     String15
 ─────┼───────────────────────────────────────────────────────────────
    1 │         5.1         3.5          1.4         0.2  Iris-setosa
+   2 │         4.9         3.0          1.4         0.2  Iris-setosa
   ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
   50 │         5.0         3.3          1.4         0.2  Iris-setosa
-                                                      48 rows omitted
+                                                      47 rows omitted
 
 julia> key = keys(iris_gdf) |> last # last key in iris_gdf
 GroupKey: (Species = String15("Iris-virginica"),)
@@ -578,14 +590,18 @@ julia> iris_gdf[key]
    4 │         6.3         2.9          5.6         1.8  Iris-virginica
    5 │         6.5         3.0          5.8         2.2  Iris-virginica
    6 │         7.6         3.0          6.6         2.1  Iris-virginica
+   7 │         4.9         2.5          4.5         1.7  Iris-virginica
+   8 │         7.3         2.9          6.3         1.8  Iris-virginica
   ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
+  44 │         6.8         3.2          5.9         2.3  Iris-virginica
   45 │         6.7         3.3          5.7         2.5  Iris-virginica
   46 │         6.7         3.0          5.2         2.3  Iris-virginica
   47 │         6.3         2.5          5.0         1.9  Iris-virginica
   48 │         6.5         3.0          5.2         2.0  Iris-virginica
   49 │         6.2         3.4          5.4         2.3  Iris-virginica
   50 │         5.9         3.0          5.1         1.8  Iris-virginica
-                                                         38 rows omitted
+                                                         35 rows omitted
+
 julia> iris_gdf[Dict("Species" => "Iris-setosa")] # a dictionary
 50×5 SubDataFrame
  Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
@@ -597,14 +613,17 @@ julia> iris_gdf[Dict("Species" => "Iris-setosa")] # a dictionary
    4 │         4.6         3.1          1.5         0.2  Iris-setosa
    5 │         5.0         3.6          1.4         0.2  Iris-setosa
    6 │         5.4         3.9          1.7         0.4  Iris-setosa
+   7 │         4.6         3.4          1.4         0.3  Iris-setosa
+   8 │         5.0         3.4          1.5         0.2  Iris-setosa
   ⋮  │      ⋮           ⋮            ⋮           ⋮            ⋮
+  44 │         5.0         3.5          1.6         0.6  Iris-setosa
   45 │         5.1         3.8          1.9         0.4  Iris-setosa
   46 │         4.8         3.0          1.4         0.3  Iris-setosa
   47 │         5.1         3.8          1.6         0.2  Iris-setosa
   48 │         4.6         3.2          1.4         0.2  Iris-setosa
   49 │         5.3         3.7          1.5         0.2  Iris-setosa
   50 │         5.0         3.3          1.4         0.2  Iris-setosa
-                                                      38 rows omitted
+                                                      35 rows omitted
 ```
 
 Note that although `GroupedDataFrame` is iterable and indexable it is not an

From 3fc1ddd9941db3fc43cc499136b0fdf3e48e1c60 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Thu, 1 Dec 2022 08:40:20 +0100
Subject: [PATCH 13/13] Apply suggestions from code review

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
---
 docs/src/man/split_apply_combine.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
index 50b7af151f..8961744a89 100644
--- a/docs/src/man/split_apply_combine.md
+++ b/docs/src/man/split_apply_combine.md
@@ -724,11 +724,10 @@ julia> [nrow(sdf) for sdf in iris_gdf]
  50
 ```
 
-Note that using the split-apply-combine strategy with operation specification
+Note that using the split-apply-combine strategy with the operation specification
 syntax in `combine`, `select` or `transform` will usually be faster for large
-`GroupedDataFrame` object than iterating it, with the difference that they
-produce a data frame. For the above examples an operation corresponding
-to the examples above is:
+`GroupedDataFrame` objects than iterating them, with the difference that they
+produce a data frame. An operation corresponding to the example above is:
 
 ```
 julia> combine(iris_gdf, nrow)
@@ -1220,7 +1219,7 @@ julia> combine(gdf, eachindex, sdf -> axes(sdf, 1))
    6 │ c                    2      2
 ```
 
-Notice that the column independent operation `eachindex` produces the same result
+Notice that the column-independent operation `eachindex` produces the same result
 as using the anonymous function `sdf -> axes(sdf, 1)` that takes a `SubDataFrame`
 as its first argument and returns indices along its first axes.
 Importantly if it wasn't defined as a column-independent operation