Commit

Working on documentation for window functions
timsaucer committed Aug 23, 2024
1 parent c77af1b commit 35547c8
Showing 3 changed files with 111 additions and 88 deletions.
2 changes: 2 additions & 0 deletions docs/source/user-guide/common-operations/aggregations.rst
@@ -15,6 +15,8 @@
.. specific language governing permissions and limitations
.. under the License.
.. _aggregation:

Aggregation
============

89 changes: 60 additions & 29 deletions docs/source/user-guide/common-operations/windows.rst
@@ -43,55 +43,86 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
ctx = SessionContext()
df = ctx.read_csv("pokemon.csv")
Here is an example that shows how to compare each pokemon's attack power with the average attack
power in its ``"Type 1"``
Here is an example that shows how you can compare each pokemon's speed to the speed of the
previous row in the DataFrame.

.. ipython:: python
df.select(
col('"Name"'),
col('"Attack"'),
#f.alias(
# f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
# "Average Attack",
#)
col('"Speed"'),
f.lag(col('"Speed"')).alias("Previous Speed")
)
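To make the behaviour of ``lag`` concrete, here is a minimal pure-Python sketch of its semantics. It is not the DataFusion implementation. The helper name ``lag_values`` and the sample speeds are hypothetical, and the sketch assumes an offset of one row by default and a missing value (``None``) for rows that have no predecessor.

```python
from typing import List, Optional, Sequence


def lag_values(values: Sequence[int], offset: int = 1) -> List[Optional[int]]:
    """Return, for each row, the value `offset` rows earlier, or None if there is none."""
    return [
        values[i - offset] if i - offset >= 0 else None
        for i in range(len(values))
    ]


# Hypothetical "Speed" column, in DataFrame row order.
speeds = [45, 60, 80]
print(lag_values(speeds))  # [None, 45, 60]
```

The first row has no previous row, so its "Previous Speed" is empty; every other row sees the value directly above it.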
You can also control the order in which rows are processed by window functions by providing
Setting Parameters
------------------

You can control the order in which rows are processed by window functions by passing one or
more sort expressions to ``order_by``.

.. ipython:: python
df.select(
col('"Name"'),
col('"Attack"'),
#f.alias(
# f.window(
# "rank",
# [],
# partition_by=[col('"Type 1"')],
# order_by=[f.order_by(col('"Attack"'))],
# ),
# "rank",
#),
col('"Type 1"'),
f.rank()
.partition_by(col('"Type 1"'))
.order_by(col('"Attack"').sort(ascending=True))
.build()
.alias("rank"),
).sort(col('"Type 1"').sort(), col('"Attack"').sort())
Window functions can be configured using a builder approach to set a few parameters.
To create a builder, call any one of these functions:

- :py:func:`datafusion.expr.Expr.order_by` to set the window ordering.
- :py:func:`datafusion.expr.Expr.null_treatment` to set how ``null`` values should be handled.
- :py:func:`datafusion.expr.Expr.partition_by` to set the partitions for processing.
- :py:func:`datafusion.expr.Expr.window_frame` to set the boundaries of the window frame.

After these parameters are set, you must call ``build()`` on the resultant object to get an
expression as shown in the example above.

Aggregate Functions
-------------------

You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
aggregate functions must use the deprecated
:py:func:`datafusion.functions.window` API but this should be resolved in
DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
is an example that shows how to compare each pokemon's attack power with the average attack
power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.

.. ipython:: python
:okwarning:
df.select(
col('"Name"'),
col('"Attack"'),
col('"Type 1"'),
f.window("avg", [col('"Attack"')])
.partition_by(col('"Type 1"'))
.build()
.alias("Average Attack"),
)
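As a rough illustration of what a partitioned aggregate window computes, the following pure-Python sketch attaches the per-type average to every row without collapsing rows. The function name ``windowed_average`` and the sample rows are hypothetical. The sketch assumes no ``order_by`` is set, so the frame covers the whole partition.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def windowed_average(rows: List[Tuple[str, str, int]]) -> List[Tuple[str, int, float]]:
    """Return (name, attack, average attack of the row's type) for every input row."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # type -> [sum, count]
    for _name, type_1, attack in rows:
        totals[type_1][0] += attack
        totals[type_1][1] += 1
    # Unlike a GROUP BY aggregation, every input row is kept in the output.
    return [
        (name, attack, totals[type_1][0] / totals[type_1][1])
        for name, type_1, attack in rows
    ]


# Hypothetical rows: (Name, Type 1, Attack)
sample = [("A", "Grass", 40), ("B", "Grass", 60), ("C", "Fire", 50)]
print(windowed_average(sample))
```

Each row keeps its own ``Attack`` value and gains the average for its partition, which is what makes the row-by-row comparison possible.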
Available Functions
-------------------

The possible window functions are:

1. Rank Functions
- rank
- dense_rank
- row_number
- ntile
- :py:func:`datafusion.functions.rank`
- :py:func:`datafusion.functions.dense_rank`
- :py:func:`datafusion.functions.ntile`
- :py:func:`datafusion.functions.row_number`

2. Analytical Functions
- cume_dist
- percent_rank
- lag
- lead
- first_value
- last_value
- nth_value
- :py:func:`datafusion.functions.cume_dist`
- :py:func:`datafusion.functions.percent_rank`
- :py:func:`datafusion.functions.lag`
- :py:func:`datafusion.functions.lead`

3. Aggregate Functions
- All aggregate functions can be used as window functions.
- All :ref:`Aggregation Functions<aggregation>` can be used as window functions.
108 changes: 49 additions & 59 deletions python/datafusion/functions.py
@@ -259,6 +259,7 @@
"dense_rank",
"percent_rank",
"cume_dist",
"ntile",
]


@@ -1816,18 +1817,16 @@ def rank() -> Expr:
is an example of a dataframe with a window ordered by descending ``points`` and the
associated rank.
You should set ``order_by`` to produce meaningful results.
You should set ``order_by`` to produce meaningful results::
```
+--------+------+
| points | rank |
+--------+------+
| 100 | 1 |
| 100 | 1 |
| 50 | 3 |
| 25 | 4 |
+--------+------+
```
+--------+------+
| points | rank |
+--------+------+
| 100 | 1 |
| 100 | 1 |
| 50 | 3 |
| 25 | 4 |
+--------+------+
To set window function parameters use the window builder approach described in the
:ref:`window_functions` online documentation.
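The ``rank`` table above can be reproduced with a small pure-Python sketch. This is only an illustration of the semantics, not the library's code, and the helper name ``rank_desc`` is hypothetical. It assumes a single window ordered by descending ``points``.

```python
from typing import List, Sequence


def rank_desc(points: Sequence[int]) -> List[int]:
    """Rank by descending value: ties share a rank and the next rank skips ahead."""
    # A row's rank is one more than the number of rows that sort strictly before it.
    return [1 + sum(1 for other in points if other > p) for p in points]


print(rank_desc([100, 100, 50, 25]))  # [1, 1, 3, 4]
```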
@@ -1840,18 +1839,16 @@ def dense_rank() -> Expr:
This window function is similar to :py:func:`rank` except that the returned values
will be consecutive. Here is an example of a dataframe with a window ordered by
descending ``points`` and the associated dense rank.
descending ``points`` and the associated dense rank::
```
+--------+------------+
| points | dense_rank |
+--------+------------+
| 100 | 1 |
| 100 | 1 |
| 50 | 2 |
| 25 | 3 |
+--------+------------+
```
+--------+------------+
| points | dense_rank |
+--------+------------+
| 100 | 1 |
| 100 | 1 |
| 50 | 2 |
| 25 | 3 |
+--------+------------+
To set window function parameters use the window builder approach described in the
:ref:`window_functions` online documentation.
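The difference from ``rank`` is easiest to see in code. The sketch below is a pure-Python illustration of the dense rank semantics. The helper name ``dense_rank_desc`` is hypothetical, and the window is assumed to be ordered by descending ``points``.

```python
from typing import Dict, List, Sequence


def dense_rank_desc(points: Sequence[int]) -> List[int]:
    """Rank by descending value with no gaps: ties share a rank, the next rank is +1."""
    distinct = sorted(set(points), reverse=True)
    position: Dict[int, int] = {value: i + 1 for i, value in enumerate(distinct)}
    return [position[p] for p in points]


print(dense_rank_desc([100, 100, 50, 25]))  # [1, 1, 2, 3]
```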
@@ -1865,18 +1862,16 @@ def percent_rank() -> Expr:
This window function is similar to :py:func:`rank` except that the returned values
are the percentage from 0.0 to 1.0 from first to last. Here is an example of a
dataframe with a window ordered by descending ``points`` and the associated percent
rank.
rank::
```
+--------+--------------+
| points | percent_rank |
+--------+--------------+
| 100 | 0.0 |
| 100 | 0.0 |
| 50 | 0.666667 |
| 25 | 1.0 |
+--------+--------------+
```
+--------+--------------+
| points | percent_rank |
+--------+--------------+
| 100 | 0.0 |
| 100 | 0.0 |
| 50 | 0.666667 |
| 25 | 1.0 |
+--------+--------------+
To set window function parameters use the window builder approach described in the
:ref:`window_functions` online documentation.
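The percent rank values in the table follow from ``(rank - 1) / (rows - 1)``. Here is a minimal pure-Python sketch of that formula. The helper name ``percent_rank_desc`` is hypothetical, and returning ``0.0`` for a single-row window is an assumption taken from the usual SQL convention.

```python
from typing import List, Sequence


def percent_rank_desc(points: Sequence[int]) -> List[float]:
    """Compute (rank - 1) / (row count - 1) for a window ordered by descending value."""
    n = len(points)
    if n <= 1:
        return [0.0] * n  # assumed convention for a single-row window
    # (rank - 1) equals the number of rows that sort strictly before the current row.
    return [sum(1 for other in points if other > p) / (n - 1) for p in points]


print([round(v, 6) for v in percent_rank_desc([100, 100, 50, 25])])
# [0.0, 0.0, 0.666667, 1.0]
```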
@@ -1890,18 +1885,16 @@ def cume_dist() -> Expr:
This window function is similar to :py:func:`rank` except that the returned values
are the ratio of the number of rows at or before the current row (ties included) to the
total number of rows. Here is an example of a
dataframe with a window ordered by descending ``points`` and the associated
cumulative distribution.
cumulative distribution::
```
+--------+-----------+
| points | cume_dist |
+--------+-----------+
| 100 | 0.5 |
| 100 | 0.5 |
| 50 | 0.75 |
| 25 | 1.0 |
+--------+-----------+
```
+--------+-----------+
| points | cume_dist |
+--------+-----------+
| 100 | 0.5 |
| 100 | 0.5 |
| 50 | 0.75 |
| 25 | 1.0 |
+--------+-----------+
To set window function parameters use the window builder approach described in the
:ref:`window_functions` online documentation.
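The cumulative distribution in the table counts, for each row, how many rows sort at or before it (ties included) and divides by the total. A pure-Python sketch of that calculation follows. The helper name ``cume_dist_desc`` is hypothetical, and the window is assumed to be ordered by descending ``points``.

```python
from typing import List, Sequence


def cume_dist_desc(points: Sequence[int]) -> List[float]:
    """Fraction of rows at or before each row when ordered by descending value."""
    n = len(points)
    # Tied rows count each other, so both 100s see two rows out of four.
    return [sum(1 for other in points if other >= p) / n for p in points]


print(cume_dist_desc([100, 100, 50, 25]))  # [0.5, 0.5, 0.75, 1.0]
```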
@@ -1915,23 +1908,20 @@ def ntile(groups: int) -> Expr:
This window function orders the window frame into a given number of groups based on
the ordering criteria. It then returns which group the current row is assigned to.
Here is an example of a dataframe with a window ordered by descending ``points``
and the associated n-tile function.
```
+--------+-------+
| points | ntile |
+--------+-------+
| 120 | 1 |
| 100 | 1 |
| 80 | 2 |
| 60 | 2 |
| 40 | 3 |
| 20 | 3 |
+--------+-------+
```
and the associated n-tile function::
+--------+-------+
| points | ntile |
+--------+-------+
| 120 | 1 |
| 100 | 1 |
| 80 | 2 |
| 60 | 2 |
| 40 | 3 |
| 20 | 3 |
+--------+-------+
To set window function parameters use the window builder approach described in the
:ref:`window_functions` online documentation.
"""
# Developer note: ntile only accepts literal values.
return Expr(f.ntile(Expr.literal(groups).expr))
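For intuition, the n-tile assignment shown in the docstring table can be sketched in pure Python. The helper name ``ntile_groups`` is hypothetical. The docstring only shows an evenly divisible case, so giving the extra rows to the earliest groups when the split is uneven is an assumption based on the common SQL convention, not something the source confirms for DataFusion.

```python
from typing import List


def ntile_groups(row_count: int, groups: int) -> List[int]:
    """Assign each of `row_count` ordered rows to one of `groups` buckets, numbered from 1."""
    if groups <= 0:
        raise ValueError("groups must be a positive integer")
    base, extra = divmod(row_count, groups)
    assignment: List[int] = []
    for group in range(1, groups + 1):
        # Assumption: when the split is uneven, earlier groups take one extra row.
        size = base + (1 if group <= extra else 0)
        assignment.extend([group] * size)
    return assignment


# Six rows ordered by descending points, split into three groups.
print(ntile_groups(6, 3))  # [1, 1, 2, 2, 3, 3]
```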
