Skip to content

Commit

Permalink
Fix and clarify notes on result ordering (#13255)
Browse files Browse the repository at this point in the history
I noticed when answering #13254 that the code example in this section of our documentation was incorrect and the text itself could use some improving.

Authors:
  - Ashwin Srinath (https://github.com/shwina)
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13255
  • Loading branch information
shwina authored Apr 12, 2024
1 parent 8506ea6 commit ff22a7a
Showing 1 changed file with 19 additions and 8 deletions.
27 changes: 19 additions & 8 deletions docs/cudf/source/user_guide/pandas-comparison.md
Original file line number Diff line number Diff line change
Expand Up @@ -87,9 +87,17 @@ using `.from_arrow()` or `.from_pandas()`.

## Result ordering

By default, `join` (or `merge`), `value_counts` and `groupby` operations in cuDF
do *not* guarantee output ordering.
Compare the results obtained from Pandas and cuDF below:
In Pandas, `join` (or `merge`), `value_counts` and `groupby` operations provide
certain guarantees about the order of rows in the result returned. In a Pandas
`join`, the order of join keys is (depending on the particular style of join
being performed) either preserved or sorted lexicographically by default.
`groupby` sorts the group keys, and preserves the order of rows within each
group. In some cases, disabling this option in Pandas can yield better
performance.

By contrast, cuDF's default behavior is to return rows in a
non-deterministic order to maximize performance. Compare the results
obtained from Pandas and cuDF below:

```{code} python
>>> import cupy as cp
Expand All @@ -114,13 +122,16 @@ a
4 342.000000
```

To match Pandas behavior, you must explicitly pass `sort=True`
or enable the `mode.pandas_compatible` option when trying to
match Pandas behavior with `sort=False`:
In most cases, the rows of a DataFrame are accessed by index labels
rather than by position, so the order in which rows are returned
doesn't matter. However, if you require that results be returned in a
predictable (sorted) order, you can pass the `sort=True` option
explicitly or enable the `mode.pandas_compatible` option when trying
to match Pandas behavior with `sort=False`:

```{code} python
>>> df.to_pandas().groupby("a", sort=True).mean().head()
b
>>> df.groupby("a", sort=True).mean().head()
b
a
0 70.000000
1 356.333333
Expand Down

0 comments on commit ff22a7a

Please sign in to comment.