Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default window frames to not match PostgreSQL #688

Closed
timsaucer opened this issue May 13, 2024 · 4 comments · Fixed by #871
Closed

Default window frames to not match PostgreSQL #688

timsaucer opened this issue May 13, 2024 · 4 comments · Fixed by #871
Labels
bug Something isn't working

Comments

@timsaucer
Copy link
Contributor

Describe the bug
When no window frame is specified in the python implementation, we default to unbounded preceeding to current row. If we are to follow PostgreSQL implementation then we should set this value when order_by is specified and otherwise default to unbounded preceeding to unbounded following.

To Reproduce

from datafusion import SessionContext, WindowFrame, col, lit, functions as F
import pyarrow as pa

ctx = SessionContext()

# create a RecordBatch and a new DataFrame from it
batch = pa.RecordBatch.from_arrays(
    [pa.array([1.0, 10.0, 20.0])],
    names=["a"],
)

df = ctx.create_dataframe([[batch]])

window_frame = WindowFrame("rows", None, None)

df = df.select(col("a"), F.window("avg", [col("a")]).alias('no_frame'), F.window("avg", [col("a")], window_frame=window_frame).alias('with_frame'))

df.show()

Produces:

DataFrame()
+------+--------------------+--------------------+
| a    | no_frame           | with_frame         |
+------+--------------------+--------------------+
| 1.0  | 1.0                | 10.333333333333334 |
| 10.0 | 5.5                | 10.333333333333334 |
| 20.0 | 10.333333333333334 | 10.333333333333334 |
+------+--------------------+--------------------+

Expected behavior
When order_by is not specified, default to unbounded preceeding to unbounded following.

Additional context
The offending line of code appears to be here:

https://github.com/apache/datafusion-python/blob/main/src/functions.rs#L230

@timsaucer timsaucer added the bug Something isn't working label May 13, 2024
@timsaucer
Copy link
Contributor Author

This seems like a very easy fix. I think we just match on order_by.is_some() and pick the appropriate default. To close, I think at a minimum we would want (1) a unit test to cover both options and (2) text in the window function that describes this behavior.

@timsaucer
Copy link
Contributor Author

Based on recommendation on discord, we may want to use WindowFrame::new() which has the logic already for checking if order_by exists.

@timsaucer
Copy link
Contributor Author

From @Michael-J-Ward

Doing a little archaelogy on that:

This is the PR where window_frame switched from None to WindowFrame::new(order_by.is_some());

The PR discussion has some talk on test regressions related to it.

https://github.com/apache/datafusion-python/pull/115/files#diff-676f1190ad1b28e57e9eac75de61cccc59420615e3da10f91a493fb48d0711ff

@timsaucer
Copy link
Contributor Author

Plan: Also allow for explicit setting of window frame bounds so we can do things like rows from preceeding unbounded to preceeding 1. That would allow us to do things like "get the last non-null value before the current row". The current interface cannot do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant