Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(7181): add cursor slicing #7798

Closed
wants to merge 7 commits into from

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Oct 11, 2023

Which issue does this PR close?

Adds cursor slicing as a prerequisite for the cascading merge.

Part of #7181

Rationale for this change

The need for a sliced cursor is described here, in it's later use of partially yielded record batches.

What changes are included in this PR?

  • Arc around the Rows in RowCursor, so can arc clone on slice.
  • define and impl slice() in Cursor interface
  • test for primitive cursor
  • define and impl num_rows() in Cursor interface. Used here and later in the cascaded merge.

Are these changes tested?

yes
Primitive cursor slicing is unit tested here.
Row cursor slicing is tested/used in the cascading merge.

Are there any user-facing changes?

No. Cursor interface is crate private.

@wiedld wiedld marked this pull request as ready for review October 11, 2023 22:28
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer, I think there is still an issue with the way this handles the null_threshold

datafusion/physical-plan/src/sorts/cursor.rs Outdated Show resolved Hide resolved
assert_eq!(a, b);
assert_eq!(a.cmp(&b), Ordering::Equal);

// 2 > NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// 2 > NULL
// i32::MIN > NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the null mask to work properly. Explicit test cases pushed. Let me know if it's correct this time.

datafusion/physical-plan/src/sorts/cursor.rs Outdated Show resolved Hide resolved
datafusion/physical-plan/src/sorts/cursor.rs Outdated Show resolved Hide resolved
datafusion/physical-plan/src/sorts/cursor.rs Show resolved Hide resolved
Self {
values: values.slice(offset, length),
offset: 0,
null_threshold: null_threshold.checked_sub(offset).unwrap_or(0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this logic to depend on the null ordering. In particular I would expect if nulls are first, to decrement by offset, and otherwise by self.len - offset - length or something...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculation is different, and a bit more explicit. Lmk if ok.

datafusion/physical-plan/src/sorts/cursor.rs Show resolved Hide resolved
fn slice(&self, offset: usize, length: usize) -> Self {
let FieldCursor {
values,
offset: _,
Copy link
Contributor

@tustvold tustvold Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems at odds with the behaviour of RowCursor, which takes the current offset into account

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the data slicing of the underlying FieldCursor.values. That slicing is a zero-copy of the underlying ScalarBuffer or GenericByteArray.

Would you prefer a switch to using FieldCursor offsets in the same as the RowCursor?

Copy link
Contributor

@tustvold tustvold Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an issue with slicing the underlying values, my observation is that the following will behave differently between RowCursor and FieldCursor

cursor.advance();
cursor.slice(1, 2);

In the case of RowCursor it will produce a slice that is offset by 2 from the start, whereas FieldCursor will produce one that is only offset by 1? I think...

It should just be a case of changing this method to use self.offset + offset instead of just offset

Comment on lines +348 to +349
let shorter_len = self.values.len().saturating_sub(offset + length + 1);
null_threshold.saturating_sub(offset.saturating_sub(shorter_len))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about this more, I am unsure why null_threshold.saturating_sub(offset) is incorrect

@@ -284,6 +333,34 @@ impl<T: FieldValues> Cursor for FieldCursor<T> {
self.offset += 1;
t
}

fn slice(&self, offset: usize, length: usize) -> Self {
Copy link
Contributor

@tustvold tustvold Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if this method was simply

Self {
    values: self.values.slice(0, self.offset + offset + length),
    offset: self.offset + offset
    null_threshold: self.null_threshold,
}

Or equivalently (I think)

Self {
    values: self.values.slice(offset + self.offset, length),
    offset: 0
    null_threshold: self.null_threshold.saturating_sub(offset + self.offset),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RowCursor slicing does not slice the underlying rows, therefore it is self.offset + offset.
Whereas the FieldCursor does slice the underlying data, and therefore the offset is reset to 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the logic below simply ignores the value of self.offset?

@wiedld
Copy link
Contributor Author

wiedld commented Oct 19, 2023

We changed the abstractions, and are now separating the Cursor from the CursorValues. After this PR merges, will add slicing to the CursorValues.

@wiedld wiedld closed this Oct 19, 2023
@wiedld wiedld deleted the 7181/add-cursor-slicing branch October 24, 2023 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants