feat(7181): add cursor slicing #7798

Closed

wiedld wants to merge 7 commits into apache:main from wiedld:7181/add-cursor-slicing

Contributor

wiedld commented Oct 11, 2023

Which issue does this PR close?

Adds cursor slicing as a prerequisite for the cascading merge.

Part of #7181

Rationale for this change

The need for a sliced cursor is described here, in its later use for partially yielded record batches.

What changes are included in this PR?

Wrap the Rows in RowCursor in an Arc, so that slice() can clone the Arc.
Define and implement slice() in the Cursor interface.
Add a test for the primitive cursor.
Define and implement num_rows() in the Cursor interface. It is used here and later in the cascaded merge.
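
The Arc-based slicing listed above can be illustrated with a minimal sketch. This is not the PR's code: `ToyRowCursor`, its fields, and the `Vec<Vec<u8>>` stand-in for `Rows` are hypothetical, and `num_rows()` is assumed here to mean the rows still visible through the cursor.

```rust
use std::sync::Arc;

/// Toy stand-in for a row-based cursor (hypothetical, not DataFusion's type).
/// The rows live behind an `Arc`; the cursor only tracks a window into them.
#[derive(Debug, Clone)]
struct ToyRowCursor {
    rows: Arc<Vec<Vec<u8>>>, // hypothetical stand-in for the shared `Rows`
    offset: usize,           // current position, absolute index into `rows`
    end: usize,              // one past the last row visible to this cursor
}

impl ToyRowCursor {
    fn new(rows: Vec<Vec<u8>>) -> Self {
        let end = rows.len();
        Self { rows: Arc::new(rows), offset: 0, end }
    }

    /// Rows still visible through this cursor (assumed meaning of `num_rows`).
    fn num_rows(&self) -> usize {
        self.end - self.offset
    }

    /// The row at the current position.
    fn current(&self) -> &[u8] {
        &self.rows[self.offset]
    }

    /// Move to the next row, returning the position that was just left.
    fn advance(&mut self) -> usize {
        let t = self.offset;
        self.offset += 1;
        t
    }

    /// Slice relative to the current position. Only the `Arc` is cloned, so
    /// no row data is copied. Panics if the window is out of bounds.
    fn slice(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        let end = start + length;
        assert!(end <= self.end, "slice out of bounds");
        Self { rows: Arc::clone(&self.rows), offset: start, end }
    }
}
```

Because the slice shares the same allocation, `Arc::ptr_eq(&cursor.rows, &cursor.slice(1, 2).rows)` holds in this sketch.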

Are these changes tested?

yes
Primitive cursor slicing is unit tested here.
Row cursor slicing is tested/used in the cascading merge.

Are there any user-facing changes?

No. Cursor interface is crate private.

wiedld added 3 commits

October 11, 2023 17:24


          feat(7181): add slice() to Cursor trait

700a53d


          fix(7181): have RowCursor slicing be within the a single arc-refed Rows

0c66196


          test(7181): cursor slicing

78e154c

wiedld marked this pull request as ready for review

October 11, 2023 22:28

tustvold reviewed

datafusion/physical-plan/src/sorts/cursor.rs Outdated

tustvold reviewed

datafusion/physical-plan/src/sorts/cursor.rs Outdated

wiedld added 2 commits

October 12, 2023 13:56


          fix(7181): cursor slice should panic, not return result

6a946a5


          fix(7181): handle nulls_first sort option, with cursor slicing

tustvold reviewed

Contributor

tustvold left a comment

This is getting closer, but I think there is still an issue with the way this handles the null_threshold.

datafusion/physical-plan/src/sorts/cursor.rs Outdated

datafusion/physical-plan/src/sorts/cursor.rs Outdated

+                      assert_eq!(a, b);
+                      assert_eq!(a.cmp(&b), Ordering::Equal);
+                      // 2 > NULL

Contributor

tustvold Oct 13, 2023

Suggested change

-                      // 2 > NULL
+                      // i32::MIN > NULL

Contributor Author

wiedld Oct 16, 2023

Fixed the null mask to work properly. Explicit test cases pushed. Let me know if it's correct this time.

datafusion/physical-plan/src/sorts/cursor.rs Outdated

datafusion/physical-plan/src/sorts/cursor.rs Outdated

datafusion/physical-plan/src/sorts/cursor.rs

datafusion/physical-plan/src/sorts/cursor.rs Outdated

+                      Self {
+                          values: values.slice(offset, length),
+                          offset: 0,
+                          null_threshold: null_threshold.checked_sub(offset).unwrap_or(0),

Contributor

tustvold Oct 13, 2023

I would expect this logic to depend on the null ordering. In particular I would expect if nulls are first, to decrement by offset, and otherwise by self.len - offset - length or something...

Contributor Author

wiedld Oct 16, 2023

Calculation is different, and a bit more explicit. Lmk if ok.

datafusion/physical-plan/src/sorts/cursor.rs

wiedld added 2 commits

October 15, 2023 19:37


          fix(7181): proper define null mask in test case, and fix tests for nulls_first and nulls_last slicing

ccaf130


          Merge branch 'main' into 7181/add-cursor-slicing

f04930d

tustvold reviewed

datafusion/physical-plan/src/sorts/cursor.rs

+                  fn slice(&self, offset: usize, length: usize) -> Self {
+                      let FieldCursor {
+                          values,
+                          offset: _,

Contributor

tustvold Oct 17, 2023

This seems at odds with the behaviour of RowCursor, which takes the current offset into account

Contributor Author

wiedld Oct 17, 2023

We can remove the data slicing of the underlying FieldCursor.values. That slicing is a zero-copy of the underlying ScalarBuffer or GenericByteArray.

Would you prefer a switch to using FieldCursor offsets in the same way as the RowCursor?

Contributor

tustvold Oct 17, 2023

I don't see an issue with slicing the underlying values. My observation is that the following will behave differently between RowCursor and FieldCursor:

cursor.advance();
cursor.slice(1, 2);

In the case of RowCursor it will produce a slice that is offset by 2 from the start, whereas FieldCursor will produce one that is only offset by 1? I think...

It should just be a case of changing this method to use self.offset + offset instead of just offset
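
The difference described above can be reproduced with a small sketch. It is an illustration only: `ToyFieldCursor` and both slice methods are hypothetical, null handling is left out, and the values are copied with `to_vec()` for simplicity, whereas the thread describes the real slice as zero-copy.

```rust
/// Toy stand-in for a field cursor over i32 values (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct ToyFieldCursor {
    values: Vec<i32>,
    offset: usize, // current position within `values`
}

impl ToyFieldCursor {
    /// Move to the next value, returning the position that was just left.
    fn advance(&mut self) -> usize {
        let t = self.offset;
        self.offset += 1;
        t
    }

    /// Values that have not been consumed yet.
    fn remaining(&self) -> &[i32] {
        &self.values[self.offset..]
    }

    /// Slices from the start of `values`, ignoring the current position.
    fn slice_ignoring_position(&self, offset: usize, length: usize) -> Self {
        Self {
            values: self.values[offset..offset + length].to_vec(),
            offset: 0,
        }
    }

    /// Slices relative to the current position, i.e. `self.offset + offset`.
    fn slice_from_position(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[start..start + length].to_vec(),
            offset: 0,
        }
    }
}
```

With values `[10, 20, 30, 40, 50]`, one `advance()` followed by a slice of `(1, 2)` yields `[20, 30]` when the position is ignored and `[30, 40]` when it is taken into account, which is the offset-by-1 versus offset-by-2 discrepancy raised in the comment.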

datafusion/physical-plan/src/sorts/cursor.rs

Comment on lines +348 to +349

		let shorter_len = self.values.len().saturating_sub(offset + length + 1);
		null_threshold.saturating_sub(offset.saturating_sub(shorter_len))

Contributor

tustvold Oct 17, 2023

Now that I think about this more, I am unsure why null_threshold.saturating_sub(offset) is incorrect
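
One way to probe this question is a brute-force check. The sketch below assumes, and this is an assumption not stated in the thread, that a position is null exactly when `(pos < null_threshold) == nulls_first`: nulls occupy the indices below the threshold when they sort first, and the indices at or above it when they sort last. All function names are hypothetical.

```rust
/// Assumed null test: with nulls first, positions below the threshold are
/// null; with nulls last, positions at or above the threshold are null.
fn is_null(pos: usize, null_threshold: usize, nulls_first: bool) -> bool {
    (pos < null_threshold) == nulls_first
}

/// Candidate threshold for a slice that starts `offset` positions in.
fn sliced_threshold(null_threshold: usize, offset: usize) -> usize {
    null_threshold.saturating_sub(offset)
}

/// For every window of a `len`-element cursor, compare the null flag of each
/// sliced position against the null flag of the same element in the original.
fn threshold_is_consistent(len: usize, null_threshold: usize, nulls_first: bool) -> bool {
    for offset in 0..=len {
        for length in 0..=(len - offset) {
            let new_threshold = sliced_threshold(null_threshold, offset);
            for i in 0..length {
                let in_slice = is_null(i, new_threshold, nulls_first);
                let in_original = is_null(offset + i, null_threshold, nulls_first);
                if in_slice != in_original {
                    return false;
                }
            }
        }
    }
    true
}
```

Under that assumed null test, the check passes for both null orderings, which is consistent with the comment's suspicion. It says nothing about thresholds defined differently.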

tustvold reviewed

datafusion/physical-plan/src/sorts/cursor.rs

@@ -284,6 +333,34 @@ impl<T: FieldValues> Cursor for FieldCursor<T> {
                       self.offset += 1;
                       t
                   }
+                  fn slice(&self, offset: usize, length: usize) -> Self {

Contributor

tustvold Oct 17, 2023

What would happen if this method was simply

Self {
    values: self.values.slice(0, self.offset + offset + length),
    offset: self.offset + offset,
    null_threshold: self.null_threshold,
}

Or equivalently (I think)

Self {
    values: self.values.slice(offset + self.offset, length),
    offset: 0,
    null_threshold: self.null_threshold.saturating_sub(offset + self.offset),
}
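
The claimed equivalence of the two forms can be checked on a toy model. This is a sketch under assumptions: `ToyCursor` is hypothetical, values are copied rather than sliced zero-copy, and the null test `(offset < null_threshold) == nulls_first` is assumed rather than taken from the PR.

```rust
/// Toy cursor carrying the three fields used in the two proposals, plus the
/// null ordering (hypothetical type, for illustration only).
#[derive(Debug, Clone)]
struct ToyCursor {
    values: Vec<i32>,
    offset: usize,
    null_threshold: usize,
    nulls_first: bool,
}

impl ToyCursor {
    fn is_finished(&self) -> bool {
        self.offset == self.values.len()
    }

    /// Assumed null test for the current position.
    fn is_null(&self) -> bool {
        (self.offset < self.null_threshold) == self.nulls_first
    }

    fn advance(&mut self) {
        self.offset += 1;
    }

    /// First form: keep the prefix, move the offset, keep the threshold.
    fn slice_keep_prefix(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[..start + length].to_vec(),
            offset: start,
            null_threshold: self.null_threshold,
            nulls_first: self.nulls_first,
        }
    }

    /// Second form: drop the prefix, reset the offset, shift the threshold.
    fn slice_rebase(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[start..start + length].to_vec(),
            offset: 0,
            null_threshold: self.null_threshold.saturating_sub(start),
            nulls_first: self.nulls_first,
        }
    }

    /// Walk the cursor to the end, recording (value, is_null) per position.
    fn drain(mut self) -> Vec<(i32, bool)> {
        let mut out = Vec::new();
        while !self.is_finished() {
            out.push((self.values[self.offset], self.is_null()));
            self.advance();
        }
        out
    }
}
```

In this model, draining `slice_keep_prefix(o, n)` and `slice_rebase(o, n)` yields the same sequence of values and null flags for any threshold and either null ordering, so the two forms differ only in whether the consumed prefix is retained.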

Contributor Author

wiedld Oct 17, 2023

The RowCursor slicing does not slice the underlying rows, therefore it is self.offset + offset.
Whereas the FieldCursor does slice the underlying data, and therefore the offset is reset to 0.

Contributor

tustvold Oct 17, 2023

But the logic below simply ignores the value of self.offset?

tustvold mentioned this pull request

Add CursorValues Decoupling Cursor Data from Cursor Position #7855

Merged

Contributor Author

wiedld commented Oct 19, 2023

We changed the abstractions, and are now separating the Cursor from the CursorValues. After that PR merges, I will add slicing to the CursorValues.

wiedld closed this

wiedld deleted the 7181/add-cursor-slicing branch

October 24, 2023 06:27
