feat(7181): add cursor slicing #7798
This is getting closer, but I think there is still an issue with the way this handles the null_threshold.
assert_eq!(a, b);
assert_eq!(a.cmp(&b), Ordering::Equal);

// 2 > NULL
- // 2 > NULL
+ // i32::MIN > NULL
Fixed the null mask to work properly. Explicit test cases pushed. Let me know if it's correct this time.
Self {
    values: values.slice(offset, length),
    offset: 0,
    null_threshold: null_threshold.checked_sub(offset).unwrap_or(0),
I would expect this logic to depend on the null ordering. In particular, if nulls are first I would expect it to decrement by offset, and otherwise by self.len - offset - length, or something...
Calculation is different, and a bit more explicit. Lmk if ok.
…lls_first and nulls_last slicing
fn slice(&self, offset: usize, length: usize) -> Self {
    let FieldCursor {
        values,
        offset: _,
This seems at odds with the behaviour of RowCursor, which takes the current offset into account
We can remove the data slicing of the underlying FieldCursor.values. That slicing is zero-copy over the underlying ScalarBuffer or GenericByteArray.
Would you prefer a switch to using FieldCursor offsets in the same way as the RowCursor?
I don't see an issue with slicing the underlying values. My observation is that the following will behave differently between RowCursor and FieldCursor:
cursor.advance();
cursor.slice(1, 2);
In the case of RowCursor it will produce a slice that is offset by 2 from the start, whereas FieldCursor will produce one that is only offset by 1? I think...
It should just be a case of changing this method to use self.offset + offset instead of just offset.
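To make the advance-then-slice difference concrete, here is a minimal sketch using a hypothetical `ToyCursor` over a `Vec<i32>`. It is not DataFusion's `RowCursor` or `FieldCursor`; the type and both method names are assumptions made for illustration only.

```rust
/// Toy stand-in for a cursor over sorted values (hypothetical, not DataFusion's type).
#[derive(Debug, Clone, PartialEq)]
struct ToyCursor {
    values: Vec<i32>,
    /// Index of the current element within `values`.
    offset: usize,
}

impl ToyCursor {
    fn new(values: Vec<i32>) -> Self {
        Self { values, offset: 0 }
    }

    /// Move the cursor forward by one element.
    fn advance(&mut self) {
        self.offset += 1;
    }

    /// Elements that have not been consumed yet.
    fn remaining(&self) -> &[i32] {
        &self.values[self.offset.min(self.values.len())..]
    }

    /// Slice that ignores the current position (the behaviour questioned above).
    /// Panics if the requested window is out of bounds.
    fn slice_ignoring_offset(&self, offset: usize, length: usize) -> Self {
        Self::new(self.values[offset..offset + length].to_vec())
    }

    /// Slice relative to the current position, i.e. starting at `self.offset + offset`.
    /// Panics if the requested window is out of bounds.
    fn slice_from_current(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self::new(self.values[start..start + length].to_vec())
    }
}
```

With values `[10, 20, 30, 40, 50]`, calling `advance()` and then slicing with `(1, 2)` yields `[20, 30]` when the current position is ignored and `[30, 40]` when the slice starts at `self.offset + offset`.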
let shorter_len = self.values.len().saturating_sub(offset + length + 1);
null_threshold.saturating_sub(offset.saturating_sub(shorter_len))
Now that I think about this more, I am unsure why null_threshold.saturating_sub(offset) is incorrect.
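One way to probe this question is a brute-force check. The sketch below assumes a particular meaning for `null_threshold` (with nulls first it is the index of the first non-null value, with nulls last it is the index of the first null value); that meaning, and the helper names `is_null`, `sliced_threshold` and `formula_holds`, are assumptions of this sketch rather than something confirmed by the PR.

```rust
/// Assumed meaning of `null_threshold` (an assumption of this sketch):
/// - nulls first: index of the first non-null value
/// - nulls last: index of the first null value
fn is_null(idx: usize, null_threshold: usize, nulls_first: bool) -> bool {
    (idx < null_threshold) == nulls_first
}

/// Candidate threshold for a slice starting at `offset`.
/// The same formula is used for both null orderings.
fn sliced_threshold(null_threshold: usize, offset: usize) -> usize {
    null_threshold.saturating_sub(offset)
}

/// Exhaustively compares null flags seen through the slice against the
/// flags of the original positions, for every threshold, offset and length
/// that fits in an input of `len` elements.
fn formula_holds(len: usize) -> bool {
    for nulls_first in [true, false] {
        for threshold in 0..=len {
            for offset in 0..=len {
                for length in 0..=(len - offset) {
                    let new_threshold = sliced_threshold(threshold, offset);
                    for i in 0..length {
                        let expected = is_null(offset + i, threshold, nulls_first);
                        let actual = is_null(i, new_threshold, nulls_first);
                        if expected != actual {
                            return false;
                        }
                    }
                }
            }
        }
    }
    true
}
```

Under that assumed meaning, `formula_holds` returns `true` for small inputs: shifting the threshold by `offset` and saturating at zero preserves the null flag of every index inside the slice, for nulls first and nulls last alike.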
@@ -284,6 +333,34 @@ impl<T: FieldValues> Cursor for FieldCursor<T> {
        self.offset += 1;
        t
    }

    fn slice(&self, offset: usize, length: usize) -> Self {
What would happen if this method was simply
Self {
    values: self.values.slice(0, self.offset + offset + length),
    offset: self.offset + offset,
    null_threshold: self.null_threshold,
}
Or equivalently (I think)
Self {
    values: self.values.slice(offset + self.offset, length),
    offset: 0,
    null_threshold: self.null_threshold.saturating_sub(offset + self.offset),
}
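The claimed equivalence of the two forms can be checked on a toy model. The sketch below uses a hypothetical `ToyFieldCursor` backed by a `Vec<i32>`, with `null_threshold` assumed to mean "first non-null index" when nulls are first and "first null index" when nulls are last. The type, its `nulls_first` field and the method names are all assumptions, not DataFusion's `FieldCursor`.

```rust
/// Toy model of a field cursor (hypothetical, for illustration only).
#[derive(Debug, Clone)]
struct ToyFieldCursor {
    values: Vec<i32>,
    offset: usize,
    /// Assumed: first non-null index if nulls first, else first null index.
    null_threshold: usize,
    nulls_first: bool,
}

impl ToyFieldCursor {
    fn is_null(&self, idx: usize) -> bool {
        (idx < self.null_threshold) == self.nulls_first
    }

    /// First form: keep the prefix of the values, move the offset,
    /// leave the threshold untouched.
    fn slice_keep_prefix(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[..start + length].to_vec(),
            offset: start,
            null_threshold: self.null_threshold,
            nulls_first: self.nulls_first,
        }
    }

    /// Second form: slice the values, reset the offset to 0,
    /// shift the threshold by the same amount.
    fn slice_rebase(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[start..start + length].to_vec(),
            offset: 0,
            null_threshold: self.null_threshold.saturating_sub(start),
            nulls_first: self.nulls_first,
        }
    }

    /// The (is_null, value) pairs the cursor would still yield.
    fn remaining(&self) -> Vec<(bool, i32)> {
        (self.offset..self.values.len())
            .map(|i| (self.is_null(i), self.values[i]))
            .collect()
    }
}
```

In this model both forms yield the same remaining `(is_null, value)` pairs; the first retains the already-consumed prefix of the values, while the second drops it and rebases the threshold.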
The RowCursor slicing does not slice the underlying rows, therefore it is self.offset + offset.
Whereas the FieldCursor does slice the underlying data, and therefore the offset is reset to 0.
But the logic below simply ignores the value of self.offset?
We changed the abstractions, and are now separating the Cursor from the CursorValues. After this PR merges, we will add slicing to the CursorValues.
Which issue does this PR close?
Adds cursor slicing as a prerequisite for the cascading merge.
Part of #7181
Rationale for this change
The need for a sliced cursor is described here, in its later use of partially yielded record batches.
What changes are included in this PR?
slice() in Cursor interface
num_rows() in Cursor interface. Used here and later in the cascaded merge.
Are these changes tested?
Yes.
Primitive cursor slicing is unit tested here.
Row cursor slicing is tested/used in the cascading merge.
Are there any user-facing changes?
No. Cursor interface is crate private.