Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CursorValues Decoupling Cursor Data from Cursor Position #7855

Merged
merged 5 commits into from
Oct 19, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

#7798 proposed making the cursors themselves sliceable, this resulted in a potentially surprising interface where slicing a cursor would reset its position. This is necessary because a cascading merge sort needs to re-visit rows from previous passes.

Instead of introducing a notion of cursor slicing, this PR instead separates the cursor out from the cursor's values. This in turn will allow constructing cursors multiple times on the same set of values, and potentially adding the ability to slice said values. I think this will be easier to follow.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

No, all of these details are crate private

@tustvold tustvold changed the title Decouple cursor storage Add CursorValues Decoupling Cursor Data from Cursor Position Oct 18, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold (and by extension @wiedld ). I think this improves readability significantly.

I left some comment suggestions to make it even better

It is great to see motion here


rows: Rows,
fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please document what these functions mean (even though you could argue they are obvious -- it might help to explain what is expected if l_idx > len, for example.

datafusion/physical-plan/src/sorts/cursor.rs Show resolved Hide resolved
datafusion/physical-plan/src/sorts/cursor.rs Show resolved Hide resolved
datafusion/physical-plan/src/sorts/cursor.rs Show resolved Hide resolved
pub trait FieldArray: Array + 'static {
type Values: FieldValues;
type Values: CursorValues;

fn values(&self) -> Self::Values;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the overload of the use of the term values (e.g. for Dictionary arrays) perhaps we could rename this to cursor_values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Member functions naturally take precedence over trait methods, so should ambiguity arise, it will necessitate the explicit form CursorArray::values(...) and so I don't think this is an issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't thinking about compiler ambiguities, more like cognitive overload for future readers. I agree it is not critical however

datafusion/physical-plan/src/sorts/cursor.rs Outdated Show resolved Hide resolved
}
}

/// An [`Array`] that can be converted into [`FieldValues`]
/// An [`Array`] that can be converted into [`CursorValues`]
pub trait FieldArray: Array + 'static {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the use of Field confusing here -- what about renaming this CursorArray or IntoCursorValues?

///
/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
#[derive(Debug)]
pub struct FieldCursor<T: FieldValues> {
pub struct ArrayValues<T: CursorValues> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice naming change

}
}

/// A collection of sorted, nullable [`FieldArray`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// A collection of sorted, nullable [`FieldArray`]
/// A collection of sorted, nullable [`CursorValues`]

@@ -89,7 +89,7 @@ pub(crate) struct SortPreservingMergeStream<C> {
batch_size: usize,

/// Vector that holds cursors for each non-exhausted input partition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Vector that holds cursors for each non-exhausted input partition
/// Cursors for each input partition. `None` means the input is exhausted

@tustvold tustvold merged commit 37d6bf0 into apache:main Oct 19, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants