Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RowSelection::intersection (#5036) #5041

Merged
merged 2 commits into from
Nov 7, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Nov 6, 2023

Which issue does this PR close?

Closes #5036

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 6, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- this is very nice.

An invariant of RowSelection is that it alternates select and skip, and does not contain empty RowSelector.

I like how you have added documenting this assumption now. Can you explain where this assumption is used in the code?

cc @Ted-Jiang

/// A [`RowSelection`] maintains the following invariants:
///
/// * It contains no [`RowSelector`] of 0 rows
/// * Consecutive [`RowSelector`] alternate whether they skip or select rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// * Consecutive [`RowSelector`] alternate whether they skip or select rows
/// * Consecutive [`RowSelector`]s alternate skipping or selecting rows.

continue;
}

let last = selectors.last_mut().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let last = selectors.last_mut().unwrap();
// Combine adjacent skip/non-skip
let last = selectors.last_mut().unwrap();

@@ -118,10 +123,13 @@ impl RowSelection {
let mut last_end = 0;
for range in ranges {
let len = range.end - range.start;
if len == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the actual bug fix? It also looks like from_selectors_and_combine didn't remove zero length RowSelectors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is part of it, the other part is actually invoking the from_selectors_and_combine logic whilst performing an intersection

r_iter.next().unwrap();
}
(Some(a), Some(b)) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the switch from l,r to a,b confusing -- can we use one consistently throughout this function? Either would be better than a mix

}
res
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this logic carefully -- it is very cool

@tustvold
Copy link
Contributor Author

tustvold commented Nov 6, 2023

Can you explain where this assumption is used in the code

From the linked ticket #5036

The async reader determines what data to fetch based on what rows are selected, however, when reading the data it performs each operation in turn. In order to perform the first skip, the reader must set up the decoders to the relevant position within the pages (as it doesn't know that the next operation is another skip). This in turn causes it to request data that wasn't fetched, and the reader bails out with an offset index error.

@alamb
Copy link
Contributor

alamb commented Nov 6, 2023

Can you explain where this assumption is used in the code

From the linked ticket #5036

The async reader determines what data to fetch based on what rows are selected, however, when reading the data it performs each operation in turn. In order to perform the first skip, the reader must set up the decoders to the relevant position within the pages (as it doesn't know that the next operation is another skip). This in turn causes it to request data that wasn't fetched, and the reader bails out with an offset index error.

I see -- so another potential fix would be to ignore zero length selections in the reader. I am not saying we should do this, only observing it might be a possiblility

@tustvold
Copy link
Contributor Author

tustvold commented Nov 6, 2023

ignore zero length selections in the reader

That would only half fix the problem, as if you have two consecutive skip, you will have the same problem

Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tustvold tustvold merged commit 20f10dc into apache:master Nov 7, 2023
16 checks passed
alamb pushed a commit to alamb/arrow-rs that referenced this pull request Nov 7, 2023
* Fix RowSelection::intersection (apache#5036)

* Review feedback
alamb pushed a commit to alamb/arrow-rs that referenced this pull request Nov 8, 2023
* Fix RowSelection::intersection (apache#5036)

* Review feedback
alamb added a commit that referenced this pull request Nov 8, 2023
* Fix RowSelection::intersection (#5036)

* Review feedback

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RowSelection::intersection Produces Invalid RowSelection
3 participants