Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GenericByteViewArray::gc for compacting StringViewArray and ByteViewArray #5707

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ClSlaid
Copy link
Contributor

@ClSlaid ClSlaid commented May 2, 2024

Which issue does this PR close?

Closes #5513.

This PR is being splited into:

Rationale for this change

What changes are included in this PR?

Adding a new gc method for StringViewArray and ByteViewArray.

Are there any user-facing changes?

None.

part 1: implement checker to check if current buffer is compact

Signed-off-by: 蔡略 <cailue@bupt.edu.cn>
part2: implement actual gc algorithm

Signed-off-by: 蔡略 <cailue@bupt.edu.cn>
@github-actions github-actions bot added the arrow Changes to the arrow crate label May 2, 2024
@ClSlaid ClSlaid marked this pull request as draft May 2, 2024 08:35
Signed-off-by: 蔡略 <cailue@bupt.edu.cn>
@ClSlaid ClSlaid marked this pull request as ready for review May 2, 2024 08:41
@ClSlaid
Copy link
Contributor Author

ClSlaid commented May 2, 2024

The naming of gc might persuade people that it should modify itself. So should I:

making gc in place and edit itself

ou or

renaming gc to compact?

@alamb
Copy link
Contributor

alamb commented May 2, 2024

Thank you -- looking forward to this one.

Comment on lines +595 to +606
/// A helper struct that used to check if the array is compact view
///
/// The checker is lazy and will not check the array until `finish` is called.
///
/// This is based on the assumption that the array will most likely to be not compact,
/// so it is likely to scan the entire array.
///
/// Then it is better to do the check at once, rather than doing it for each accumulate operation.
struct CompactChecker {
length: usize,
intervals: BTreeMap<usize, usize>,
}
Copy link
Contributor Author

@ClSlaid ClSlaid May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a better algorithm for this.
It's the most straightforward and simplest way I could come up with, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we pull the "compact checking" algorithm into a new PR to discuss it -- I am not sure about the assumption that StringViewArrays will mostly often be compacted (I would actually expect the opposite)

Signed-off-by: 蔡略 <cailue@bupt.edu.cn>
}

let mut new_views = Vec::with_capacity(self.views.len());
let mut new_bufs: Vec<Vec<u8>> = vec![vec![]; self.buffers.len()];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of buffers may shrink after gc. Every buffer should be filled up to block_size.

See GenericByteViewBuilder::append_value.

@alamb alamb changed the title feat: byte view array gc Implement GenericByteViewArray::gc for compacting StringViewArray and ByteViewArray May 4, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contributon @ClSlaid and for the review @RinChanNOWWW

My biggest suggestion is to split the code in this PR into two PRs -- one for the "compact_check" and one for the actual compaction.

I had some suggestions on some additional coverage needed for the gc, but I think it is looking quite close.

Thanks again!

Comment on lines +595 to +606
/// A helper struct that used to check if the array is compact view
///
/// The checker is lazy and will not check the array until `finish` is called.
///
/// This is based on the assumption that the array will most likely to be not compact,
/// so it is likely to scan the entire array.
///
/// Then it is better to do the check at once, rather than doing it for each accumulate operation.
struct CompactChecker {
length: usize,
intervals: BTreeMap<usize, usize>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we pull the "compact checking" algorithm into a new PR to discuss it -- I am not sure about the assumption that StringViewArrays will mostly often be compacted (I would actually expect the opposite)

Comment on lines +951 to +953
let mut view = ByteView::from(array.views[1]);
view.length = 15;
let new_views = ScalarBuffer::from(vec![array.views[0], view.into()]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dn't undertand what this is doing

I think you can more easily create a stringview with multiple buffers like this:

// Use a small capacity so we end up with multiple views
let mut builder = StringViewBuilder::with_capacity(20);
builder.append_value("hello");
builder.append_null();
builder.append_value("longer than 12 bytes");
builder.append_value("another than 12 bytes");
builder.append_null();
builder.append_value("small");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with the rest of the code, so I made it out brutally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that StringViewArrays will mostly often be compacted (I would actually expect the opposite)

We assume the same idea, it's likely to be not compacted.

@ClSlaid
Copy link
Contributor Author

ClSlaid commented May 4, 2024

Thanks for your review @alamb, and I'd left this PR to a draft and reopen two PRs for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add gcgarbage collector support for StringViewArray and BinaryViewArray
3 participants