-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix missing table rows #110
Conversation
@Swiftb0y I would appreciate your review :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple mostly superficial comments. I have not dug deeply into Deep-Symmetry/crate-digger#32 so I can't really assert how well this implementation aligns with the spec.
Very happy to see some activity here btw.
tests/test_pdb_num_rows.rs
Outdated
let actual_row_count: usize = pages | ||
.into_iter() | ||
.flat_map(|page| page.row_groups.into_iter()) | ||
.map(|row_group| row_group.present_rows().collect::<Vec<Row>>().len()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid temporary objects.
.map(|row_group| row_group.present_rows().collect::<Vec<Row>>().len()) | |
.map(|row_group| row_group.present_rows().count()) |
pub struct RowGroup { | ||
/// An offset which points to a row in the table, whose actual presence is controlled by one of the | ||
/// bits in `row_present_flags`. This instance allows the row itself to be lazily loaded, unless it | ||
/// is not present, in which case there is no content to be loaded. | ||
#[br(args { count: num_rows.into(), inner: FilePtrArgs { inner: (page_type,), offset: page_heap_offset }})] | ||
rows: Vec<FilePtr16<Row>>, | ||
rows: [Option<FilePtr16<Row>>; Self::MAX_ROW_COUNT], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you considered https://docs.rs/binrw/latest/binrw/file_ptr/index.html#best-practices ?
src/pdb/mod.rs
Outdated
} else { | ||
needs_seek = true; | ||
} | ||
} | ||
|
||
Ok(RowGroup { | ||
rows, | ||
row_presence_flags, | ||
unknown, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to restore the seek position afterwards? To where?
src/pdb/mod.rs
Outdated
for i in (0..RowGroup::MAX_ROW_COUNT).rev() { | ||
let row_present = row_presence_flags & (1 << i) != 0; | ||
if row_present { | ||
if needs_seek { | ||
let index = u64::try_from(i).map_err(|_| binrw::Error::AssertFail { | ||
pos: stream_position, | ||
message: format!("Failed to calculate row index {}", i), | ||
})?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the try_from
seems unnecessary. just cast it to u64
upfront. Interestingly rustc
does not catch integer overflow when casting constants to narrower types... That seems like a compiler bug? Anyways, we're widening here so its fine IMO.
for i in (0..RowGroup::MAX_ROW_COUNT).rev() { | |
let row_present = row_presence_flags & (1 << i) != 0; | |
if row_present { | |
if needs_seek { | |
let index = u64::try_from(i).map_err(|_| binrw::Error::AssertFail { | |
pos: stream_position, | |
message: format!("Failed to calculate row index {}", i), | |
})?; | |
for index in (0..RowGroup::MAX_ROW_COUNT as u64).rev() { | |
let row_present = row_presence_flags & (1 << index) != 0; | |
if row_present { | |
if needs_seek { |
}); | ||
} | ||
|
||
let mut needs_seek = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the "seek caching" necessary here? Can we simplify the code by always seeking unconditionally even if it turns out to be a noop?
src/pdb/mod.rs
Outdated
let estimated_number_of_row_groups = usize::from(num_rows) | ||
.checked_sub(1) | ||
.map(|x| x / RowGroup::MAX_ROW_COUNT) | ||
.and_then(|x| x.checked_add(1)) | ||
.ok_or_else(|| binrw::Error::AssertFail { | ||
pos: stream_position, | ||
message: "Failed to calculate number of row groups".to_string(), | ||
})?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't see how this could go wrong. Do you?
let estimated_number_of_row_groups = usize::from(num_rows) | |
.checked_sub(1) | |
.map(|x| x / RowGroup::MAX_ROW_COUNT) | |
.and_then(|x| x.checked_add(1)) | |
.ok_or_else(|| binrw::Error::AssertFail { | |
pos: stream_position, | |
message: "Failed to calculate number of row groups".to_string(), | |
})?; | |
let estimated_number_of_row_groups = usize::from(num_rows).div_ceil(RowGroup::MAX_ROW_COUNT); |
u64::try_from(row_groups.len()) | ||
.ok() | ||
.and_then(|index| index.checked_mul(36)) | ||
.and_then(|x| stream_position.checked_sub(x)) | ||
.map(SeekFrom::Start) | ||
.ok_or_else(|| binrw::Error::AssertFail { | ||
pos: stream_position, | ||
message: format!("Failed to calculate seek position for row group {}", i), | ||
})?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm a little narrow minded from too much C++ programming but this feels overcautious. row_groups.len()
is a usize
and will pretty much always fit into a u64
, right? Are there architectures where usize is bigger than a u64? can we static_assert
that (is there such a thing in rust
)?
The questions Jan Holthuis raised in the Crate Digger issue make me concerned that we are not calculating the number of row groups correctly, and perhaps we simply can’t calculate that in advance at all, because we don’t know how many rows have been invalidated in each row group, so we may simply need to scan row groups until we have seen |
fyi, I can't resolve my review comments, so take the entire review as addressed after merging #111 |
* rename `stream_position` to `row_group_end_position` * formulate `needs_seek` in terms of `row_present`, making it clear that reading the next row is only needed when the current one is not not present.
Fixes #107.
Also see Deep-Symmetry/crate-digger#32.