Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788

Merged
merged 9 commits into from
Jun 10, 2024

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Jun 7, 2024

The binary 1.1 reader can now read FlexSym-encoded annotations sequences. I have also added some application-level roundtripping unit tests to identify places where the reader and writer may disagree. Because the tests are application-level, the writer generates an LST for several of the tests; quite a bit of the API is being exercised.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@zslayton zslayton requested a review from nirosys June 7, 2024 20:57
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour 🧭

@@ -745,7 +747,7 @@ impl<'a> ImmutableBuffer<'a> {
}

lazy_value.encoded_value.annotations_header_length = wrapper.header_length;
lazy_value.encoded_value.annotations_sequence_length = wrapper.sequence_length;
lazy_value.encoded_value.annotations_sequence_length = wrapper.sequence_length as u16;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I widened the annotations sequence length from a u8 to a u16 since the annotations in v1.1 may have inline text.

pub fn read_annotations_wrapper(&self, _opcode: Opcode) -> ParseResult<'a, AnnotationsWrapper> {
todo!();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I removed this stub (that I added a while ago) because v1.1 annotations don't involve a 'wrapper' like v1.0 annotations.

Comment on lines +342 to +348
value.encoded_value.annotations_header_length = annotations_seq.header_length;
value.encoded_value.annotations_sequence_length = annotations_seq.sequence_length;
value.encoded_value.annotations_encoding = annotations_seq.encoding;
value.encoded_value.total_length +=
annotations_seq.header_length as usize + annotations_seq.sequence_length as usize;
// Rewind the input to include the annotations sequence
value.input = self;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ After reading the value, this method goes back and overwrites all of the annotations-related or -affected fields.

Comment on lines +539 to +549
#[case("2023-10-15T05:04Z", &[0x83, 0x35, 0x7D, 0x85, 0x00])]
#[case("2023-10-15T05:04:03Z", &[0x84, 0x35, 0x7D, 0x85, 0x30, 0x00])]
#[case("2023-10-15T05:04:03.123-00:00", &[0x85, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01])]
#[case("2023-10-15T05:04:03.000123-00:00", &[0x86, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01, 0x00])]
#[case("2023-10-15T05:04:03.000000123-00:00", &[0x87, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01, 0x00, 0x00])]
#[case("2023-10-15T05:04+01:00", &[0x88, 0x35, 0x7D, 0x85, 0xE0, 0x01])]
#[case("2023-10-15T05:04-01:00", &[0x88, 0x35, 0x7D, 0x85, 0xA0, 0x01])]
#[case("2023-10-15T05:04:03+01:00", &[0x89, 0x35, 0x7D, 0x85, 0xE0, 0x0D])]
#[case("2023-10-15T05:04:03.123+01:00", &[0x8A, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00])]
#[case("2023-10-15T05:04:03.000123+01:00", &[0x8B, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00, 0x00])]
#[case("2023-10-15T05:04:03.000000123+01:00", &[0x8C, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00, 0x00, 0x00])]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The reader was treating the offset bits in the encoding as 2s complement bits; however, they are an unsigned integer representing the number of quarter hours from -56 (offset -14:00). (Relevant spec entry.)

@@ -266,7 +255,7 @@ impl<'top> LazyRawBinaryValue_1_1<'top> {
(OpcodeType::Integer, 0x0) => 0.into(),
(OpcodeType::Integer, n) => {
// We have n bytes following that make up our integer.
self.input.consume(1).read_fixed_int(n)?.0.into()
self.available_body().read_fixed_int(n)?.0.into()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Adding annotations support meant changing a few usages of self.input because that buffer contains the annotations sequence (previously empty) and the value itself.

Comment on lines -316 to +317
None => (1, 1, 0, 0), // Unknown offset uses a single bit (1); opcode and length stay the same.
Some(0) => (1, 0, 0, 0), // UTC uses a single bit (0); opcode and length stay the same.
None => (1, 0, 0, 0), // Unknown offset uses a single bit (0); opcode and length stay the same.
Some(0) => (1, 1, 0, 0), // UTC uses a single bit (1); opcode and length stay the same.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ We only recently defined which offset bit value meant UTC and which meant unknown. I implemented this code long ago and guessed wrong. 🙃

Comment on lines -1159 to +1162
&[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_0000],
&[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_1000],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Updating all of the timestamp encoding tests to use UTC=1, Unknown=0.

@@ -2797,4 +2800,170 @@ mod tests {
)?;
Ok(())
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This section adds our first application-level roundtrip testing for binary 1.1. Many of the tests involve symbols/annotations/field names which (for the moment) are encoded as symbol IDs. This means an LST is written and read in addition to the values in the text. It exercises a good chunk of the APIs.

Comment on lines +636 to +643
pub fn has_annotations(&self) -> bool {
use ExpandedValueSource::*;
match &self.source {
ValueLiteral(value) => value.has_annotations(),
Template(_, element) => !element.annotations().is_empty(),
Constructed(annotations, _) => !annotations.is_empty(),
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ When I started implementing this, LazyValue's has_annotations (a bit further below) would call annotations() and see if it produced anything. However, annotations was a todo!() impl, so it would panic.

This implementation doesn't call the iterator and should be cheaper.

Comment on lines +315 to +320
if lazy_value.has_annotations() {
let annotations: Annotations = lazy_value.annotations().try_into()?;
Ok(value.with_annotations(annotations))
} else {
Ok(value.into())
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This only calls annotations() if has_annotations() returns true.

@zslayton zslayton requested a review from popematt June 7, 2024 21:28
let raw_symbol = match flex_sym.value() {
FlexSymValue::SymbolRef(raw_symbol) => raw_symbol,
FlexSymValue::Opcode(_) => {
todo!("FlexSym escapes in annotation sequences")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are $0 and '' going to be handled here or is that handled elsewhere so that it falls under the SymbolRef case?

Copy link
Contributor Author

@zslayton zslayton Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirosys can comment on his intent, but I believe that's the idea.

// can actually do the reading. That optimization would be impactful for FlexSyms
// that represent inline text.
let (sequence, remaining_input) = match opcode.length_code {
7 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is misleading that we're matching length_code here, and the length is actually unknown number of bytes or known to be 1 FlexSym—but the number we're matching is 7. Can we create an issue to rename length_code to low_nibble or something else like that? Or, if length_code is accurate for everything else, then maybe we figure out a way to set the length code to be something representative of the actual length. Alternately, we could just match the whole opcode byte here. Readability would improve if we were matching against E7, E8, etc.

Copy link
Contributor Author

@zslayton zslayton Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. It's length_code in like ... 80% of cases.

OpcodeType is (intentionally) a coarse enumeration:

pub enum OpcodeType {
EExpressionWithAddress, // 0x00-0x4F -
EExpressionAddressFollows, // 0x40-0x4F -
Integer, // 0x60-0x68 - Integer up to 8 bytes wide
Float, // 0x6A-0x6D -
Boolean, // 0x6E-0x6F -
Decimal, // 0x70-0x7F -
TimestampShort, // 0x80-0x8F -
String, // 0x90-0x9F -
InlineSymbol, // 0xA0-0xAF -
List, // 0xB0-0xBF -
SExpression, // 0xC0-0xCF -
StructEmpty, // 0xD0 -
// 0xD1 reserved
Struct, // 0xD2-0xDF -
IonVersionMarker, // 0xE0 -
SymbolAddress, // 0xE1-0xE3 -
AnnotationSymAddress, // 0xE4-0xE6 -
AnnotationFlexSym, // 0xE7-0xE9 -
NullNull, // 0xEA -
TypedNull, // 0xEB -
Nop, // 0xEC-0xED -
// Reserved
SystemMacroInvoke, // 0xEF -
// 0xF0 delimited container end
// 0xF1 delimited list start
// 0xF2 delimited s-expression start
// 0xF3 delimited struct start
LargeInteger, // 0xF6 - Integer preceded by FlexUInt length
Blob, // 0xFE -
Clob, // 0xFF -
// 0xF8 Long decimal
TimestampLong, // 0xF8 - Long-form Timestamp
// 0xF9 - Long string
// 0xFA - FlexSym symbol
// 0xFB - Long list
// 0xFC - Long sexp
// 0xFD - Long struct
Invalid, // Represents an encoded value that does not match a defined opcode.
}

It works well in that questions like "is it a float?" are easily answered without range checks.

I'm going to rename length_code to low_nibble, which is always accurate.

EDIT: done in 2cef814

let sequence = EncodedAnnotations {
encoding: AnnotationsEncoding::FlexSym,
header_length: u8::try_from(1 + flex_uint.size_in_bytes()).map_err(|_| {
IonError::decoding_error("found a 256+ byte annotations header")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can FlexUInt even read a number this large? I thought it was limited to 128 bits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't; the current implementation would always raise an error during the FlexUInt read. Would you like to see this be an unreachable! instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as-is for the moment. If profiling shows that an unreachable! or a cold_code! would help out here I'll change it.

@zslayton zslayton requested a review from popematt June 8, 2024 14:55
@zslayton zslayton merged commit 81e5fd9 into main Jun 10, 2024
32 checks passed
@zslayton zslayton deleted the binary-1_1-roundtrip branch June 10, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants