-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ PR tour 🧭
@@ -745,7 +747,7 @@ impl<'a> ImmutableBuffer<'a> { | |||
} | |||
|
|||
lazy_value.encoded_value.annotations_header_length = wrapper.header_length; | |||
lazy_value.encoded_value.annotations_sequence_length = wrapper.sequence_length; | |||
lazy_value.encoded_value.annotations_sequence_length = wrapper.sequence_length as u16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ I widened the annotations sequence length from a u8
to a u16
since the annotations in v1.1 may have inline text.
pub fn read_annotations_wrapper(&self, _opcode: Opcode) -> ParseResult<'a, AnnotationsWrapper> { | ||
todo!(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ I removed this stub (that I added a while ago) because v1.1 annotations don't involve a 'wrapper' like v1.0 annotations.
value.encoded_value.annotations_header_length = annotations_seq.header_length; | ||
value.encoded_value.annotations_sequence_length = annotations_seq.sequence_length; | ||
value.encoded_value.annotations_encoding = annotations_seq.encoding; | ||
value.encoded_value.total_length += | ||
annotations_seq.header_length as usize + annotations_seq.sequence_length as usize; | ||
// Rewind the input to include the annotations sequence | ||
value.input = self; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ After reading the value, this method goes back and overwrites all of the annotations-related or -affected fields.
#[case("2023-10-15T05:04Z", &[0x83, 0x35, 0x7D, 0x85, 0x00])] | ||
#[case("2023-10-15T05:04:03Z", &[0x84, 0x35, 0x7D, 0x85, 0x30, 0x00])] | ||
#[case("2023-10-15T05:04:03.123-00:00", &[0x85, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01])] | ||
#[case("2023-10-15T05:04:03.000123-00:00", &[0x86, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01, 0x00])] | ||
#[case("2023-10-15T05:04:03.000000123-00:00", &[0x87, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01, 0x00, 0x00])] | ||
#[case("2023-10-15T05:04+01:00", &[0x88, 0x35, 0x7D, 0x85, 0xE0, 0x01])] | ||
#[case("2023-10-15T05:04-01:00", &[0x88, 0x35, 0x7D, 0x85, 0xA0, 0x01])] | ||
#[case("2023-10-15T05:04:03+01:00", &[0x89, 0x35, 0x7D, 0x85, 0xE0, 0x0D])] | ||
#[case("2023-10-15T05:04:03.123+01:00", &[0x8A, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00])] | ||
#[case("2023-10-15T05:04:03.000123+01:00", &[0x8B, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00, 0x00])] | ||
#[case("2023-10-15T05:04:03.000000123+01:00", &[0x8C, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00, 0x00, 0x00])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ The reader was treating the offset
bits in the encoding as 2s complement bits; however, they are an unsigned integer representing the number of quarter hours from -56
(offset -14:00
). (Relevant spec entry.)
@@ -266,7 +255,7 @@ impl<'top> LazyRawBinaryValue_1_1<'top> { | |||
(OpcodeType::Integer, 0x0) => 0.into(), | |||
(OpcodeType::Integer, n) => { | |||
// We have n bytes following that make up our integer. | |||
self.input.consume(1).read_fixed_int(n)?.0.into() | |||
self.available_body().read_fixed_int(n)?.0.into() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Adding annotations support meant changing a few usages of self.input
because that buffer contains the annotations sequence (previously empty) and the value itself.
None => (1, 1, 0, 0), // Unknown offset uses a single bit (1); opcode and length stay the same. | ||
Some(0) => (1, 0, 0, 0), // UTC uses a single bit (0); opcode and length stay the same. | ||
None => (1, 0, 0, 0), // Unknown offset uses a single bit (0); opcode and length stay the same. | ||
Some(0) => (1, 1, 0, 0), // UTC uses a single bit (1); opcode and length stay the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ We only recently defined which offset
bit value meant UTC
and which meant unknown
. I implemented this code long ago and guessed wrong. 🙃
&[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_0000], | ||
&[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_1000], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Updating all of the timestamp encoding tests to use UTC=1
, Unknown=0
.
@@ -2797,4 +2800,170 @@ mod tests { | |||
)?; | |||
Ok(()) | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This section adds our first application-level roundtrip testing for binary 1.1. Many of the tests involve symbols/annotations/field names which (for the moment) are encoded as symbol IDs. This means an LST is written and read in addition to the values in the text. It exercises a good chunk of the APIs.
pub fn has_annotations(&self) -> bool { | ||
use ExpandedValueSource::*; | ||
match &self.source { | ||
ValueLiteral(value) => value.has_annotations(), | ||
Template(_, element) => !element.annotations().is_empty(), | ||
Constructed(annotations, _) => !annotations.is_empty(), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ When I started implementing this, LazyValue
's has_annotations
(a bit further below) would call annotations()
and see if it produced anything. However, annotations was a todo!()
impl, so it would panic.
This implementation doesn't call the iterator and should be cheaper.
if lazy_value.has_annotations() { | ||
let annotations: Annotations = lazy_value.annotations().try_into()?; | ||
Ok(value.with_annotations(annotations)) | ||
} else { | ||
Ok(value.into()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This only calls annotations()
if has_annotations()
returns true
.
let raw_symbol = match flex_sym.value() { | ||
FlexSymValue::SymbolRef(raw_symbol) => raw_symbol, | ||
FlexSymValue::Opcode(_) => { | ||
todo!("FlexSym escapes in annotation sequences") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are $0
and ''
going to be handled here or is that handled elsewhere so that it falls under the SymbolRef
case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nirosys can comment on his intent, but I believe that's the idea.
// can actually do the reading. That optimization would be impactful for FlexSyms | ||
// that represent inline text. | ||
let (sequence, remaining_input) = match opcode.length_code { | ||
7 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is misleading that we're matching length_code
here, and the length is actually unknown number of bytes or known to be 1 FlexSym
—but the number we're matching is 7. Can we create an issue to rename length_code
to low_nibble
or something else like that? Or, if length_code
is accurate for everything else, then maybe we figure out a way to set the length code to be something representative of the actual length. Alternately, we could just match the whole opcode byte here. Readability would improve if we were matching against E7
, E8
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed. It's length_code
in like ... 80% of cases.
OpcodeType
is (intentionally) a coarse enumeration:
ion-rust/src/lazy/binary/raw/v1_1/type_code.rs
Lines 15 to 55 in 8d1178e
pub enum OpcodeType { | |
EExpressionWithAddress, // 0x00-0x4F - | |
EExpressionAddressFollows, // 0x40-0x4F - | |
Integer, // 0x60-0x68 - Integer up to 8 bytes wide | |
Float, // 0x6A-0x6D - | |
Boolean, // 0x6E-0x6F - | |
Decimal, // 0x70-0x7F - | |
TimestampShort, // 0x80-0x8F - | |
String, // 0x90-0x9F - | |
InlineSymbol, // 0xA0-0xAF - | |
List, // 0xB0-0xBF - | |
SExpression, // 0xC0-0xCF - | |
StructEmpty, // 0xD0 - | |
// 0xD1 reserved | |
Struct, // 0xD2-0xDF - | |
IonVersionMarker, // 0xE0 - | |
SymbolAddress, // 0xE1-0xE3 - | |
AnnotationSymAddress, // 0xE4-0xE6 - | |
AnnotationFlexSym, // 0xE7-0xE9 - | |
NullNull, // 0xEA - | |
TypedNull, // 0xEB - | |
Nop, // 0xEC-0xED - | |
// Reserved | |
SystemMacroInvoke, // 0xEF - | |
// 0xF0 delimited container end | |
// 0xF1 delimited list start | |
// 0xF2 delimited s-expression start | |
// 0xF3 delimited struct start | |
LargeInteger, // 0xF6 - Integer preceded by FlexUInt length | |
Blob, // 0xFE - | |
Clob, // 0xFF - | |
// 0xF8 Long decimal | |
TimestampLong, // 0xF8 - Long-form Timestamp | |
// 0xF9 - Long string | |
// 0xFA - FlexSym symbol | |
// 0xFB - Long list | |
// 0xFC - Long sexp | |
// 0xFD - Long struct | |
Invalid, // Represents an encoded value that does not match a defined opcode. | |
} |
It works well in that questions like "is it a float?" are easily answered without range checks.
I'm going to rename length_code
to low_nibble
, which is always accurate.
EDIT: done in 2cef814
let sequence = EncodedAnnotations { | ||
encoding: AnnotationsEncoding::FlexSym, | ||
header_length: u8::try_from(1 + flex_uint.size_in_bytes()).map_err(|_| { | ||
IonError::decoding_error("found a 256+ byte annotations header") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can FlexUInt
even read a number this large? I thought it was limited to 128 bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't; the current implementation would always raise an error during the FlexUInt
read. Would you like to see this be an unreachable!
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to leave this as-is for the moment. If profiling shows that an unreachable!
or a cold_code!
would help out here I'll change it.
The binary 1.1 reader can now read
FlexSym
-encoded annotations sequences. I have also added some application-level roundtripping unit tests to identify places where the reader and writer may disagree. Because the tests are application-level, the writer generates an LST for several of the tests; quite a bit of the API is being exercised.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.