Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788

zslayton · 2024-06-07T20:57:18Z

The binary 1.1 reader can now read FlexSym-encoded annotations sequences. I have also added some application-level roundtripping unit tests to identify places where the reader and writer may disagree. Because the tests are application-level, the writer generates an LST for several of the tests; quite a bit of the API is being exercised.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…okes `annotations`

zslayton

🗺️ PR tour 🧭

zslayton · 2024-06-07T20:59:26Z

src/lazy/binary/immutable_buffer.rs

@@ -745,7 +747,7 @@ impl<'a> ImmutableBuffer<'a> {
        }

        lazy_value.encoded_value.annotations_header_length = wrapper.header_length;
-        lazy_value.encoded_value.annotations_sequence_length = wrapper.sequence_length;
+        lazy_value.encoded_value.annotations_sequence_length = wrapper.sequence_length as u16;


🗺️ I widened the annotations sequence length from a u8 to a u16 since the annotations in v1.1 may have inline text.

zslayton · 2024-06-07T21:01:28Z

src/lazy/binary/raw/v1_1/immutable_buffer.rs

-    pub fn read_annotations_wrapper(&self, _opcode: Opcode) -> ParseResult<'a, AnnotationsWrapper> {
-        todo!();
-    }
-


🗺️ I removed this stub (that I added a while ago) because v1.1 annotations don't involve a 'wrapper' like v1.0 annotations.

zslayton · 2024-06-07T21:02:51Z

src/lazy/binary/raw/v1_1/immutable_buffer.rs

+        value.encoded_value.annotations_header_length = annotations_seq.header_length;
+        value.encoded_value.annotations_sequence_length = annotations_seq.sequence_length;
+        value.encoded_value.annotations_encoding = annotations_seq.encoding;
+        value.encoded_value.total_length +=
+            annotations_seq.header_length as usize + annotations_seq.sequence_length as usize;
+        // Rewind the input to include the annotations sequence
+        value.input = self;


🗺️ After reading the value, this method goes back and overwrites all of the annotations-related or -affected fields.

zslayton · 2024-06-07T21:06:31Z

src/lazy/binary/raw/v1_1/reader.rs

+    #[case("2023-10-15T05:04Z",                   &[0x83, 0x35, 0x7D, 0x85, 0x00])]
+    #[case("2023-10-15T05:04:03Z",                &[0x84, 0x35, 0x7D, 0x85, 0x30, 0x00])]
+    #[case("2023-10-15T05:04:03.123-00:00",       &[0x85, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01])]
+    #[case("2023-10-15T05:04:03.000123-00:00",    &[0x86, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01, 0x00])]
+    #[case("2023-10-15T05:04:03.000000123-00:00", &[0x87, 0x35, 0x7D, 0x85, 0x38, 0xEC, 0x01, 0x00, 0x00])]
+    #[case("2023-10-15T05:04+01:00",              &[0x88, 0x35, 0x7D, 0x85, 0xE0, 0x01])]
+    #[case("2023-10-15T05:04-01:00",              &[0x88, 0x35, 0x7D, 0x85, 0xA0, 0x01])]
+    #[case("2023-10-15T05:04:03+01:00",           &[0x89, 0x35, 0x7D, 0x85, 0xE0, 0x0D])]
+    #[case("2023-10-15T05:04:03.123+01:00",       &[0x8A, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00])]
+    #[case("2023-10-15T05:04:03.000123+01:00",    &[0x8B, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00, 0x00])]
+    #[case("2023-10-15T05:04:03.000000123+01:00", &[0x8C, 0x35, 0x7D, 0x85, 0xE0, 0x0D, 0x7B, 0x00, 0x00, 0x00])]


🗺️ The reader was treating the offset bits in the encoding as 2s complement bits; however, they are an unsigned integer representing the number of quarter hours from -56 (offset -14:00). (Relevant spec entry.)

zslayton · 2024-06-07T21:07:44Z

src/lazy/binary/raw/v1_1/value.rs

@@ -266,7 +255,7 @@ impl<'top> LazyRawBinaryValue_1_1<'top> {
            (OpcodeType::Integer, 0x0) => 0.into(),
            (OpcodeType::Integer, n) => {
                // We have n bytes following that make up our integer.
-                self.input.consume(1).read_fixed_int(n)?.0.into()
+                self.available_body().read_fixed_int(n)?.0.into()


🗺️ Adding annotations support meant changing a few usages of self.input because that buffer contains the annotations sequence (previously empty) and the value itself.

zslayton · 2024-06-07T21:24:26Z

src/lazy/encoder/binary/v1_1/value_writer.rs

-                None => (1, 1, 0, 0), // Unknown offset uses a single bit (1); opcode and length stay the same.
-                Some(0) => (1, 0, 0, 0), // UTC uses a single bit (0); opcode and length stay the same.
+                None => (1, 0, 0, 0), // Unknown offset uses a single bit (0); opcode and length stay the same.
+                Some(0) => (1, 1, 0, 0), // UTC uses a single bit (1); opcode and length stay the same.


🗺️ We only recently defined which offset bit value meant UTC and which meant unknown. I implemented this code long ago and guessed wrong. 🙃

zslayton · 2024-06-07T21:24:49Z

src/lazy/encoder/binary/v1_1/value_writer.rs

-                &[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_0000],
+                &[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_1000],


🗺️ Updating all of the timestamp encoding tests to use UTC=1, Unknown=0.

zslayton · 2024-06-07T21:26:42Z

src/lazy/encoder/binary/v1_1/value_writer.rs

@@ -2797,4 +2800,170 @@ mod tests {
        )?;
        Ok(())
    }
+


🗺️ This section adds our first application-level roundtrip testing for binary 1.1. Many of the tests involve symbols/annotations/field names which (for the moment) are encoded as symbol IDs. This means an LST is written and read in addition to the values in the text. It exercises a good chunk of the APIs.

zslayton · 2024-06-07T21:28:00Z

src/lazy/expanded/mod.rs

+    pub fn has_annotations(&self) -> bool {
+        use ExpandedValueSource::*;
+        match &self.source {
+            ValueLiteral(value) => value.has_annotations(),
+            Template(_, element) => !element.annotations().is_empty(),
+            Constructed(annotations, _) => !annotations.is_empty(),
+        }
+    }


🗺️ When I started implementing this, LazyValue's has_annotations (a bit further below) would call annotations() and see if it produced anything. However, annotations was a todo!() impl, so it would panic.

This implementation doesn't call the iterator and should be cheaper.

zslayton · 2024-06-07T21:28:40Z

src/lazy/value.rs

+        if lazy_value.has_annotations() {
+            let annotations: Annotations = lazy_value.annotations().try_into()?;
+            Ok(value.with_annotations(annotations))
+        } else {
+            Ok(value.into())
+        }


🗺️ This only calls annotations() if has_annotations() returns true.

popematt · 2024-06-07T23:24:57Z

src/lazy/binary/raw/v1_1/annotations_iterator.rs

+                let raw_symbol = match flex_sym.value() {
+                    FlexSymValue::SymbolRef(raw_symbol) => raw_symbol,
+                    FlexSymValue::Opcode(_) => {
+                        todo!("FlexSym escapes in annotation sequences")


Are $0 and '' going to be handled here or is that handled elsewhere so that it falls under the SymbolRef case?

@nirosys can comment on his intent, but I believe that's the idea.

popematt · 2024-06-07T23:30:37Z

src/lazy/binary/raw/v1_1/immutable_buffer.rs

+        //       can actually do the reading. That optimization would be impactful for FlexSyms
+        //       that represent inline text.
+        let (sequence, remaining_input) = match opcode.length_code {
+            7 => {


This is misleading that we're matching length_code here, and the length is actually unknown number of bytes or known to be 1 FlexSym—but the number we're matching is 7. Can we create an issue to rename length_code to low_nibble or something else like that? Or, if length_code is accurate for everything else, then maybe we figure out a way to set the length code to be something representative of the actual length. Alternately, we could just match the whole opcode byte here. Readability would improve if we were matching against E7, E8, etc.

Yeah, agreed. It's length_code in like ... 80% of cases.

OpcodeType is (intentionally) a coarse enumeration:

ion-rust/src/lazy/binary/raw/v1_1/type_code.rs

Lines 15 to 55 in 8d1178e

pub enum OpcodeType {

EExpressionWithAddress, // 0x00-0x4F -

EExpressionAddressFollows, // 0x40-0x4F -

Integer, // 0x60-0x68 - Integer up to 8 bytes wide

Float, // 0x6A-0x6D -

Boolean, // 0x6E-0x6F -

Decimal, // 0x70-0x7F -

TimestampShort, // 0x80-0x8F -

String, // 0x90-0x9F -

InlineSymbol, // 0xA0-0xAF -

List, // 0xB0-0xBF -

SExpression, // 0xC0-0xCF -

StructEmpty, // 0xD0 -

// 0xD1 reserved

Struct, // 0xD2-0xDF -

IonVersionMarker, // 0xE0 -

SymbolAddress, // 0xE1-0xE3 -

AnnotationSymAddress, // 0xE4-0xE6 -

AnnotationFlexSym, // 0xE7-0xE9 -

NullNull, // 0xEA -

TypedNull, // 0xEB -

Nop, // 0xEC-0xED -

// Reserved

SystemMacroInvoke, // 0xEF -

// 0xF0 delimited container end

// 0xF1 delimited list start

// 0xF2 delimited s-expression start

// 0xF3 delimited struct start

LargeInteger, // 0xF6 - Integer preceded by FlexUInt length

Blob, // 0xFE -

Clob, // 0xFF -

// 0xF8 Long decimal

TimestampLong, // 0xF8 - Long-form Timestamp

// 0xF9 - Long string

// 0xFA - FlexSym symbol

// 0xFB - Long list

// 0xFC - Long sexp

// 0xFD - Long struct

Invalid, // Represents an encoded value that does not match a defined opcode.

}

It works well in that questions like "is it a float?" are easily answered without range checks.

I'm going to rename length_code to low_nibble, which is always accurate.

EDIT: done in 2cef814

popematt · 2024-06-07T23:36:47Z

src/lazy/binary/raw/v1_1/immutable_buffer.rs

+                let sequence = EncodedAnnotations {
+                    encoding: AnnotationsEncoding::FlexSym,
+                    header_length: u8::try_from(1 + flex_uint.size_in_bytes()).map_err(|_| {
+                        IonError::decoding_error("found a 256+ byte annotations header")


Can FlexUInt even read a number this large? I thought it was limited to 128 bits.

It can't; the current implementation would always raise an error during the FlexUInt read. Would you like to see this be an unreachable! instead?

I'm going to leave this as-is for the moment. If profiling shows that an unreachable! or a cold_code! would help out here I'll change it.

zslayton added 3 commits June 7, 2024 10:46

Adds binary 1.1 roundtrip unit tests, has_annotations no longer inv…

c78ae2f

…okes `annotations`

Implements reading binary FlexSym annotations

344826c

Fixes timestamp decoding of offsets

8b08c4d

zslayton requested a review from nirosys June 7, 2024 20:57

zslayton added 3 commits June 7, 2024 16:57

cargo fmt

b8b9e1d

1.1 binary writer now uses 1 for UTC and 0 for unknown

bfa0f03

fixed typo

b71b6e3

zslayton commented Jun 7, 2024

View reviewed changes

zslayton requested a review from popematt June 7, 2024 21:28

clippy suggestion

9470b5b

popematt reviewed Jun 7, 2024

View reviewed changes

zslayton added 2 commits June 8, 2024 10:41

rename length_code to low_nibble

2cef814

adds low_nibble accessor

d564826

zslayton requested a review from popematt June 8, 2024 14:55

popematt approved these changes Jun 10, 2024

View reviewed changes

zslayton merged commit 81e5fd9 into main Jun 10, 2024
32 checks passed

zslayton deleted the binary-1_1-roundtrip branch June 10, 2024 18:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788

Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788

zslayton commented Jun 7, 2024

zslayton left a comment

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

zslayton Jun 7, 2024

popematt Jun 7, 2024

zslayton Jun 8, 2024 •

edited

Loading

popematt Jun 7, 2024

zslayton Jun 8, 2024 •

edited

Loading

popematt Jun 7, 2024

zslayton Jun 8, 2024

zslayton Jun 8, 2024

		&[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_0000],
		&[0x83, 0b0011_0110, 0b0000_1011, 0b0000_1000, 0b0000_1000],

@@ @@ -2797,4 +2800,170 @@ mod tests { @@
                       )?;
                       Ok(())
                   }

	pub enum OpcodeType {
	EExpressionWithAddress, // 0x00-0x4F -
	EExpressionAddressFollows, // 0x40-0x4F -
	Integer, // 0x60-0x68 - Integer up to 8 bytes wide
	Float, // 0x6A-0x6D -
	Boolean, // 0x6E-0x6F -
	Decimal, // 0x70-0x7F -
	TimestampShort, // 0x80-0x8F -
	String, // 0x90-0x9F -
	InlineSymbol, // 0xA0-0xAF -
	List, // 0xB0-0xBF -
	SExpression, // 0xC0-0xCF -
	StructEmpty, // 0xD0 -
	// 0xD1 reserved
	Struct, // 0xD2-0xDF -
	IonVersionMarker, // 0xE0 -

	SymbolAddress, // 0xE1-0xE3 -
	AnnotationSymAddress, // 0xE4-0xE6 -
	AnnotationFlexSym, // 0xE7-0xE9 -
	NullNull, // 0xEA -
	TypedNull, // 0xEB -
	Nop, // 0xEC-0xED -
	// Reserved
	SystemMacroInvoke, // 0xEF -
	// 0xF0 delimited container end
	// 0xF1 delimited list start
	// 0xF2 delimited s-expression start
	// 0xF3 delimited struct start
	LargeInteger, // 0xF6 - Integer preceded by FlexUInt length
	Blob, // 0xFE -
	Clob, // 0xFF -
	// 0xF8 Long decimal
	TimestampLong, // 0xF8 - Long-form Timestamp
	// 0xF9 - Long string
	// 0xFA - FlexSym symbol
	// 0xFB - Long list
	// 0xFC - Long sexp
	// 0xFD - Long struct
	Invalid, // Represents an encoded value that does not match a defined opcode.
	}

Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788

Implements flexsym annotation reading, adds binary 1.1 roundtrip unit tests #788

Conversation

zslayton commented Jun 7, 2024

zslayton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zslayton Jun 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zslayton Jun 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zslayton Jun 8, 2024 •

edited

Loading

zslayton Jun 8, 2024 •

edited

Loading