Data types

Benjamin Jacobs edited this page Apr 12, 2024 · 2 revisions

In mbon, data is made up of items. Each item has two parts: a mark and a value. Unless otherwise specified, an item is always a mark followed by its data.

All grammars in this document are written with the pest grammar language.

Some examples have a high level representation of an MBON file. These are represented with yaml.

Keys are the mark; any options for the mark are put in <>. The value is either a literal or a list, depending on what the body requires.

Here is an example using the following JSON:

[
  "hello",
  5,
  [
    "hello",
    6.23
  ],
  "Hello World",
  [
    5,
    6,
    7
  ]
]

In the high level representation, this becomes:

String: "hello"
u8: 5
List:
- String: "hello"
- f64: 6.23
Array<V=u8>: b"Hello World"
Array<V=u32>:
- 5
- 6
- 7

Several blocks of code are written in Rust-like pseudo-code. They use familiar Rust syntax, but will likely not compile.

Size

Some marks have a size indicator. This indicator is dynamically sized. The indicator is formatted as follows:

The indicator is at least one byte long. The most significant bit of each byte is a continuation bit: if it is 1, there is another byte to read; otherwise the size indicator is finished.

When reading a size indicator, the continuation bit of each byte is dropped and the remaining 7 bits are read as a little-endian unsigned integer, so the first byte holds the least significant 7 bits. A size may not be larger than 64 bits, which takes at most 10 bytes.

Size Grammar

SizeEnd      = { '\x00'..'\x7f' } // 0b0000_0000 through 0b0111_1111
SizeContinue = { '\x80'..'\xff' } // 0b1000_0000 through 0b1111_1111
Size = { SizeContinue ~ Size | SizeEnd }

Examples

Given the data (hex)5a b3 06, we first read 5a, which is (bin)0 1011010. We add 0b1011010 << (0 * 7) to the sum and get 0x5a. The most significant bit is 0, so we are done with a final size of 90; the remaining bytes b3 06 are not part of the size indicator.

Given the data (hex)b3 06, we read b3, which is (bin)1 0110011. We add 0b0110011 << (0 * 7) to the sum and get 0x33. The most significant bit is 1, so we read the next byte 06, which is (bin)0 0000110. We add 0b0000110 << (1 * 7) to the sum and get 0x333. The most significant bit is 0, so we are done with a final size of 819.
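
The reading procedure described above can be sketched in Rust as follows. This is a minimal illustration, not the library's implementation; the function names `decode_size` and `encode_size` are hypothetical, and rejecting indicators that would overflow 64 bits is an assumption about how the 64-bit limit should be enforced.

```rust
/// Decode a size indicator from the front of `data`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends early or the indicator does not fit in 64 bits / 10 bytes.
fn decode_size(data: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in data.iter().enumerate().take(10) {
        let bits = (byte & 0x7f) as u64; // drop the continuation bit
        // The tenth byte may only contribute the single remaining bit.
        if i == 9 && bits > 1 {
            return None;
        }
        value |= bits << (i * 7); // little-endian: first byte is lowest
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Encode `value` as a size indicator, least significant 7 bits first.
fn encode_size(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let bits = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(bits); // last byte: continuation bit clear
            return out;
        }
        out.push(bits | 0x80); // more bytes follow
    }
}
```

Calling `decode_size(&[0x5a, 0xb3, 0x06])` stops after the first byte and yields `(90, 1)`, which matches the first example.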

IDs

Every item has an id, a single byte that identifies its type. An id has up to five parts: the flags B, P and S, the type id T and, for fixed-size types, a two-bit length field (written BB in the diagram below, not to be confused with the flag B).

  • B bit 7: whether there is a body associated with the type.
  • P bit 6: whether the type is publicly available.
  • S bit 5: whether the type has a fixed size.
  • If S == 1
    • T bits 2-4: the type id (which is only unique to each B, P, S combination)
    • BB bits 0-1: the length field; the fixed-size value is 2^BB bytes long.
  • If S == 0
    • T bits 0-4: the type id (which is only unique to each B, P, S combination)

Below is a diagram of how the bits are structured in the id byte as well as some pseudo-code definitions that will be used in type descriptions.

7 6 5 432 10
B P 1 TTT BB
B P 0 TTT TT
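let B = 1u8 << 7; // body flag
let P = 1u8 << 6; // public flag
// Fixed-size ids: sets the S flag (bit 5), puts the type id in bits 2-4 and yields one id per length value in `r`.
let S = |r: Range<u8>, id: u8| r.map(move |v| (1u8 << 5) | ((id << 2) & 0b0001_1100) | (v & 0b0000_0011));
// Variable-size ids: the type id in bits 0-4.
let T = |id: u8| id & 0b0001_1111;
// Number of bytes in a fixed-size value.
let len_b = |id: u8| 2u8.pow((id & 0b0000_0011) as u32);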
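
As a complement to the pseudo-code, here is a small runnable sketch that splits an id byte into the fields from the diagram. The struct `IdParts` and the function `split_id` are hypothetical names chosen for illustration only.

```rust
/// Fields of an id byte, following the bit layout in the diagram.
#[derive(Debug, PartialEq)]
struct IdParts {
    has_body: bool,        // bit 7
    public: bool,          // bit 6
    fixed_size: bool,      // bit 5
    type_id: u8,           // bits 2-4 if fixed_size, otherwise bits 0-4
    fixed_len: Option<u8>, // 2^(bits 0-1) bytes, only when fixed_size
}

fn split_id(id: u8) -> IdParts {
    let fixed_size = id & 0x20 != 0;
    IdParts {
        has_body: id & 0x80 != 0,
        public: id & 0x40 != 0,
        fixed_size,
        type_id: if fixed_size { (id >> 2) & 0b111 } else { id & 0b1_1111 },
        fixed_len: if fixed_size { Some(1 << (id & 0b11)) } else { None },
    }
}
```

For instance, `split_id(0xe2)` reports a public, fixed-size type with a body, type id 0 and a 4-byte value, which is the u32 mark defined below.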

Types

Below are definitions of all the mbon types.

Null

A null data type is represented by the id (hex)40. There is nothing more to the mark.

There is no data associated with the null type.

let id = P | T(0); // 0x40
let len = 0;

Null Grammar

MarkNull = { "\x40" }

Unsigned

The unsigned data type is represented by the ids (hex)E0 E1 E2 E3. There is nothing more to the mark.

The data is a little-endian unsigned integer of len_b(id) bytes.

  • E0: 1-byte (u8)
  • E1: 2-byte (u16)
  • E2: 4-byte (u32)
  • E3: 8-byte (u64)
let id = B | P | S(0..4, 0); // [0xE0, 0xE1, 0xE2, 0xE3]
let len = len_b(id);

Unsigned Grammar

MarkU8  = { "\xe0" }
MarkU16 = { "\xe1" }
MarkU32 = { "\xe2" }
MarkU64 = { "\xe3" }
MarkUnsigned = { MarkU8 | MarkU16 | MarkU32 | MarkU64 }
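
To make the unsigned layout concrete, the following sketch reads the value of an unsigned item given its id and the bytes that follow the mark. The function name `read_unsigned` is hypothetical, and widening every result to `u64` is a simplification made for this example.

```rust
/// Read the data of an unsigned item (ids 0xe0..=0xe3) as a u64.
/// Returns `None` for other ids or when `data` is too short.
fn read_unsigned(id: u8, data: &[u8]) -> Option<u64> {
    if !(0xe0..=0xe3).contains(&id) {
        return None;
    }
    let len = 1usize << (id & 0b11); // len_b(id): 1, 2, 4 or 8 bytes
    let bytes = data.get(..len)?;
    let mut buf = [0u8; 8];
    buf[..len].copy_from_slice(bytes); // zero-extend to 8 bytes
    Some(u64::from_le_bytes(buf))
}
```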

Signed

The signed data type is represented by the ids (hex)E4 E5 E6 E7. There is nothing more to the mark.

The data is a little-endian signed integer of len_b(id) bytes.

  • E4: 1-byte (i8)
  • E5: 2-byte (i16)
  • E6: 4-byte (i32)
  • E7: 8-byte (i64)
let id = B | P | S(0..4, 1); // [0xE4, 0xE5, 0xE6, 0xE7]
let len = len_b(id);

Signed Grammar

MarkI8  = { "\xe4" }
MarkI16 = { "\xe5" }
MarkI32 = { "\xe6" }
MarkI64 = { "\xe7" }
MarkSigned = { MarkI8 | MarkI16 | MarkI32 | MarkI64 }
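
Signed values need sign extension when they are widened. The sketch below assumes two's complement encoding, which the text above does not state explicitly; the function name `read_signed` is hypothetical.

```rust
/// Read the data of a signed item (ids 0xe4..=0xe7) as an i64.
/// Assumes two's complement; returns `None` for other ids or short input.
fn read_signed(id: u8, data: &[u8]) -> Option<i64> {
    if !(0xe4..=0xe7).contains(&id) {
        return None;
    }
    let len = 1usize << (id & 0b11); // len_b(id): 1, 2, 4 or 8 bytes
    let bytes = data.get(..len)?;
    // Pre-fill with the sign byte so the widened value keeps its sign.
    let fill = if bytes[len - 1] & 0x80 != 0 { 0xff } else { 0x00 };
    let mut buf = [fill; 8];
    buf[..len].copy_from_slice(bytes);
    Some(i64::from_le_bytes(buf))
}
```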

Float

The float data type is represented by the ids (hex)EA EB. There is nothing more to the mark.

The data is a little-endian IEEE-754 float of len_b(id) bytes.

  • EA: 4-byte (f32)
  • EB: 8-byte (f64)
let id = B | P | S(2..4, 2); // [0xEA, 0xEB]
let len = len_b(id);

Float Grammar

MarkF32 = { "\xea" }
MarkF64 = { "\xeb" }
MarkFloat = { MarkF32 | MarkF64 }
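
A float item can be read with the standard library's `from_le_bytes` conversions. This is a sketch with a hypothetical function name, `read_float`; returning every value as `f64` is a simplification for the example.

```rust
/// Read the data of a float item (ids 0xea and 0xeb) as an f64.
/// Returns `None` for other ids or when `data` is too short.
fn read_float(id: u8, data: &[u8]) -> Option<f64> {
    match id {
        0xea => {
            let mut buf = [0u8; 4];
            buf.copy_from_slice(data.get(..4)?);
            Some(f32::from_le_bytes(buf) as f64) // widen f32 to f64
        }
        0xeb => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(data.get(..8)?);
            Some(f64::from_le_bytes(buf))
        }
        _ => None,
    }
}
```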

Char

The char data type is represented by the ids (hex)EC ED EE. There is nothing more to the mark.

The data is a little-endian unsigned integer of len_b(id) bytes which represents a Unicode code point.

  • EC: 1-byte (u8 char)
  • ED: 2-byte (u16 char)
  • EE: 4-byte (u32 char)
let id = B | P | S(0..3, 3); // [0xEC, 0xED, 0xEE]
let len = len_b(id);

Char Grammar

MarkC8  = { "\xec" }
MarkC16 = { "\xed" }
MarkC32 = { "\xee" }
MarkChar = { MarkC8 | MarkC16 | MarkC32 }
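
The sketch below reads a char item and validates the code point with `char::from_u32`. Treating surrogates and out-of-range values as errors is an assumption, since the text above does not say how invalid code points are handled; `read_char` is a hypothetical name.

```rust
/// Read the data of a char item (ids 0xec..=0xee) as a Rust `char`.
/// Returns `None` for other ids, short input or an invalid code point.
fn read_char(id: u8, data: &[u8]) -> Option<char> {
    if !(0xec..=0xee).contains(&id) {
        return None;
    }
    let len = 1usize << (id & 0b11); // len_b(id): 1, 2 or 4 bytes
    let mut buf = [0u8; 4];
    buf[..len].copy_from_slice(data.get(..len)?);
    char::from_u32(u32::from_le_bytes(buf))
}
```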

String

A string data type is represented by the id (hex)C0. After the id is a size indicator we will call L.

The data represented by a string is a UTF-8 encoded string of L bytes.

let id = B | P | T(0); // 0xC0
let len = L;

String Grammar

MarkString = { "\xc0" ~ Size }
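
Putting the mark and the data together, a complete string item could be written as in this sketch. The function name `encode_string` is hypothetical; the size indicator is written inline so that the example stands on its own.

```rust
/// Encode a complete string item: id 0xc0, size indicator L, UTF-8 bytes.
fn encode_string(s: &str) -> Vec<u8> {
    let mut out = vec![0xc0];
    // Write L as a size indicator, least significant 7 bits first.
    let mut len = s.len() as u64;
    loop {
        let bits = (len & 0x7f) as u8;
        len >>= 7;
        if len == 0 {
            out.push(bits);
            break;
        }
        out.push(bits | 0x80);
    }
    out.extend_from_slice(s.as_bytes()); // Rust strings are already UTF-8
    out
}
```

With this sketch, `encode_string("hello")` produces the bytes `c0 05 68 65 6c 6c 6f`.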

Array

An array data type is represented by the id (hex)C5. After the id is a recursive mark we will call V. After V is a size indicator we will call N.

The data represented by an array is a sequence of N data items of type V. No marks are required for each sub-item since it has already been defined by V.

Note that all values in the array must be homogeneous. This severely limits what can be used for an array. If an item cannot be stored in an array, then List should be used instead.

let id = B | P | T(5); // 0xC5
let len = data_len(V) * N;

Array Grammar

MarkArray = { "\xc5" ~ Mark ~ Size }
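
As an illustration of the layout, this sketch encodes an array of u32 values, such as the Array<V=u32> from the example at the top of the page. The function name `encode_u32_array` is hypothetical, and the single-byte size limit is a simplification for this example only.

```rust
/// Encode an array item whose element type V is u32 (mark 0xe2).
/// Sketch only: N must fit in a single size-indicator byte.
fn encode_u32_array(values: &[u32]) -> Vec<u8> {
    assert!(values.len() < 128, "sketch only handles single-byte sizes");
    // Mark: array id, then the mark V, then the item count N.
    let mut out = vec![0xc5, 0xe2, values.len() as u8];
    // Data: N values back to back, with no per-item marks.
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

The data portion is `data_len(V) * N` bytes long, so three u32 values take 12 bytes after the 3-byte mark.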

List

A list data type is represented by the id (hex)C6. After the id is a size indicator we will call L.

The data represented by a list is a sequence of items where the total size of all the items adds up to L, i.e. the contents of the list must be exactly L bytes long.

let id = B | P | T(6); // 0xC6
let len = L;

List Grammar

MarkList = { "\xc6" ~ Size }
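
Unlike an array, a list stores complete items, each with its own mark, and L counts their bytes rather than their number. The sketch below wraps already-encoded items in a list; `encode_list` is a hypothetical name and the single-byte size limit is a simplification for this example.

```rust
/// Wrap already-encoded items (each one a mark followed by its data)
/// in a list item. Sketch only: L must fit in one size-indicator byte.
fn encode_list(items: &[Vec<u8>]) -> Vec<u8> {
    // L is the total number of bytes of all items, marks included.
    let total: usize = items.iter().map(|item| item.len()).sum();
    assert!(total < 128, "sketch only handles single-byte sizes");
    let mut out = vec![0xc6, total as u8];
    for item in items {
        out.extend_from_slice(item);
    }
    out
}
```

For a list holding the string "hello" (7 bytes as an item) and an f64 (9 bytes as an item), L is 16.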

Struct

A struct data type is represented by the id (hex)C8. After the id are two size indicators we will call I and L respectively. I represents the struct id, which selects the Struct Definition that describes the body, and L represents the length of the data in bytes.

The data is a sequence of item bodies which is defined by the associated Struct Definition. The data must be exactly the same length as L.

let id = B | P | T(8); // 0xC8
let len = L;

Struct Grammar

MarkStruct = { "\xc8" ~ Size ~ Size }

Struct Definition

A struct definition is a private type that defines the structure of a Struct. The mark is represented by the id (hex)88. After the id are two size indicators we will call I and L respectively. I represents the struct id and L represents the number of bytes in the body.

The data is a sequence of item-mark pairs which define the values and order of the struct.

let id = B | T(8); // 0x88
let len = L;

Example

For example, given the following json:

{
  "a": 5,   
  "b": 3.2, 
  "c": "h"
}

A Struct Definition, followed by the Struct that uses it, would look like:

StructDef<I=1>:
- String: "a" # First element will be a u8 for the key "a"
- u8
- String: "b" # Second element will be an f64 for the key "b"
- f64
- String: "c" # Third element will be a c8 for the key "c"
- c8
Struct<I=1>:
- 5
- 3.2
- 'h'

Struct Definition Grammar

MarkStructDef = { "\x88" ~ Size ~ Size }

Dict

A dict data type is represented by the id (hex)C9. After the id are two marks we will call K and V respectively. After V is a size indicator we will call N.

The data represented by a dict is a sequence of N pairs of K-V data items. No marks are required for each of these items since they have already been defined by K and V. There are a total of N * 2 items in a dict, and each pair is a K followed by a V.

Note that all keys and all values in the dict must be homogeneous. This severely limits what can be used for a dict. If an item cannot be stored in a dict, then Map should be used instead.

let id = B | P | T(9); // 0xC9
let len = (data_len(K) + data_len(V)) * N;

Dict Grammar

MarkDict = { "\xc9" ~ Mark ~ Mark ~ Size }
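
To illustrate the dict layout, the following sketch encodes a dict with u8 keys and u16 values. The function name `encode_dict_u8_u16` and the choice of key and value types are hypothetical, and the single-byte size limit is a simplification for this example.

```rust
/// Encode a dict item with K = u8 (mark 0xe0) and V = u16 (mark 0xe1).
/// Sketch only: N must fit in a single size-indicator byte.
fn encode_dict_u8_u16(pairs: &[(u8, u16)]) -> Vec<u8> {
    assert!(pairs.len() < 128, "sketch only handles single-byte sizes");
    // Mark: dict id, mark K, mark V, then the pair count N.
    let mut out = vec![0xc9, 0xe0, 0xe1, pairs.len() as u8];
    // Data: each pair is the key data followed by the value data, no marks.
    for &(k, v) in pairs {
        out.push(k);
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

The data portion is `(data_len(K) + data_len(V)) * N` bytes long, which is 3 bytes per pair here.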

Map

A map data type is represented by the id (hex)CA. After the id is a size indicator we will call L.

The data represented by a map is a sequence of pairs of items in a key-value structure. There must be an even number of items in a map, and the total length of the data must be equal to L.

let id = B | P | T(10); // 0xCA
let len = L;

Map Grammar

MarkMap = { "\xca" ~ Size }

Enum

The enum data type is represented by the ids (hex)F0 F1 F2. After the id is a recursive mark we will call V.

The data represented by the enum is a little-endian unsigned integer of len_b(id) bytes which represents the variant of the enum. After the variant value is the data of V. No mark is required since V has already been defined.

  • F0: 1-byte (u8 variant)
  • F1: 2-byte (u16 variant)
  • F2: 4-byte (u32 variant)
let id = B | P | S(0..3, 4); // [0xF0, 0xF1, 0xF2]
let len = len_b(id) + data_len(V);

Enum Grammar

MarkE8  = { "\xf0" }
MarkE16 = { "\xf1" }
MarkE32 = { "\xf2" }
MarkEnum = { (MarkE8 | MarkE16 | MarkE32) ~ Mark }
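
The enum layout can be illustrated with a sketch that encodes a u8 variant wrapping a u32 value. The function name `encode_enum_u8_u32` and the choice of variant and value types are hypothetical.

```rust
/// Encode an enum item with a u8 variant (id 0xf0) and V = u32 (mark 0xe2).
fn encode_enum_u8_u32(variant: u8, value: u32) -> Vec<u8> {
    let mut out = vec![0xf0, 0xe2]; // mark: enum id, then the mark V
    out.push(variant); // variant: len_b(0xf0) = 1 byte
    out.extend_from_slice(&value.to_le_bytes()); // data of V, no mark
    out
}
```

The data after the 2-byte mark is `len_b(id) + data_len(V)` bytes long, which is 1 + 4 = 5 bytes here.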

Space

The space type is represented by the id (hex)00. There is nothing more to the mark.

There is no data associated with space.

The space type is used as padding between items if needed. Whenever possible, Padding should be preferred.

let id = T(0); // 0x00
let len = 0;

Space Grammar

MarkSpace = { "\x00" }

Padding

The padding type is represented by the id (hex)80. After the id is a size indicator we will call L.

The data of a padding item is L bytes of unused space. The contents should not be read, since they are considered junk.

let id = B | T(0); // 0x80
let len = L;

Padding Grammar

MarkPadding = { "\x80" ~ Size }

Pointer

The pointer type is represented by the ids (hex)A0 A1 A2 A3. There is nothing else to the mark.

The data of a pointer is a little-endian unsigned integer with len_b(id) bytes which represent a location in the mbon file we will call P. The contents at P must be the start of a valid mbon item.

  • A0: 1-byte (u8 address)
  • A1: 2-byte (u16 address)
  • A2: 4-byte (u32 address)
  • A3: 8-byte (u64 address)
let id = B | S(0..4, 0); // [0xA0, 0xA1, 0xA2, 0xA3]
let len = len_b(id);

Pointer Grammar

MarkP8  = { "\xa0" }
MarkP16 = { "\xa1" }
MarkP32 = { "\xa2" }
MarkP64 = { "\xa3" }
MarkPointer = { MarkP8 | MarkP16 | MarkP32 | MarkP64 }
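
A writer has to pick one of the four pointer widths. The sketch below picks the narrowest one that can hold the address; this policy is an assumption made for illustration, not something the format requires, and `encode_pointer` is a hypothetical name.

```rust
/// Encode a pointer item using the narrowest width that fits `addr`.
/// Choosing the narrowest width is a policy of this sketch, not a rule.
fn encode_pointer(addr: u64) -> Vec<u8> {
    let mut out = Vec::new();
    if addr <= u8::MAX as u64 {
        out.push(0xa0);
        out.push(addr as u8);
    } else if addr <= u16::MAX as u64 {
        out.push(0xa1);
        out.extend_from_slice(&(addr as u16).to_le_bytes());
    } else if addr <= u32::MAX as u64 {
        out.push(0xa2);
        out.extend_from_slice(&(addr as u32).to_le_bytes());
    } else {
        out.push(0xa3);
        out.extend_from_slice(&addr.to_le_bytes());
    }
    out
}
```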

Rc

The rc type is represented by the ids (hex)A4 A5 A6 A7. After the id is a mark we will call V.

The data of an rc is a little-endian unsigned integer with len_b(id) bytes that represents the number of references to this item. After which is the data for V. No mark is required since V has already been defined.

Rc's should always be used alongside Pointers. They should be treated like an invisible box most of the time; only when doing pointer operations should rc's be considered.

  • A4: 1-byte (u8 reference count)
  • A5: 2-byte (u16 reference count)
  • A6: 4-byte (u32 reference count)
  • A7: 8-byte (u64 reference count)
let id = B | S(0..4, 1); // [0xA4, 0xA5, 0xA6, 0xA7]
let len = len_b(id) + data_len(V);

Rc Grammar

MarkR8  = { "\xa4" }
MarkR16 = { "\xa5" }
MarkR32 = { "\xa6" }
MarkR64 = { "\xa7" }
MarkRc = { (MarkR8 | MarkR16 | MarkR32 | MarkR64) ~ Mark }

Heap

The heap type is represented by the id (hex)81. After the id is a size indicator we will call L.

The data of the heap is a sequence of items where the total size of all the items adds up to L.

The contents of the heap are hidden from the user; in other words, the heap should be treated like padding, but with valid data inside. The only way the user can access items in the heap is through Pointers. The heap should be a root level item of the mbon file.

let id = B | T(1); // 0x81
let len = L;

Heap Grammar

MarkHeap = { "\x81" ~ Size }

Full Mark Grammar

Below is a comprehensive grammar for marks in the mbon format.

SizeEnd      = { '\x00'..'\x7f' } // 0b0......
SizeContinue = { '\x80'..'\xff' } // 0b1......
Size = { SizeContinue ~ Size | SizeEnd }

Mark = {
        MarkNull 
      | MarkUnsigned | MarkSigned | MarkFloat 
      | MarkChar     | MarkString 
      | MarkArray    | MarkList 
      | MarkStruct   | MarkStructDef | MarkDict | MarkMap 
      | MarkEnum 
      | MarkSpace    | MarkPadding 
      | MarkPointer  | MarkRc     | MarkHeap
}

MarkNull = { "\x40" }

MarkU8       = { "\xe0" }
MarkU16      = { "\xe1" }
MarkU32      = { "\xe2" }
MarkU64      = { "\xe3" }
MarkUnsigned = { MarkU8 | MarkU16 | MarkU32 | MarkU64 }

MarkI8     = { "\xe4" }
MarkI16    = { "\xe5" }
MarkI32    = { "\xe6" }
MarkI64    = { "\xe7" }
MarkSigned = { MarkI8 | MarkI16 | MarkI32 | MarkI64 }

MarkF32   = { "\xea" }
MarkF64   = { "\xeb" }
MarkFloat = { MarkF32 | MarkF64 }

MarkC8   = { "\xec" }
MarkC16  = { "\xed" }
MarkC32  = { "\xee" }
MarkChar = { MarkC8 | MarkC16 | MarkC32 }

MarkString = { "\xc0" ~ Size }

MarkArray  = { "\xc5" ~ Mark ~ Size }
MarkList   = { "\xc6" ~ Size }

MarkStruct    = { "\xc8" ~ Size ~ Size }
MarkStructDef = { "\x88" ~ Size ~ Size }
MarkDict      = { "\xc9" ~ Mark ~ Mark ~ Size }
MarkMap       = { "\xca" ~ Size }

MarkE8   = { "\xf0" }
MarkE16  = { "\xf1" }
MarkE32  = { "\xf2" }
MarkEnum = { (MarkE8 | MarkE16 | MarkE32) ~ Mark }

MarkSpace   = { "\x00" }

MarkPadding = { "\x80" ~ Size }

MarkP8      = { "\xa0" }
MarkP16     = { "\xa1" }
MarkP32     = { "\xa2" }
MarkP64     = { "\xa3" }
MarkPointer = { MarkP8 | MarkP16 | MarkP32 | MarkP64 }

MarkR8  = { "\xa4" }
MarkR16 = { "\xa5" }
MarkR32 = { "\xa6" }
MarkR64 = { "\xa7" }
MarkRc  = { (MarkR8 | MarkR16 | MarkR32 | MarkR64) ~ Mark }

MarkHeap = { "\x81" ~ Size }