Idk why I am doing this. But this "guide" shows how to serialize js values into their v8 binary representation and how to deserialize the v8 binary representation back to a js value used by the V8 Engine.
The format always starts with 2 header bytes 0xFF 0x0F
Then it will use a
indicator byte to tell the deserializer on how to deserialize the next section
of bytes. None of the format examples include the 2 header bytes but these are
needed at the beginning of the serialized data. The reference serializers and
deserializers don't include these bytes. But the serialize
and deserialize
api's in references/mod.ts
do add these bytes
Primitive values are the following values:
Null is also included in this list eventho it's technically a object but works like a primitive. The difference between a primitive and a object is for example how they're passed as function argument. It is handy to know the difference because objects have a few quirks that primitives don't. But we'll get into that later in the Object Types section.
V8 has 3 types of strings. Utf8String
, OneByteString
and TwoByteString
.
The first one I have no idea what it is used for.
The One byte string is the most common one and is used in places where every character in the string is part of the extended ascii table (0x00 up to 0xFF)
The Two Byte string is used for characters that need 2 bytes to be represented. This would include character sets like arabic and emoji's
All format's all start with a type indicator and then a varint encoded length and then the raw data
Seems to only be used internally. (maybe found with JIT compiled functions. Needs triage)
One byte strings start off with a "
(0x22) to indicate the string datatype.
Then uses a LEB128 encoded varint to indicate the length of the raw data of the
string and then the raw data.
serializing a string like HelloWorld
this will look like this.
0x22 0x0A String indicator byte and varint encoded length
0x48 0x65 He
0x6C 0x6C ll
0x6F 0x57 oW
0x6F 0x72 or
0x6C 0x64 ld
Two byte strings start off with a c
(0x63) to indicate the two byte string
datatype. Then uses a LEB128 encoded varint to indicate the length of the raw
data of the string and then the raw data.
This format is used for when we have characters that need to be represented as multiple bytes. Like emoji's or non-latin languages like arabic
serializing Hi!😃
0x63 0x0A String indicator byte and varint encoded length
0x48 0x00 H (UTF-8 characters don't need a second byte. Therefore null byte)
0x69 0x00 i
0x21 0x00 !
0x3D 0xD8 =Ø (these 4 bytes are the emoji)
0x03 0xDE 0x03 Þ
V8 has 2 integer formats one is unsigned integer and the other one signed. It appears that usually signed integers are used even when unsigned integers can be used.
I am not entirely sure when unsigned integers are used. Signed integers are stored as SMI's. These are 30 bit integers so where are the rest? The other 2 bits are used as SMI flag internally and the other as sign bit. This does not mean that we are limited to 32 bits though. If the value is outside of the SMI range (-1_073_741_824 - 1_073_741_823) then it will use a float to store the int value
The integer format of v8 is quite simple but confusing at first. It uses varint encoding for all signed integers. However it does not use the LEB128 signed varint. Instead it uses the zigzag algorithm used in protobufs.
some examples
Negative Integer -12
0x49 0x17 Indicator byte then zigzag encoded + varint encoded value
Positive Integer 12
0x49 0x18 Indicator byte then zigzag encoded + varint encoded value
Eventho I have not found out where v8 uses this. It is essentially the same as
the signed int format but a different indicator byte U
(0x) and the value does
not get zig-zag encoded unlike the signed integers.
Do note that this format is not compatible with the Signed integer format
because positive integers in that format also get zigzag encoded.
Selfnotes:
- Might be found when using JIT compiled functions.
- Might be used when SMI's are at hand.
The bigint format does not have multiple variants and only has one. It has a
indicator byte which is Z
(0x5A) and after that a varint bitfield specifying
how many u64 integers are used for the bigint and if the value is positive or
negative. It is alot more complex than either a float or integer because it has
a unknown size
Negative BigInt -12
0x5A 0x11 BigInt indicator byte and Varint bitfield.
0x0C 0x00 Bigint value...
0x00 0x00
0x00 0x00
0x00
Positive BigInt 12
0x5A 0x10 BigInt indicator byte and Varint bitfield.
0x0C 0x00 Bigint value...
0x00 0x00
0x00 0x00
0x00
As you can see both are the same and the only difference is the bitfield. In the negative it changed the LSB (Least significant byte) from a 0 to a 1 making it a negative value
The float format is like the integer format, quite easy to understand. It's a
little bit simpler because we don't deal with a variable size of bytes. The
float format is by far the easiest one to understand. You only have an indicator
byte N
(0x4E) and then the float value as 64 bit float (or double)
0x4E 0x00 Indicator byte and first byte of the 64 bit float
0x00 0x00 bytes of the float(12.69)
0x00 0x00
0x00 0x29
0x40
Booleans are the easiest format to serialize. The indicator byte is both the
value and type. F
(0x46) and T
(0x54) Are the boolean type indicators where
F
(0x46) is false and T
(0x54) is true
False
0x46 Indicator byte False
True
0x54 Indicator byte True
The null format is effectively the same as the boolean format. Just that the
indicator byte changed to 0
(0x30).
0x30 Indicator byte Null
Let's repeat the easiest format once again. For undefined
the format is still
the same as null and booleans but the byte changed once again to _
(0x5F)
0x5F Indicator byte Undefined
Referable types are a lot more complex than meets the eye. It looks fairly simple until you realise that javascript is a quirky language that allows things like associative arrays and arrays where there are empty elements or referencing a object in itself creating recursive references. This makes (de)serializing referable types a lot more complex than a simple primitive. For external use of the v8 format the following referable types work with it.
- Object References
- Array
- Object literals
- User created classes
- ArrayBuffer
- Uint*Array (where * is 8, 16, 32 or 64)
- Int*Array (where * is 8, 16, 32 or 64)
- Map
- Set
- Date
- Regex
- Error
These do not work outside of v8 or v8 bindings due to them serializing into a ID that v8 holds the value for.
- SharedArrayBuffer (not usable outside v8)
- WebAssembly.Module (not usable outside v8)
- WebAssembly.Memory (not usable outside v8) (only serializable when
shared = true
)
Object's are used for complex data structures. But it would be a waste of space
and time to serialize the exact same object multiple time. This is what a object
reference is used for. It is indicated by ^
(0x5E)and the best way to explain
how a reference works is with a example. So let's say we got a object with 2
keys and their value is the same object literal like here below
const innerObject = {};
const object = {
key1: innerObject,
key2: innerObject,
};
Then what happens is that object
get's serialized as key value pairs. key1
will be serialized as a string and the value here as a empty object. Then key2
will also be serialized as a string but it's value will be serialized as a
reference to a object serialized earlier. Do note that this is only for this
specific example because both values reference the same object. If the object
has 2 identical objects like 2x {}
then it will not use a reference because
they're not the same object but rather 2 individuals that have the same inner
values and structure.
Arrays are a lot more complex than meets the eye. They can do quite a few quirky
things like indexing then with strings myArray["HelloWorld"] = 12;
. But they
also allow empty slots. An empty slot is unique. It's not filled with anything
like null
or undefined
(it does map to undefined) but there is nothing.
Which we need to keep in mind when (de)serializing an array. We effectively got
4 types of arrays to keep in mind (you could argue that there are 5. Which is
only a associated array without any regular slots. But this would be the same as
a dense associated array)
- Dense Array (all allocated slots are occupied)
- Sparse Array (some allocated slots are not used)
- Dense Associated Array (all allocated slots are occupied & some values are indexed by strings)
- Sparse Associated Array (some allocated slots are not used & some values are indexed by strings)
Dense Array [null, null]
0x41 0x02 Dense array indicator byte + varint encoded array length
0x30 0x30 null null
0x24 0x00 Ending byte + varint encoded kv pair length
0x02 Varint encoded slot count (array length)
Sparse Array [null, ,null]
0x61 0x03 Sparse array indicator byte + varint encoded array length
0x49 0x00 Integer indicator byte + varint encoded index
0x30 0x49 null + Integer indicator byte
0x04 0x30 varint encoded index + null
0x40 0x02 Ending byte + varint encoded kv pairs length
0x03 Varint encoded slot count (array length)
Dense Associated Array const arr = [null, null]; arr["k"] = null;
0x41 0x02 Dense array indicator byte + varint encoded array length
0x30 0x30 null null
0x22 0x01 string indicator byte + varint encoded string length
0x6B 0x30 key "k" + null
0x24 0x01 Ending byte + varint encoded kv pair length
0x02 Varint encoded slot count (array length)
Sparse Array const arr = [null, ,null]; arr["k"] = null;
0x61 0x03 Sparse array indicator byte + varint encoded array length
0x49 0x00 Integer indicator byte + varint encoded index
0x30 0x49 null + Integer indicator byte
0x04 0x30 varint encoded index + null
0x22 0x01 string indicator byte + varint encoded string length
0x6B 0x30 key "k" + null
0x40 0x03 Ending byte + varint encoded kv pairs length
0x03 Varint encoded slot count (array length)
Classes and plain objects are the same to v8's binary format. They both have the
indicator byte 0x6F o
and ending byte 0x7B {
. They're both just key value
pairs that are kinda similar to associative arrays. The difference being that
each value has a key. Which is not always the case with associative arrays (if
they're dense for example). In all other ways they're pretty much the same. One
thing that is special about objects is that string keys that can be integers
will be stored as integers. so an object like this { "12": null }
will have 12
as a integer rather than a string
empty object {}
0x6F 0x7B Object Indicator byte `o` & Ending byte `{`
0x00 Varint encoded kv pair count
object with string keys { k: null }
0x6F 0x22 Object indicator byte `o` & string indicator byte `"`
0x01 0x6B Varint encoded string length & byte `0x6B` key `k`
0x30 0x7B Value null & ending byte
0x01 Varint encoded kv pair count
object with integer keys { 12: null, "13": null }
0x6F 0x49 Object indicator byte `o` & signed int indicator byte `I`
0x18 0x30 Varint encoded integer as key and null as value
0x49 0x1A Signed int indicator byte `I` & integer as key
0x30 0x7B Null as value & ending byte
0x02 Varint encoded kv pair count
object with string & integer keys { k: null, 12: null, "13": null }
0x6F 0x49 Object indicator byte `o` & signed int indicator byte `I`
0x18 0x30 Varint encoded integer as key and null as value
0x49 0x1A Signed int indicator byte `I` & integer as key
0x30 0x22 Null as value & string indicator byte
0x01 0x6B Varint encoded string length & byte `k`
0x30 0x7B Null as value & ending byte
0x03 Varint encoded kv pair count
ArrayBuffer
is the raw datastore for typed array's like Uint8Array
. You
cannot interact with it directly. But only via DataView
or a typed array. It's
indicator byte is 0x42 B
empty ArrayBuffer(4)
0x42 0x04 ArrayBuffer indicator byte `B` & varint encoded length
0x00 0x00 Bytes in the ArrayBuffer
0x00 0x00 Bytes in the ArrayBuffer
ArrayBuffer with content [1,2,3,4]
0x42 0x04 ArrayBuffer indicator byte `B` & varint encoded length
0x01 0x02 Bytes in de ArrayBuffer
0x03 0x04 Bytes in de ArrayBuffer