[Doc][Spec] fury java spec (apache#1238)
chaokunyang committed Dec 20, 2023
1 parent 8e1dc5c commit fd67af5
Showing 1 changed file with 197 additions and 36 deletions.
233 changes: 197 additions & 36 deletions docs/protocols/java_object_graph.md
@@ -1,98 +1,259 @@
# Fury Java Serialization Specification

## Spec overview

Data is serialized in little-endian byte order overall. If byte swapping is costly, the byte order will be encoded as a
flag in the data.

The overall format is:

```
| fury header | object ref meta | object class meta | object value data |
```

## Fury header

The Fury header starts with one byte:

```
| reserved 4 bits | oob | xlang | endian | null |
```

- null flag: set when the object is null, unset otherwise. If the object is null, the other bits won't be set.
- endian flag: set when the system uses little endian, unset otherwise.
- xlang flag: set when serialization uses the xlang format, unset when serialization uses the Fury Java format.
- oob flag: set when the passed `BufferCallback` is not null, unset otherwise.

If meta share mode is enabled, an uncompressed little-endian 4-byte value is appended to indicate the start offset of
the meta data.
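
The header byte described above can be sketched as follows. This is a minimal illustration, not Fury's actual implementation: it assumes the rightmost field in the layout (`null`) is the lowest bit, and the class name, constants, and method signature are hypothetical.

```java
// Hypothetical sketch of the one-byte Fury header; names are illustrative only.
final class FuryHeaderSketch {
    // Assumption: fields are listed from the highest bit to the lowest bit.
    static final int NULL_FLAG = 1;
    static final int LITTLE_ENDIAN_FLAG = 1 << 1;
    static final int XLANG_FLAG = 1 << 2;
    static final int OOB_FLAG = 1 << 3;

    static byte header(Object obj, boolean littleEndian, boolean xlang, boolean hasBufferCallback) {
        if (obj == null) {
            // When the object is null, no other bit is set.
            return (byte) NULL_FLAG;
        }
        int h = 0;
        if (littleEndian) h |= LITTLE_ENDIAN_FLAG;
        if (xlang) h |= XLANG_FLAG;
        if (hasBufferCallback) h |= OOB_FLAG;
        return (byte) h;
    }
}
```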

## Reference Meta

Reference tracking handles whether the object is null and whether to track references for the object, by writing
the corresponding flags and maintaining internal state.

Reference flags:

| Flag                | Byte Value | Description                                                                                                          |
|---------------------|------------|----------------------------------------------------------------------------------------------------------------------|
| NULL FLAG           | `-3`       | This flag indicates that the object is a null value. We don't use another byte to indicate REF, which saves one byte. |
| REF FLAG            | `-2`       | This flag indicates that the object was written before; Fury writes an unsigned ref id instead of serializing it again. |
| NOT_NULL VALUE FLAG | `-1`       | This flag indicates that the object is a non-null value and Fury doesn't track refs for this type of object.         |
| REF VALUE FLAG      | `0`        | This flag indicates that the object is referenceable and is being serialized for the first time.                     |

When reference tracking is disabled globally, for some type, or for some type in some context such as a
field of a class, only `NULL FLAG` and `NOT_NULL VALUE FLAG` will be used.
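
A minimal sketch of how a writer might choose among these flags. The flag values come from the table above; the class, the `out` list standing in for the output buffer, and the sequential ref-id scheme are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Fury's actual reference resolver.
final class RefFlagSketch {
    static final byte NULL_FLAG = -3;
    static final byte REF_FLAG = -2;
    static final byte NOT_NULL_VALUE_FLAG = -1;
    static final byte REF_VALUE_FLAG = 0;

    // Identity-based map: two equal but distinct objects are different references.
    private final Map<Object, Integer> writtenIds = new IdentityHashMap<>();
    // Stand-in for the output buffer, so the written values are easy to inspect.
    final List<Integer> out = new ArrayList<>();

    /** Returns true if the caller still has to serialize the object's value. */
    boolean writeRefOrNull(Object obj, boolean trackRef) {
        if (obj == null) {
            out.add((int) NULL_FLAG);
            return false;
        }
        if (!trackRef) {
            out.add((int) NOT_NULL_VALUE_FLAG);
            return true;
        }
        Integer id = writtenIds.get(obj);
        if (id != null) {
            // Already written: emit the flag plus the ref id (assumed sequential).
            out.add((int) REF_FLAG);
            out.add(id);
            return false;
        }
        writtenIds.put(obj, writtenIds.size());
        out.add((int) REF_VALUE_FLAG);
        return true;
    }
}
```

Calling `writeRefOrNull` twice with the same object and `trackRef = true` emits `REF VALUE FLAG` the first time and `REF FLAG` followed by the id the second time.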

## Class Meta

Depending on whether meta share mode is enabled, Fury will write class meta differently.

### Schema consistent

If schema consistent mode is enabled globally or enabled for the current class, class meta will be written as follows:

- If the class is registered, it will be written as a little-endian unsigned int: `class_id << 1`, using the Fury
  unsigned int format.
- If the class is not registered, Fury will write one byte `0b1` first. Its lowest bit differs from the lowest bit of an
  encoded class id, which is `0`, so Fury can use this bit to determine whether to read the class by class id.
  - If meta share mode is enabled, the class will be written as an unsigned int.
  - If meta share mode is not enabled, the class will be written as two enumerated strings:
    - package name.
    - class name.
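
The lowest-bit convention above can be sketched as follows. The helper names are hypothetical, and the sketch covers only the tagging step, not the varint encoding or the enumerated strings.

```java
// Hypothetical helpers illustrating the lowest-bit tag on class meta.
final class ClassIdSketch {
    /** Registered classes are written as class_id << 1, so the lowest bit is 0. */
    static int encodeRegistered(int classId) {
        return classId << 1;
    }

    /** The lowest bit tells the reader whether a class id follows. */
    static boolean isRegistered(int firstValue) {
        return (firstValue & 0b1) == 0;
    }

    static int decodeClassId(int encoded) {
        return encoded >>> 1;
    }
}
```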

### Schema evolution

If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows:

- If meta share mode is not enabled, class meta will be written as in schema consistent mode. Field meta such as field
  type and name will be written when the object value is being serialized, using a key-value like layout.
- If meta share mode is enabled, the class will be written as an unsigned int.

## Meta share
> This mode forbids streaming writes, since it needs to look back and update the offset after the whole object graph
> has been written and meta collection is finished.
> TODO: We plan to streamline meta writing, but this has not started yet.
### Schema consistent


### Schema evolution


## Enumerated String

Enumerated strings are mainly used to encode class names and field names. The format consists of a header and a binary.

The header is written in little-endian order. Fury can read this flag first to determine how to deserialize the data.

### Header

#### Write by data

If string hasn't been written before, the data will be written as follows:

```
| unsigned int: string binary size + 1bit: not written before | 61bits: murmur hash + 3 bits encoding flags | string binary |
```

| Encoding Flag | Pattern | Encoding Action |
|---------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| 0 | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
| 1 | every char is in `a-z._$` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
| 2 | every char is in `a-zA-Z._$` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
| 3 | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` |
| 4 | any utf-8 char | use `UTF-8` encoding |

#### Write by ref
If string has been written before, the data will be written as follows:
```
| unsigned int: written string id + 1bit: written before |
```

### String binary

String binary encoding:

| Algorithm                 | Pattern        | Description                                                                                                                                       |
|---------------------------|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                                                          |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
| UTF-8                     | any chars      | UTF-8 encoding                                                                                                                                    |
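
As an illustration of `LOWER_SPECIAL`, the sketch below maps each char to its 5-bit code and packs the codes into bytes. The code values follow the table; the bit order within the output (most significant bit first, zero padding at the end) is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical sketch of LOWER_SPECIAL packing; bit order is an assumption.
final class LowerSpecialSketch {
    private static final String SPECIALS = "._$|"; // codes 26..29

    /** Maps a char to its 5-bit code: a-z -> 0..25, '.', '_', '$', '|' -> 26..29. */
    static int code(char c) {
        if (c >= 'a' && c <= 'z') {
            return c - 'a';
        }
        int i = SPECIALS.indexOf(c);
        if (i < 0) {
            throw new IllegalArgumentException("not a LOWER_SPECIAL char: " + c);
        }
        return 26 + i;
    }

    /** Packs 5 bits per char, most significant bit first, zero-padded to a whole byte. */
    static byte[] encode(String s) {
        byte[] out = new byte[(s.length() * 5 + 7) / 8];
        int bitPos = 0;
        for (int i = 0; i < s.length(); i++) {
            int code = code(s.charAt(i));
            for (int b = 4; b >= 0; b--, bitPos++) {
                if (((code >> b) & 1) != 0) {
                    out[bitPos >> 3] |= (byte) (0x80 >> (bitPos & 7));
                }
            }
        }
        return out;
    }
}
```

A 10-char name such as `io.fury.ab` takes 50 bits, i.e. 7 bytes instead of 10.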

## Value Format

### Basic types

#### Bool

- size: 1 byte
- format: 0 for `false`, 1 for `true`

#### Byte

- size: 1 byte
- format: written as a raw byte.

#### Short

- size: 2 bytes
- byte order: little endian order

#### Char

- size: 2 bytes
- byte order: little endian order

#### Unsigned int

- size: 1~5 bytes
- Format: the first bit of every byte indicates whether a next byte follows. If the first bit is set,
  i.e. `(b & 0x80) == 0x80`, the next byte should be read, until the first bit of a byte is unset.
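
A minimal sketch of this variable-length format. It assumes the 7-bit groups are written lowest group first (as in LEB128-style varints), which the text above does not state explicitly; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the unsigned varint format; group order is an assumption.
final class VarUintSketch {
    /** Writes the int as an unsigned value in 1~5 bytes, 7 payload bits per byte. */
    static byte[] write(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // first bit set: another byte follows
            value >>>= 7;
        }
        out.write(value); // first bit unset: last byte
        return out.toByteArray();
    }

    /** Reads bytes until one has its first bit unset. */
    static int read(byte[] buf) {
        int result = 0;
        int shift = 0;
        for (byte b : buf) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }
}
```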

#### Signed int

- size: 1~5 bytes
- Format: first convert the number into a positive unsigned int using the ZigZag algorithm `(v << 1) ^ (v >> 31)`,
  then encode it as an unsigned int.
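
The ZigZag mapping can be sketched as below. The encode expression is the one given above; the decode expression is the standard inverse and is not taken from this document. Names are hypothetical.

```java
// Hypothetical sketch of ZigZag mapping for 32-bit ints.
final class ZigZagSketch {
    /** Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... */
    static int encode(int v) {
        return (v << 1) ^ (v >> 31);
    }

    /** Standard inverse of the mapping above. */
    static int decode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }
}
```

Small negative numbers therefore become small unsigned numbers, which the varint format stores in one or two bytes.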

#### Unsigned long

- size: 1~9 bytes
- Fury PVL (Progressive Variable-length Long) Encoding:
  - positive long format: the first bit of every byte indicates whether a next byte follows. If the first bit is set,
    i.e. `(b & 0x80) == 0x80`, the next byte should be read, until the first bit of a byte is unset.

#### Signed long

- size: 1~9 bytes
- Fury SLI (Small long as int) Encoding:
  - If the long is in [-1073741824, 1073741823], encode it as a 4-byte int: `| little-endian: ((int) value) << 1 |`
  - Otherwise write it as 9 bytes: `| 0b1 | little-endian 8bytes long |`
- Fury PVL (Progressive Variable-length Long) Encoding:
  - First convert the number into a positive unsigned long using the ZigZag algorithm `(v << 1) ^ (v >> 63)` to reduce
    the cost of small negative numbers, then encode it as an unsigned long.
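
A minimal sketch of the SLI encoding, assuming the reader distinguishes the two layouts by the lowest bit of the first byte (`0` for the 4-byte form, `1` for the 9-byte form). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of SLI (Small long as int) encoding.
final class SliLongSketch {
    static byte[] write(long value) {
        if (value >= -1073741824L && value <= 1073741823L) {
            // Fits in 31 bits: shift left so the lowest bit of the first byte is 0.
            return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(((int) value) << 1).array();
        }
        // Otherwise: a marker byte 0b1 followed by the full 8-byte long.
        return ByteBuffer.allocate(9).order(ByteOrder.LITTLE_ENDIAN)
            .put((byte) 0b1).putLong(value).array();
    }

    static long read(byte[] buf) {
        ByteBuffer in = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        if ((buf[0] & 0b1) == 0) {
            return in.getInt() >> 1; // arithmetic shift restores the sign
        }
        in.get(); // skip the marker byte
        return in.getLong();
    }
}
```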

#### Float

- size: 4 bytes
- format: convert the float to a 4-byte int by `Float.floatToRawIntBits`, then write it as binary in little-endian order.

#### Double

- size: 8 bytes
- format: convert the double to an 8-byte long by `Double.doubleToRawLongBits`, then write it as binary in little-endian order.
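
The float and double formats can be sketched with `java.nio.ByteBuffer`. The helper class and method names are hypothetical; only `Float.floatToRawIntBits` and `Double.doubleToRawLongBits` come from the text above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the float/double value format.
final class FloatingPointSketch {
    static byte[] writeFloat(float v) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
            .putInt(Float.floatToRawIntBits(v)).array();
    }

    static byte[] writeDouble(double v) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
            .putLong(Double.doubleToRawLongBits(v)).array();
    }
}
```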

### String

Format:

- one byte for encoding: 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
- positive varint for encoded string binary length.
- encoded string binary data based on encoding: `latin/utf-16/utf-8`.

Which encoding to choose:

- For JDK8: Fury detects `latin` at runtime. If the string is a `latin` string, `latin` encoding is used; otherwise
  `utf-16` is used.
- For JDK9+: Fury uses the `coder` field of the `String` object to choose between `latin` and `utf-16`.
- If the string is encoded by `utf-8`, Fury will use `utf-8` to decode the data. Currently Fury doesn't enable
  `utf-8` encoding by default for Java. Cross-language string serialization of Fury uses `utf-8` by default.
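
A minimal sketch of the string layout (encoding byte, varint length, binary). It detects `latin` by checking that every char is below 256 and writes `utf-16` as little-endian; both choices are assumptions for illustration, as are the class and method names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the string value format; detection logic is an assumption.
final class StringFormatSketch {
    static final byte LATIN = 0;
    static final byte UTF16 = 1;

    static byte[] write(String s) {
        // Assumed detection: a string is latin if every char fits in one byte.
        boolean latin = s.chars().allMatch(c -> c < 256);
        byte[] data = s.getBytes(latin ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16LE);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(latin ? LATIN : UTF16);
        // Positive varint for the encoded binary length.
        int len = data.length;
        while ((len & ~0x7F) != 0) {
            out.write((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.write(len);
        out.write(data, 0, data.length);
        return out.toByteArray();
    }
}
```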

### Collection

> All collection serializers must extend `io.fury.serializer.collection.CollectionSerializer`.

Format:

```
length(unsigned varint) | collection header | elements header | elements data
```

#### Collection header

- For `ArrayList/LinkedArrayList/HashSet/LinkedHashSet`, this will be empty.
- For `TreeSet`, this will be the `Comparator`.
- For subclasses of `ArrayList`, this may be extra object field info.

#### Elements header

In most cases, all collection elements are of the same type and not null. The elements header encodes this homogeneous
information to avoid the cost of writing it for every element. Specifically, four kinds of information are encoded in
the elements header, each using one bit:

- Whether to track element refs: the first bit `0b1` of the header.
- Whether the collection has null elements: the second bit `0b10` of the header. If ref tracking is enabled for this
  element type, this flag is invalid.
- Whether the collection element type is not the declared type: the 3rd bit `0b100` of the header.
- Whether the collection elements have different types: the 4th bit `0b1000` of the header.

By default, all bits are unset, which means that no element tracks refs, all elements are of the same type and not null,
and the actual element type is the declared type in the custom class field.
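
The header bits above could be computed as in the following sketch. The bit values come from the list; the class, constant names, and the single-pass scan are hypothetical and simplified (for example, the declared-type check only compares the first non-null element).

```java
import java.util.Collection;

// Hypothetical sketch of computing the elements header; not Fury's actual logic.
final class ElementsHeaderSketch {
    static final int TRACKING_REF = 0b1;
    static final int HAS_NULL = 0b10;
    static final int NOT_DECL_TYPE = 0b100;
    static final int NOT_SAME_TYPE = 0b1000;

    static int compute(Collection<?> elements, Class<?> declaredType, boolean trackRef) {
        int header = trackRef ? TRACKING_REF : 0;
        Class<?> first = null;
        for (Object e : elements) {
            if (e == null) {
                // The null bit is only meaningful when refs are not tracked.
                if (!trackRef) {
                    header |= HAS_NULL;
                }
                continue;
            }
            if (first == null) {
                first = e.getClass();
            } else if (first != e.getClass()) {
                header |= NOT_SAME_TYPE;
            }
        }
        if (first != null && first != declaredType) {
            header |= NOT_DECL_TYPE;
        }
        return header;
    }
}
```

A `List<String>` field holding only non-null strings yields header `0`, so the writer can skip the per-element ref flag, null flag, and class info.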

#### Elements data

Based on the elements header, the serialization of elements data may skip `ref flag`/`null flag`/`element class info`.

`io.fury.serializer.collection.CollectionSerializer#write/read` can be taken as an example.

### Array

#### Primitive array

#### Object array

### Map

### Enum

Enum are serialized as an

### Object

### Class

## Implementation guidelines

- Try to merge multiple bytes into a single int/long write to reduce memory IO and bounds-check cost.
- Read multiple bytes as an int/long, then split it into multiple bytes to reduce memory IO and bounds-check cost.
- Try to use one varint/long to write flags and length together to save one byte and reduce memory IO.
- A conditional branch is less expensive than memory IO, unless there are too many branches.
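
The first guideline can be illustrated with `java.nio.ByteBuffer`: four single-byte writes versus one merged int write that produces the same bytes. This is only a sketch with hypothetical names; it assumes the buffer is in little-endian order.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: merging four byte writes into one int write.
final class MergedWriteSketch {
    /** Four separate writes, each with its own bounds check. */
    static void writeSeparately(ByteBuffer buf, byte a, byte b, byte c, byte d) {
        buf.put(a).put(b).put(c).put(d);
    }

    /** One write with the same result; buf must use ByteOrder.LITTLE_ENDIAN. */
    static void writeMerged(ByteBuffer buf, byte a, byte b, byte c, byte d) {
        int merged = (a & 0xFF) | (b & 0xFF) << 8 | (c & 0xFF) << 16 | (d & 0xFF) << 24;
        buf.putInt(merged);
    }
}
```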
