Home
Dealing with raw bytes is a pain, but there are a number of situations where it can be necessary:
- Interfacing with C libraries
- Communicating via custom network protocols
- Compressing data structures to their most compact representation
Gloss tries to make things simpler by automatically transforming a simple specification of a byte format into both an encoder and a streaming decoder. It is built into Aleph, where it can greatly simplify communicating via complex protocols.
In Gloss, byte formats are called frames. A frame is a standard Clojure data structure, with types substituted for actual values. For instance, this is a valid frame:
{:a :int16, :b :float32}
To turn a frame into something that can handle bytes, we call compile-frame. This will return a codec, which can be used with encode and decode.
> (def fr (compile-frame {:a :int16, :b :float32}))
#'fr
> (encode fr {:a 1, :b 2})
[ #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]> ]
> (decode fr *1)
{:a 1, :b 2.0}
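To make the six-byte buffer above less mysterious, here is a hand-rolled sketch of the same layout using plain java.nio, without Gloss. It assumes big-endian byte order (the java.nio default) and that :a is written before :b; pack-frame and unpack-frame are hypothetical helper names, not part of Gloss.

```clojure
(import '(java.nio ByteBuffer))

;; Hand-rolled equivalent of the {:a :int16, :b :float32} layout.
;; Assumption: big-endian order, :a written before :b.
(defn pack-frame
  [{:keys [a b]}]
  (doto (ByteBuffer/allocate 6)   ; 2 bytes for the int16 + 4 bytes for the float32
    (.putShort (short a))
    (.putFloat (float b))
    (.flip)))                     ; reset the position to 0 so the buffer can be read

(defn unpack-frame
  [^ByteBuffer buf]
  (let [buf (.duplicate buf)      ; independent position, caller's buffer is untouched
        a   (.getShort buf)
        b   (.getFloat buf)]
    {:a a, :b b}))
```

Calling (unpack-frame (pack-frame {:a 1, :b 2})) gives back a map equal to {:a 1, :b 2.0}. This is the bookkeeping that a compiled frame takes off our hands.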
(defcodec name frame) can also be used to create a new codec.
Notice that encode returns a sequence of ByteBuffers. Gloss consumes and emits sequences of ByteBuffers, because it’s designed to deal with streaming data. Turning these sequences into a contiguous ByteBuffer can be accomplished by calling (contiguous buffer-sequence), but this is only necessary when interfacing with external libraries.
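As a rough illustration of what flattening a buffer sequence involves, the following sketch copies several ByteBuffers into one using only java.nio. join-buffers is a hypothetical helper written for this example; it is not how Gloss implements contiguous.

```clojure
(import '(java.nio ByteBuffer))

(defn join-buffers
  "Copies a sequence of ByteBuffers into one freshly allocated ByteBuffer."
  [bufs]
  (let [total (reduce + (map #(.remaining ^ByteBuffer %) bufs))
        out   (ByteBuffer/allocate total)]
    (doseq [^ByteBuffer b bufs]
      (.put out (.duplicate b)))  ; duplicate so the source position is not advanced
    (.flip out)                   ; make the copied bytes readable from position 0
    out))
```

The copy is the cost we pay for a single buffer, which is one reason to stay with sequences until an external library demands otherwise.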
To encode and decode sequences of frames, use encode-all and decode-all.
> (defcodec fr :float32)
#'fr
> (encode-all fr [1 2])
[ #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]> #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]> ]
> (decode-all fr *1)
[1.0 2.0]
Gloss supports the following primitive types: :byte, :int16, :int32, :int64, :float32, and :float64. Unsigned integer types are not currently supported.
Defining a stream of text is simple:
(string :utf-8)
The argument, here :utf-8, can be any valid name for a standard character set.
The above works great if our entire data structure is a string, but doesn’t allow for mixing different types of data with a string. If we want to limit the length of the string, we have two choices: we can make the string fixed length, or terminate it with a delimiter. The first case is straightforward:
(string :utf-8 :length 10)
This defines a string which only contains ten characters. This means that it will consume a finite number of bytes, and we can place it in a data structure beside other data types:
[:int32 (string :utf-8 :length 10) :int32]
Delimiters are common tools for marking the end of strings. Every string in C, for instance, is by convention null terminated. Text files are often delimited by newline characters. When dealing with strings, we may specify a list of possible delimiters, given as ByteBuffers or anything which can be transformed into a ByteBuffer.
(string :utf-8 :delimiters ["abc" \z 32 [1 2 3]])
This specifies a string which is terminated by one of the available delimiters: the byte sequence corresponding to the string “abc”, the byte corresponding to the character ‘z’, the byte value ‘32’, or the byte sequence [1 2 3]. The largest delimiter will always be consumed, so we can specify two delimiters where one is just a shorter version of the other:
(string :utf-8 :delimiters ["\n" "\r\n"])
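The "largest delimiter wins" rule can be sketched on ordinary strings. The function below is one possible reading of that rule, written for illustration only: split-at-delimiter is a hypothetical name, it works on whole strings rather than streamed bytes, and it is not taken from Gloss.

```clojure
(require '[clojure.string :as str])

(defn split-at-delimiter
  "Returns [text remainder] split at the earliest delimiter found in s,
  or nil if no delimiter occurs. When delimiters overlap, the earliest
  and then the longest match is consumed."
  [s delimiters]
  (let [hits (for [d delimiters
                   :let [i (str/index-of s d)]
                   :when i]
               [i d])]
    (when (seq hits)
      ;; sort by start index, then by descending delimiter length
      (let [[i d] (first (sort-by (fn [[i d]] [i (- (count d))]) hits))]
        [(subs s 0 i) (subs s (+ i (count d)))]))))
```

With the delimiters ["\n" "\r\n"], both "hello\nworld" and "hello\r\nworld" split into "hello" and "world", and no stray "\r" is left at the end of the first part.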
It can be useful to have a simple mapping between values and a unique numerical identifier.
(enum :a :b :c)
This assigns unique numbers to each value, and allows the enumeration to be used as a data-type in other frames.
> (defcodec animal (enum :dog :cat :horse))
#'animal
> (defcodec pet
    {:name (string :utf-8 :delimiters ["\n"])
     :type animal})
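Conceptually, an enumeration is a pair of lookup tables. The sketch below builds them by hand; make-enum is a hypothetical helper, and the zero-based numbering in argument order is an assumption made for the example, not a statement about the numbers Gloss actually assigns.

```clojure
(defn make-enum
  "Returns two lookup maps for the given values:
  :encode maps value -> index, :decode maps index -> value.
  Assumption: indices start at 0 and follow argument order."
  [& values]
  {:encode (zipmap values (range))
   :decode (zipmap (range) values)})

(def animal-ids (make-enum :dog :cat :horse))
```

Here ((:encode animal-ids) :cat) looks up the number for :cat, and (:decode animal-ids) maps that number back to the keyword.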
A header is a frame which specifies the following frame. A header is created using (header frame header->body body->header). The first argument is the frame for the header. The second argument is a function which takes the value from the header, and returns a codec for the body. The third argument is a function which takes the value of the body, and returns the header value.
Let’s look at a frame that can describe a rectangle, a triangle, or a circle:
(defcodec type (enum :rectangle :triangle :circle))
(defcodec triangle {:type :triangle, :width :int32, :height :int32})
(defcodec rectangle {:type :rectangle, :width :int32, :height :int32})
(defcodec circle {:type :circle, :radius :int32})
(defcodec shapes
  (header
    type
    {:triangle triangle, :rectangle rectangle, :circle circle}
    :type))
Each frame starts with an enum, and header->body is just a hash of enum values to codecs. Since the frames have :type hardcoded, going from the decoded frame to the header value is trivial.
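To show the dispatch that a header performs, here is a hand-written sketch of the shapes format in plain java.nio. Everything specific in it is an assumption made for illustration: the one-byte tag, the zero-based ids, the field order, and the names encode-shape and decode-shape. The bytes Gloss emits for the codec above may differ.

```clojure
(import '(java.nio ByteBuffer))

;; Assumed tag values; Gloss may number the enum differently.
(def type->id {:rectangle 0, :triangle 1, :circle 2})
(def id->type (zipmap (vals type->id) (keys type->id)))

;; Field order for each body; every field is written as an int32.
(def body-fields
  {:rectangle [:width :height]
   :triangle  [:width :height]
   :circle    [:radius]})

(defn encode-shape [shape]
  (let [fields (body-fields (:type shape))
        buf    (ByteBuffer/allocate (+ 1 (* 4 (count fields))))]
    (.put buf (byte (type->id (:type shape))))   ; body->header: the tag comes from :type
    (doseq [f fields]
      (.putInt buf (int (shape f))))
    (.flip buf)
    buf))

(defn decode-shape [^ByteBuffer buf]
  (let [buf (.duplicate buf)
        t   (id->type (long (.get buf)))]        ; header->body: the tag selects the field list
    (reduce (fn [m f] (assoc m f (.getInt buf)))
            {:type t}
            (body-fields t))))
```

Round-tripping {:type :circle, :radius 5} through encode-shape and decode-shape returns an equal map. The header combinator expresses the same two lookups declaratively.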
We can have sequences of the same data-type:
[:int32 :int32 :int32]
but this only works for sequences of fixed length. To support dynamically sized sequences, we need to use repeated:
(repeated :int32)
This will encode to a sequence of 32-bit integers, with an integer prepended that describes the length. By default the prefix is a 32-bit integer, but this can be customized using prefix:
(repeated [:int16 :int16] :prefix (prefix :byte))
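The default length-prefixed layout described above can be sketched by hand with java.nio. This is an illustration under assumptions, not Gloss's implementation: encode-ints and decode-ints are hypothetical names, the prefix is taken to count elements rather than bytes, and byte order is big-endian.

```clojure
(import '(java.nio ByteBuffer))

(defn encode-ints
  "Writes a 32-bit element count followed by each value as an int32."
  [xs]
  (let [buf (ByteBuffer/allocate (* 4 (inc (count xs))))]
    (.putInt buf (int (count xs)))     ; length prefix
    (doseq [x xs]
      (.putInt buf (int x)))
    (.flip buf)
    buf))

(defn decode-ints
  "Reads the element count, then that many int32 values."
  [^ByteBuffer buf]
  (let [buf (.duplicate buf)           ; leave the caller's position untouched
        n   (.getInt buf)]
    (vec (repeatedly n #(.getInt buf)))))
```

Encoding [7 8 9] this way takes 16 bytes: four for the prefix and four for each element. Switching to a :byte prefix would save three of them, at the cost of capping the sequence length.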
Any primitive type can be used as a prefix. More complex prefixes can also be created, in a similar manner to header. (prefix frame to-integer from-integer) requires a frame, a function which returns the length of the sequence given the prefix, and a function which returns the prefix given the sequence length. For instance, consider a prefix which contains the length printed as a string:
(prefix
  (string :ascii :delimiters ["x"])
  #(Integer/parseInt %)
  str)
We can also create a sequence terminated by a delimiter:
(repeated (string :utf-8 :delimiters ["\n"]) :delimiters ["\\0"])
Notice that both the string and the sequence are delimited. There will be a \n delimiter for each string, and a single \0 delimiter for the entire sequence.