Use Utf8Json as the internal serializer #3493

russcam · 2018-11-19T08:31:04Z

This issue has been opened to discuss moving the internal serializer from Json.NET over to a faster JSON serialization library.

The feature/utf8json-serializer branch contains a minimal viable prototype of deserializing an ISearchResponse<T> and serializing ISearchRequest.

Some key observations working with utf8json whilst putting together this prototype:

Hit<T> requires a custom formatter to be resolved at the IJsonFormatterResolver level because it contains a generic type property whose formatter, SourceFormatter<T>, cannot be resolved using JsonFormatterAttribute. If it were possible to resolve, then it would be possible to attribute Hit<T> with [JsonFormatter(typeof(HitFormatter<>))], and have the _source field attributed with [JsonFormatter(typeof(SourceFormatter<>))]. For now, initialize an instance of SourceFormatter<T> inside the HitFormatter<T> constructor.
Implementation does not handle different field casings
HitFormatter<T> avoids allocating strings when reading property names by using AutomataDictionary. This dictionary lives outside of the generic HitFormatter<T> to avoid creating an instance of the dictionary for each T.
Both JsonReader and JsonWriter are structs passed by ref, so they cannot be captured inside local functions or lambda expression bodies; instead, they need to be passed as a ref parameter to a function. An example is JoinFieldFormatter's Serialize method.
utf8json does not have a concept similar to [JsonObject(MemberSerialization.OptIn)], which serializes only those members that have been explicitly attributed with DataMemberAttribute. This would ideally be needed, as it is cumbersome to set [IgnoreDataMember] on all properties that should be ignored.
ConnectionSettings is retrieved by casting IJsonFormatterResolver to a known concrete implementation that exposes it as a property. Not ideal, but it works.
utf8json does not make a distinction between an integer token and a float token as Json.NET does. This is not much of a problem, since the bytes for the token can be inspected to determine whether they contain a decimal point, and utf8json's internals can then be used to deserialize accordingly. Also, this is needed only in cases where an integer/double distinction is necessary. See FuzzinessFormatter for an example.
The equivalent to JsonConverter, IJsonFormatter<T>, only has a generic variant. In several places in the client, we may serialize using an interface, but deserialize using the concrete implementation. This is handled by ConcreteInterfaceFormatter<TConcrete, TInterface>, where the formatter is IJsonFormatter<TInterface>. An interesting case is when the concrete type should be serialized as the interface; in such scenarios, we end up with two formatters, one for the concrete type and one for the interface, where each formatter references the other's serialize/deserialize implementation. See QueryContainerFormatter and QueryContainerInterfaceFormatter for an example.
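The integer/float observation above can be illustrated with a minimal sketch. It is not the FuzzinessFormatter code and does not use utf8json's internals; `NumberTokenSketch` and its methods are hypothetical names, and parsing goes through a string purely to keep the example short. Treating an exponent marker as a float is an assumption added here, on top of the decimal point check mentioned above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NumberTokenSketch
{
    // Returns true when the raw bytes of a JSON number token contain a
    // decimal point or an exponent marker, i.e. it is not a plain integer.
    public static bool LooksLikeFloat(ArraySegment<byte> token)
    {
        for (var i = 0; i < token.Count; i++)
        {
            var b = token.Array[token.Offset + i];
            if (b == (byte)'.' || b == (byte)'e' || b == (byte)'E')
                return true;
        }
        return false;
    }

    // Parses the token as a double or a long depending on its shape.
    // A real formatter would parse the bytes directly instead of
    // allocating a string; integers outside the long range would throw here.
    public static object Parse(ArraySegment<byte> token)
    {
        var text = Encoding.ASCII.GetString(token.Array, token.Offset, token.Count);
        if (LooksLikeFloat(token))
            return double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
        return long.Parse(text, NumberStyles.AllowLeadingSign, CultureInfo.InvariantCulture);
    }
}
```

With this shape, a token of `2` comes back as a long and `2.0` as a double, which is the distinction a fuzziness-style value would need.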

Benchmarking the feature/utf8json-serializer branch against the 6.4.0 nuget package in deserializing a fixed byte response of 100, 1000 or 10000 Stackoverflow questions, the following results were collected.

BenchmarkDotNet=v0.11.2.856-nightly, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.500
  [Host]     : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
  Job-EXDGCR : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
MinInvokeCount=30  MinIterationTime=500.0000 ms  Jit=RyuJit  
Platform=AnyCpu

100 Stackoverflow questions

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen 0/1k Op	Gen 1/1k Op	Gen 2/1k Op	Allocated Memory/Op
Search6x	1,786.4 us	15.78 us	14.76 us	1,785.3 us	1.00	0.00	87.8906	42.9688	-	540.57 KB
Search6xAsync	1,810.0 us	36.57 us	50.06 us	1,792.0 us	1.02	0.03	87.8906	42.9688	-	541.06 KB
Search6xJsonNetSerializer	5,557.4 us	86.19 us	80.62 us	5,547.7 us	3.11	0.04	554.6875	250.0000	15.6250	3450.89 KB
Search6xJsonNetSerializerAsync	4,923.2 us	101.23 us	204.48 us	4,929.8 us	2.72	0.10	-	-	-	3451.38 KB
SearchBleeding	933.0 us	23.24 us	67.43 us	911.9 us	0.51	0.02	-	-	-	679.26 KB
SearchBleedingAsync	949.2 us	24.30 us	70.88 us	931.7 us	0.54	0.02	-	-	-	679.96 KB
SearchBleedingJsonNetSerializer	931.4 us	22.70 us	64.76 us	917.0 us	0.53	0.04	-	-	-	679.26 KB
SearchBleedingJsonNetSerializerAsync	926.7 us	25.70 us	74.97 us	901.9 us	0.53	0.05	-	-	-	679.96 KB

1000 Stackoverflow questions

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen 0/1k Op	Gen 1/1k Op	Gen 2/1k Op	Allocated Memory/Op
Search6x	19.473 ms	0.1661 ms	0.1554 ms	19.511 ms	1.00	0.00	625.0000	281.2500	62.5000	3.71 MB
Search6xAsync	15.165 ms	0.2438 ms	0.2162 ms	15.121 ms	0.78	0.01	-	-	-	3.71 MB
Search6xJsonNetSerializer	50.683 ms	0.9887 ms	1.6790 ms	50.935 ms	2.63	0.09	4000.0000	1000.0000	-	29.6 MB
Search6xJsonNetSerializerAsync	50.297 ms	1.0050 ms	1.6229 ms	50.319 ms	2.56	0.10	4000.0000	1000.0000	-	29.6 MB
SearchBleeding	8.276 ms	0.1817 ms	0.5328 ms	7.994 ms	0.43	0.03	-	-	-	6.38 MB
SearchBleedingAsync	7.887 ms	0.1972 ms	0.3012 ms	7.790 ms	0.41	0.02	-	-	-	6.38 MB
SearchBleedingJsonNetSerializer	8.189 ms	0.1745 ms	0.4565 ms	7.954 ms	0.42	0.02	-	-	-	6.38 MB
SearchBleedingJsonNetSerializerAsync	7.862 ms	0.2369 ms	0.4149 ms	7.687 ms	0.40	0.02	-	-	-	6.38 MB

10,000 Stackoverflow questions

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen 0/1k Op	Gen 1/1k Op	Gen 2/1k Op	Allocated Memory/Op
Search6x	203.2 ms	3.901 ms	3.257 ms	1.00	0.00	6000.0000	2000.0000	-	36.39 MB
Search6xAsync	205.6 ms	2.221 ms	2.078 ms	1.01	0.01	6000.0000	2000.0000	-	36.39 MB
Search6xJsonNetSerializer	558.1 ms	8.880 ms	8.306 ms	2.75	0.06	49000.0000	8000.0000	-	298.77 MB
Search6xJsonNetSerializerAsync	564.7 ms	5.126 ms	4.544 ms	2.78	0.05	49000.0000	8000.0000	-	298.77 MB
SearchBleeding	117.4 ms	1.359 ms	1.271 ms	0.58	0.01	4000.0000	1000.0000	-	90.12 MB
SearchBleedingAsync	114.2 ms	1.980 ms	1.852 ms	0.56	0.01	4000.0000	1000.0000	-	90.12 MB
SearchBleedingJsonNetSerializer	118.2 ms	1.572 ms	1.471 ms	0.58	0.01	4000.0000	1000.0000	-	90.12 MB
SearchBleedingJsonNetSerializerAsync	112.8 ms	2.244 ms	1.989 ms	0.55	0.01	4000.0000	1000.0000	-	90.12 MB

*6x* methods use the 6.4.0 nuget package
*Bleeding* methods use the utf8json branch

A nice advantage of using utf8json as the internal serializer is that the handoff to a custom serializer can be done using a MemoryStream constructed from an ArraySegment<byte>, avoiding the need to read into a JToken and construct a Stream from the token, much reducing serialization time and allocations.
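As a rough sketch of that handoff, the following wraps a slice of an existing byte buffer in a read-only MemoryStream without copying it. `SegmentStreamSketch` and `AsReadOnlyStream` are hypothetical names rather than the branch's actual code; only the BCL MemoryStream constructor is assumed.

```csharp
using System;
using System.IO;

public static class SegmentStreamSketch
{
    // Exposes the bytes covered by the segment as a stream over the same
    // underlying array, so no intermediate token or copy is needed.
    public static MemoryStream AsReadOnlyStream(ArraySegment<byte> segment) =>
        new MemoryStream(segment.Array, segment.Offset, segment.Count, writable: false);
}
```

A custom serializer that accepts a Stream could then read just the `_source` portion of a response, provided the caller knows the offset and length of that portion within the response bytes.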

Allocated memory/op

The allocated memory per op is higher across the board with utf8json. To determine if this was a fixed amount of allocated memory/op, two searches were performed per benchmark method. The amount of allocated memory doubles, so the allocation scales with the number of requests rather than being a fixed one-off cost.

10,000 Stackoverflow questions with 2 search requests per benchmarked method

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen 0/1k Op	Gen 1/1k Op	Gen 2/1k Op	Allocated Memory/Op
Search6x	402.3 ms	3.155 ms	2.951 ms	1.00	0.00	13000.0000	5000.0000	1000.0000	72.77 MB
Search6xAsync	408.8 ms	11.595 ms	16.630 ms	1.03	0.05	13000.0000	5000.0000	1000.0000	72.78 MB
Search6xJsonNetSerializer	1,100.1 ms	7.118 ms	6.658 ms	2.73	0.03	101000.0000	19000.0000	2000.0000	597.54 MB
Search6xJsonNetSerializerAsync	1,037.2 ms	5.950 ms	5.566 ms	2.58	0.02	100000.0000	19000.0000	1000.0000	597.54 MB
SearchBleeding	259.7 ms	3.799 ms	3.368 ms	0.65	0.01	9000.0000	4000.0000	1000.0000	180.25 MB
SearchBleedingAsync	248.5 ms	3.518 ms	3.291 ms	0.62	0.01	9000.0000	4000.0000	1000.0000	180.25 MB
SearchBleedingJsonNetSerializer	260.5 ms	2.991 ms	2.652 ms	0.65	0.01	9000.0000	4000.0000	1000.0000	180.25 MB
SearchBleedingJsonNetSerializerAsync	246.3 ms	3.146 ms	2.943 ms	0.61	0.01	9000.0000	4000.0000	1000.0000	180.25 MB


Mpdreamz · 2018-11-19T09:37:48Z

Investigate UTF8Json, fully json compliant?
Unify all byte pool techniques (in Json.NET, UTF8Json, Nest)
Investigate long running application characteristics.

russcam · 2018-11-20T02:07:33Z

I've run a small, simple console app that continuously loops a search request returning a fixed response of 100 StackOverflow questions, and profiled memory usage with dotMemory, using both NEST 6.4.0 and the feature/utf8json-serializer branch.

Comparison overview

Overall, utf8json allocates less total memory and fewer objects. utf8json allocates more total .NET memory, and uses slightly more of that memory than 6.4.0. The breakdown by namespace (doesn't seem to be a nice way to export this data...):

(A) is utf8json
(B) is NEST 6.4.0

Unsurprisingly, NEST 6.4.0 allocates more objects and bytes from the Nest namespace. A large proportion of this can be attributed to the internalized Json.NET types. utf8json allocates more bytes from the System namespace, with Byte[] being the largest contributor.

I've attached a zip of both workspaces that can be opened in dotMemory 2018.2.3:

utf8json_vs_NEST640.zip

utf8json uses a ThreadStatic byte[] for synchronous serialization operations, and an internal ArrayPool<byte> implementation for async operations.
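The two buffering strategies can be sketched as follows. This is not utf8json's implementation: `BufferSketch` and its members are hypothetical, the default buffer size is arbitrary, and System.Buffers' shared ArrayPool<byte> stands in for utf8json's internal pool.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class BufferSketch
{
    // One buffer per thread; suitable only while the work stays on that thread.
    [ThreadStatic]
    private static byte[] _threadBuffer;

    // Synchronous path: reuse (and grow if needed) the current thread's buffer.
    public static byte[] GetThreadBuffer(int minimumSize = 65536)
    {
        if (_threadBuffer == null || _threadBuffer.Length < minimumSize)
            _threadBuffer = new byte[minimumSize];
        return _threadBuffer;
    }

    // Pooled path: an async continuation may resume on another thread, so a
    // thread-bound buffer is unsafe; rent from a pool and return it when done.
    public static int EncodeWithPool(string text, Action<ArraySegment<byte>> consume)
    {
        var pool = ArrayPool<byte>.Shared;
        var rented = pool.Rent(Encoding.UTF8.GetMaxByteCount(text.Length));
        try
        {
            var written = Encoding.UTF8.GetBytes(text, 0, text.Length, rented, 0);
            consume(new ArraySegment<byte>(rented, 0, written));
            return written;
        }
        finally
        {
            pool.Return(rented);
        }
    }
}
```

Either strategy keeps large Byte[] instances alive between operations, which may be one reason Byte[] shows up prominently in a memory profile.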

russcam · 2018-11-20T03:29:09Z

Investigate UTF8Json, fully json compliant?

Utf8Json only supports UTF-8 encoding, and may fail with any other encoding. According to the JSON RFC 8259:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8
when transmitting JSON text. However, the vast majority of JSON-
based software implementations have chosen to use the UTF-8 encoding,
to the extent that it is the only encoding that achieves
interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the
beginning of a networked-transmitted JSON text. In the interests of
interoperability, implementations that parse JSON texts MAY ignore
the presence of a byte order mark rather than treating it as an
error.

utf8json handles UTF-8, including a BOM.
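Ignoring a byte order mark, as the RFC permits parsers to do, could look like the sketch below. `BomSketch.SkipUtf8Bom` is a hypothetical helper written for illustration and is not taken from utf8json.

```csharp
using System;

public static class BomSketch
{
    // Returns a view over the JSON bytes that excludes a leading UTF-8
    // byte order mark (EF BB BF) when one is present.
    public static ArraySegment<byte> SkipUtf8Bom(byte[] json)
    {
        var hasBom = json.Length >= 3
            && json[0] == 0xEF && json[1] == 0xBB && json[2] == 0xBF;
        return hasBom
            ? new ArraySegment<byte>(json, 3, json.Length - 3)
            : new ArraySegment<byte>(json, 0, json.Length);
    }
}
```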

abrobston · 2019-02-05T16:20:06Z

Have you considered Jil?

russcam · 2019-02-06T00:18:37Z

@abrobston Jil was a consideration. There are a few reasons why utf8json was pursued:

the API is more similar to Json.NET's, with the thinking that this will make the replacement process easier to perform
utf8json performance appears to be slightly better than Jil for the use case
Jil looks to be missing some features that are needed by the client e.g. IDictionary<TKey, TValue> implementations where TKey is not a string or enumeration
extending/modifying utf8json to fit the needs of the client looked to be a straightforward venture from the POC
Many types in NEST require custom deserialization routines, which utf8json supports but Jil does not appear to

More generally, the API surface of utf8json looks small enough to introduce types such as Span<T>, Memory<T> and pipelines for frameworks that support them, to improve serialization further.

russcam · 2019-02-08T05:15:42Z

An update on progress! Commits have been going into the feature/utf8json-serializer branch (https://github.com/elastic/elasticsearch-net/tree/feature/utf8json-serializer), with only a few unit tests and integration tests still failing (these are expected right now and will be fixed).

I want to summarize the changes thus far, and also to itemize the remaining items that I'm aware of:

Breaking Changes

Dynamic code generation with Reflection.Emit

utf8json uses Reflection.Emit to generate formatters for types. Reflection.Emit is not supported on all platforms e.g. UWP, Xamarin.iOS, Xamarin.Android.

DynamicResponse deserializes JSON arrays to List<object>

SimpleJson deserializes JSON arrays to object[], but utf8json, through PrimitiveObjectFormatter, deserializes them to List<object>. Whilst this could be changed to call .ToArray() on List<object>, current thinking is that the change to returning List<object> would be preferred for allocation and performance reasons.

LazyDocument holds byte array

LazyDocument holds the value in a byte array copied from the response bytes, as opposed to the current implementation that holds an internal JObject deserialized from bytes. This means that the bytes may represent indented JSON; care needs to be taken if serializing LazyDocument for an API that expects newline delimited JSON, such as bulk API.
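One way to take that care is to strip insignificant whitespace from the held bytes before writing them to a newline delimited body. The sketch below assumes the bytes are valid UTF-8 JSON; `SingleLineJsonSketch` is a hypothetical helper and not part of NEST or utf8json.

```csharp
using System;
using System.IO;

public static class SingleLineJsonSketch
{
    // Copies the JSON bytes, dropping spaces, tabs and line breaks that
    // appear outside string literals, so the result fits on one line.
    public static byte[] RemoveInsignificantWhitespace(byte[] json)
    {
        var output = new MemoryStream(json.Length);
        var inString = false;
        var escaped = false;
        foreach (var b in json)
        {
            if (inString)
            {
                // Inside a string every byte is significant; track escapes
                // so that an escaped quote does not end the string.
                output.WriteByte(b);
                if (escaped) escaped = false;
                else if (b == (byte)'\\') escaped = true;
                else if (b == (byte)'"') inString = false;
                continue;
            }
            if (b == (byte)' ' || b == (byte)'\t' || b == (byte)'\r' || b == (byte)'\n')
                continue;
            if (b == (byte)'"') inString = true;
            output.WriteByte(b);
        }
        return output.ToArray();
    }
}
```

This trades an extra pass and a copy for safety, so it would only be worth applying on paths such as bulk where a single-line document is required.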

Outstanding

Support PrettyJson in serialized requests when Formatting.Indented specified

utf8json is optimized to generate formatters that write no indentation; prettifying JSON is left to reading the resulting serialized bytes and writing them with indentation to another buffer.

Support extending NEST types by deriving and implementing properties

Most formatters are using known interfaces to retrieve type/property information for the purposes of serialization. This approach will not support extending NEST types. Using the fallback IJsonFormatter<object> formatter will work, but does mean that a formatter is generated for each type.

Low level client exception structure should match SimpleJson

Double values should always be serialized with decimal point

Floating point and decimal numbers that do not have fractional values are not serialized with decimal point by utf8json. This saves a few bytes, but would break Elasticsearch type mapping inference.
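The desired output could be sketched as below, using string formatting for brevity. A real formatter would write bytes directly; `DoubleFormatSketch.Format` is a hypothetical helper, and rejecting NaN and infinity is an assumption made here because JSON has no representation for them.

```csharp
using System;
using System.Globalization;

public static class DoubleFormatSketch
{
    // Formats a double so that whole numbers keep a trailing ".0",
    // letting the receiver infer a floating point type rather than an integer.
    public static string Format(double value)
    {
        if (double.IsNaN(value) || double.IsInfinity(value))
            throw new ArgumentException("JSON cannot represent NaN or infinity.", nameof(value));

        var text = value.ToString("R", CultureInfo.InvariantCulture);

        // Values already carrying a decimal point or an exponent are left as-is.
        if (text.IndexOf('.') >= 0 || text.IndexOf('E') >= 0)
            return text;

        return text + ".0";
    }
}
```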

Source include utf8json in Elasticsearch.Net, and mark InternalsVisibleTo() for NEST and dynamic assemblies

Currently added as a git submodule. Whilst source including makes pulling in upstream changes trickier, specific client customizations should be easier to make than if the serializer were in a separate assembly.

Dynamic assemblies should be strongly named

Allows formatters within NEST to be internal and InternalsVisibleTo() dynamic assemblies

russcam · 2019-04-10T06:10:55Z

I've moved the outstanding items into separate issues to keep their scope focused. I'm going to close this issue now as utf8json has been implemented in 7.x.

russcam added Research Discuss labels Nov 19, 2018

russcam mentioned this issue Mar 4, 2019

Replace internal serializer with utf8json #3583

Closed

Mpdreamz mentioned this issue Mar 16, 2019

Feature/utf8json serializer against 7.x #3608

Merged

russcam closed this as completed Apr 10, 2019

russcam added the v7.0.0 label Apr 15, 2019

Mpdreamz mentioned this issue Aug 10, 2020

NameValueCollectionExtensions Performance Optimisation #4951

Closed
