Use Utf8Json as the internal serializer #3493

Closed
russcam opened this issue Nov 19, 2018 · 7 comments
@russcam
Contributor

russcam commented Nov 19, 2018

This issue has been opened to discuss moving the internal serializer from Json.NET over to a faster JSON serialization library.

The feature/utf8json-serializer branch contains a minimal viable prototype of deserializing an ISearchResponse<T> and serializing ISearchRequest.

Some key observations working with utf8json whilst putting together this prototype:

  1. Hit<T> requires a custom formatter to be resolved at the IJsonFormatterResolver level because it contains a generic type property whose formatter, SourceFormatter<T>, cannot be resolved using JsonFormatterAttribute. If it were possible to resolve, then it would be possible to attribute Hit<T> with [JsonFormatter(typeof(HitFormatter<>))], and have the _source field attributed with [JsonFormatter(typeof(SourceFormatter<>))]. For now, initialize an instance of SourceFormatter<T> inside the HitFormatter<T> constructor.

  2. The implementation does not handle different field casings.

  3. HitFormatter<T> avoids allocating strings when reading property names by using AutomataDictionary. This dictionary lives outside of the generic HitFormatter<T> to avoid creating an instance of the dictionary for each T.

  4. Both JsonReader and JsonWriter are structs passed by ref, so they cannot be captured inside
    local functions or lambda expression bodies; they need to be passed as a ref parameter to a function instead. An example is JoinFieldFormatter's Serialize method.

  5. utf8json has no equivalent of [JsonObject(MemberSerialization.OptIn)] to
    serialize only those members that have been explicitly attributed with DataMemberAttribute.
    Ideally this would be supported, as it is cumbersome to set [IgnoreDataMember]
    on every property that should be ignored.

  6. ConnectionSettings is retrieved by casting IJsonFormatterResolver to a known concrete
    implementation that exposes it as a property. Not ideal, but it works.

  7. utf8json does not make a distinction between an integer token and a float token as Json.NET
    does. This is not so much of a problem, since the bytes for the token can be inspected to determine
    if they contain a decimal point, and use utf8json's internals to deserialize accordingly. Also, this
    is needed only in cases where an integer/double distinction is necessary. See FuzzinessFormatter
    for an example.

  8. The equivalent to JsonConverter, IJsonFormatter<T>, only has a generic variant. In several places
    in the client, we may serialize using an interface, but deserialize using the concrete implementation.
    This is handled by ConcreteInterfaceFormatter<TConcrete, TInterface>, where the formatter
    is IJsonFormatter<TInterface>. An interesting case is when the concrete type should be serialized
    as the interface; in such scenarios, we end up with two formatters, one for the concrete type and one
    for the interface, where each formatter references the other's serialize/deserialize implementation. See
    QueryContainerFormatter and QueryContainerInterfaceFormatter for an example.
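
To make observation 7 concrete, here is a minimal sketch of telling an integer token from a floating point token by inspecting its raw bytes. It uses only the base class library; NumberTokenInspector and its methods are hypothetical names, not part of utf8json or NEST, and the extra check for an exponent is an assumption on top of the decimal point check described above.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not a utf8json or NEST type.
public static class NumberTokenInspector
{
    // True when the raw bytes of a JSON number token contain a decimal
    // point or an exponent, i.e. the token should be read as a double.
    public static bool LooksLikeDouble(ArraySegment<byte> token)
    {
        for (var i = 0; i < token.Count; i++)
        {
            var b = token.Array[token.Offset + i];
            if (b == (byte)'.' || b == (byte)'e' || b == (byte)'E')
                return true;
        }
        return false;
    }

    // Parses the token as a long or a double depending on its shape.
    public static object Parse(ArraySegment<byte> token)
    {
        var text = Encoding.UTF8.GetString(token.Array, token.Offset, token.Count);
        if (LooksLikeDouble(token))
            return double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
        return long.Parse(text, NumberStyles.AllowLeadingSign, CultureInfo.InvariantCulture);
    }
}
```

With this sketch, the bytes of `42` come back as a boxed long and the bytes of `0.5` as a boxed double. A real formatter would more likely hand the segment to the serializer's own number parsing than go through a string.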

Benchmarking the feature/utf8json-serializer branch against the 6.4.0 nuget package, deserializing a fixed byte response of 100, 1000 or 10000 Stackoverflow questions, the following results were collected.

BenchmarkDotNet=v0.11.2.856-nightly, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.500
  [Host]     : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
  Job-EXDGCR : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
MinInvokeCount=30  MinIterationTime=500.0000 ms  Jit=RyuJit  
Platform=AnyCpu

100 Stackoverflow questions

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|
| Search6x | 1,786.4 us | 15.78 us | 14.76 us | 1,785.3 us | 1.00 | 0.00 | 87.8906 | 42.9688 | - | 540.57 KB |
| Search6xAsync | 1,810.0 us | 36.57 us | 50.06 us | 1,792.0 us | 1.02 | 0.03 | 87.8906 | 42.9688 | - | 541.06 KB |
| Search6xJsonNetSerializer | 5,557.4 us | 86.19 us | 80.62 us | 5,547.7 us | 3.11 | 0.04 | 554.6875 | 250.0000 | 15.6250 | 3450.89 KB |
| Search6xJsonNetSerializerAsync | 4,923.2 us | 101.23 us | 204.48 us | 4,929.8 us | 2.72 | 0.10 | - | - | - | 3451.38 KB |
| SearchBleeding | 933.0 us | 23.24 us | 67.43 us | 911.9 us | 0.51 | 0.02 | - | - | - | 679.26 KB |
| SearchBleedingAsync | 949.2 us | 24.30 us | 70.88 us | 931.7 us | 0.54 | 0.02 | - | - | - | 679.96 KB |
| SearchBleedingJsonNetSerializer | 931.4 us | 22.70 us | 64.76 us | 917.0 us | 0.53 | 0.04 | - | - | - | 679.26 KB |
| SearchBleedingJsonNetSerializerAsync | 926.7 us | 25.70 us | 74.97 us | 901.9 us | 0.53 | 0.05 | - | - | - | 679.96 KB |

1000 Stackoverflow questions

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|
| Search6x | 19.473 ms | 0.1661 ms | 0.1554 ms | 19.511 ms | 1.00 | 0.00 | 625.0000 | 281.2500 | 62.5000 | 3.71 MB |
| Search6xAsync | 15.165 ms | 0.2438 ms | 0.2162 ms | 15.121 ms | 0.78 | 0.01 | - | - | - | 3.71 MB |
| Search6xJsonNetSerializer | 50.683 ms | 0.9887 ms | 1.6790 ms | 50.935 ms | 2.63 | 0.09 | 4000.0000 | 1000.0000 | - | 29.6 MB |
| Search6xJsonNetSerializerAsync | 50.297 ms | 1.0050 ms | 1.6229 ms | 50.319 ms | 2.56 | 0.10 | 4000.0000 | 1000.0000 | - | 29.6 MB |
| SearchBleeding | 8.276 ms | 0.1817 ms | 0.5328 ms | 7.994 ms | 0.43 | 0.03 | - | - | - | 6.38 MB |
| SearchBleedingAsync | 7.887 ms | 0.1972 ms | 0.3012 ms | 7.790 ms | 0.41 | 0.02 | - | - | - | 6.38 MB |
| SearchBleedingJsonNetSerializer | 8.189 ms | 0.1745 ms | 0.4565 ms | 7.954 ms | 0.42 | 0.02 | - | - | - | 6.38 MB |
| SearchBleedingJsonNetSerializerAsync | 7.862 ms | 0.2369 ms | 0.4149 ms | 7.687 ms | 0.40 | 0.02 | - | - | - | 6.38 MB |

10,000 Stackoverflow questions

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|
| Search6x | 203.2 ms | 3.901 ms | 3.257 ms | 1.00 | 0.00 | 6000.0000 | 2000.0000 | - | 36.39 MB |
| Search6xAsync | 205.6 ms | 2.221 ms | 2.078 ms | 1.01 | 0.01 | 6000.0000 | 2000.0000 | - | 36.39 MB |
| Search6xJsonNetSerializer | 558.1 ms | 8.880 ms | 8.306 ms | 2.75 | 0.06 | 49000.0000 | 8000.0000 | - | 298.77 MB |
| Search6xJsonNetSerializerAsync | 564.7 ms | 5.126 ms | 4.544 ms | 2.78 | 0.05 | 49000.0000 | 8000.0000 | - | 298.77 MB |
| SearchBleeding | 117.4 ms | 1.359 ms | 1.271 ms | 0.58 | 0.01 | 4000.0000 | 1000.0000 | - | 90.12 MB |
| SearchBleedingAsync | 114.2 ms | 1.980 ms | 1.852 ms | 0.56 | 0.01 | 4000.0000 | 1000.0000 | - | 90.12 MB |
| SearchBleedingJsonNetSerializer | 118.2 ms | 1.572 ms | 1.471 ms | 0.58 | 0.01 | 4000.0000 | 1000.0000 | - | 90.12 MB |
| SearchBleedingJsonNetSerializerAsync | 112.8 ms | 2.244 ms | 1.989 ms | 0.55 | 0.01 | 4000.0000 | 1000.0000 | - | 90.12 MB |
  • Search6x* methods use the 6.4.0 nuget package
  • SearchBleeding* methods use the utf8json branch

A nice advantage of using utf8json as the internal serializer is that the handoff to a custom serializer can be done using a MemoryStream constructed from an ArraySegment<byte>. This avoids the need to read into a JToken and construct a Stream from the token, which greatly reduces serialization time and allocations.
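
As a rough illustration of that handoff, the sketch below wraps an ArraySegment<byte> in a read-only MemoryStream without copying the underlying bytes. SegmentStreams and AsReadOnlyStream are hypothetical names for this example only; how the branch actually passes the stream to a custom serializer is not shown here.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not a utf8json or NEST type.
public static class SegmentStreams
{
    // Exposes the segment as a read-only stream over the same array,
    // so no intermediate token tree or byte copy is needed.
    public static MemoryStream AsReadOnlyStream(ArraySegment<byte> segment)
    {
        return new MemoryStream(segment.Array, segment.Offset, segment.Count, writable: false);
    }

    // Example consumer: reads the whole stream back as UTF-8 text.
    public static string ReadAll(ArraySegment<byte> segment)
    {
        using (var stream = AsReadOnlyStream(segment))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A custom serializer that accepts a Stream can then deserialize directly from the slice of the response buffer that holds, for example, a single _source document.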

Allocated memory/op

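The allocated memory per op is higher across the board with utf8json. To determine whether this was a fixed amount of allocated memory per op, two searches were performed per benchmark method. The amount of allocated memory doubles (for example, SearchBleeding goes from 90.12 MB to 180.25 MB), so the allocation scales with the number of requests rather than being a fixed overhead.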

10,000 Stackoverflow questions with 2 search requests per benchmarked method

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|
| Search6x | 402.3 ms | 3.155 ms | 2.951 ms | 1.00 | 0.00 | 13000.0000 | 5000.0000 | 1000.0000 | 72.77 MB |
| Search6xAsync | 408.8 ms | 11.595 ms | 16.630 ms | 1.03 | 0.05 | 13000.0000 | 5000.0000 | 1000.0000 | 72.78 MB |
| Search6xJsonNetSerializer | 1,100.1 ms | 7.118 ms | 6.658 ms | 2.73 | 0.03 | 101000.0000 | 19000.0000 | 2000.0000 | 597.54 MB |
| Search6xJsonNetSerializerAsync | 1,037.2 ms | 5.950 ms | 5.566 ms | 2.58 | 0.02 | 100000.0000 | 19000.0000 | 1000.0000 | 597.54 MB |
| SearchBleeding | 259.7 ms | 3.799 ms | 3.368 ms | 0.65 | 0.01 | 9000.0000 | 4000.0000 | 1000.0000 | 180.25 MB |
| SearchBleedingAsync | 248.5 ms | 3.518 ms | 3.291 ms | 0.62 | 0.01 | 9000.0000 | 4000.0000 | 1000.0000 | 180.25 MB |
| SearchBleedingJsonNetSerializer | 260.5 ms | 2.991 ms | 2.652 ms | 0.65 | 0.01 | 9000.0000 | 4000.0000 | 1000.0000 | 180.25 MB |
| SearchBleedingJsonNetSerializerAsync | 246.3 ms | 3.146 ms | 2.943 ms | 0.61 | 0.01 | 9000.0000 | 4000.0000 | 1000.0000 | 180.25 MB |
@Mpdreamz
Member

Mpdreamz commented Nov 19, 2018

  • Investigate UTF8Json, fully json compliant?
  • Unify all byte pool techniques (in Json.NET, UTF8Json, Nest)
  • Investigate long running application characteristics.

@russcam
Contributor Author

russcam commented Nov 20, 2018

I've run a small, simple console app that continuously loops a search request returning a fixed response of 100 StackOverflow questions, and profiled memory usage with dotMemory, using both NEST 6.4.0 and the feature/utf8json-serializer branch.

Comparison overview

[image: dotMemory comparison overview]

Overall, utf8json allocates less total memory and fewer objects. utf8json allocates more total .NET memory, and uses slightly more of that memory than 6.4.0. The breakdown by namespace is shown below (there doesn't seem to be a nice way to export this data):

[image: dotMemory allocation breakdown by namespace]

  • (A) is utf8json
  • (B) is NEST 6.4.0

Unsurprisingly, NEST 6.4.0 allocates more objects and bytes from the Nest namespace. A large proportion of this can be attributed to the internalized Json.NET types. utf8json allocates more bytes from the System namespace, with Byte[] being the largest contributor.

image

I've attached a zip of both workspaces that can be opened in dotMemory 2018.2.3:

utf8json_vs_NEST640.zip

utf8json uses a ThreadStatic byte[] for synchronous serialization operations, and an internal ArrayPool<byte> implementation for async operations.
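
The two buffering strategies can be sketched roughly as follows. This is not utf8json's code: BufferStrategies is a hypothetical class, the 64 KB default size is an arbitrary assumption, and System.Buffers.ArrayPool<byte>.Shared stands in for utf8json's internal pool. The usual reason for the split is that a thread-static buffer cannot safely be held across an await, since the continuation may resume on a different thread.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Threading.Tasks;

// Hypothetical sketch; utf8json's own buffer handling differs in detail.
public static class BufferStrategies
{
    [ThreadStatic]
    private static byte[] _threadBuffer;

    // Synchronous path: one reusable buffer per thread, grown on demand.
    public static byte[] GetThreadBuffer(int minimumSize = 65536)
    {
        if (_threadBuffer == null || _threadBuffer.Length < minimumSize)
            _threadBuffer = new byte[minimumSize];
        return _threadBuffer;
    }

    // Asynchronous path: rent from a pool and always return the buffer,
    // because the code after the await may run on another thread.
    public static async Task<int> EncodeAsync(string json)
    {
        var buffer = ArrayPool<byte>.Shared.Rent(Encoding.UTF8.GetMaxByteCount(json.Length));
        try
        {
            var written = Encoding.UTF8.GetBytes(json, 0, json.Length, buffer, 0);
            await Task.Yield(); // stands in for an asynchronous write
            return written;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Either strategy keeps long-lived byte arrays around between calls, which is consistent with Byte[] showing up as the largest contributor in the profile above.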

@russcam
Contributor Author

russcam commented Nov 20, 2018

  • Investigate UTF8Json, fully json compliant?

Utf8Json only supports UTF-8 encoding, and may fail with any other encoding. According to the JSON RFC 8259:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8
when transmitting JSON text. However, the vast majority of JSON-
based software implementations have chosen to use the UTF-8 encoding,
to the extent that it is the only encoding that achieves
interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the
beginning of a networked-transmitted JSON text. In the interests of
interoperability, implementations that parse JSON texts MAY ignore
the presence of a byte order mark rather than treating it as an
error.

utf8json handles UTF-8 input, including input that starts with a BOM.
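
For reference, tolerating a byte order mark as the RFC permits amounts to skipping the three bytes EF BB BF when they lead the input. The sketch below shows that check in isolation; Utf8Bom and SkipBom are hypothetical names, and this is not a claim about how utf8json implements it internally.

```csharp
using System;

// Hypothetical helper for illustration; not a utf8json type.
public static class Utf8Bom
{
    // Returns a view of the input without a leading UTF-8 byte order
    // mark (EF BB BF). Input without a BOM is returned unchanged.
    public static ArraySegment<byte> SkipBom(byte[] bytes)
    {
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return new ArraySegment<byte>(bytes, 3, bytes.Length - 3);
        return new ArraySegment<byte>(bytes, 0, bytes.Length);
    }
}
```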

@abrobston

Have you considered Jil?

@russcam
Contributor Author

russcam commented Feb 6, 2019

@abrobston Jil was a consideration. There are a few reasons why utf8json was pursued:

  • The API is more similar to Json.NET's, with the thinking that this will make the replacement process easier to perform
  • utf8json performance appears to be slightly better than Jil for the use case
  • Jil looks to be missing some features that are needed by the client, e.g. IDictionary<TKey, TValue> implementations where TKey is not a string or enumeration
  • Extending/modifying utf8json to fit the needs of the client looked to be a straightforward venture from the POC
  • Many types in NEST require custom deserialization routines, which utf8json supports but Jil does not appear to

More generally, the API surface of utf8json looks small enough to introduce types such as Span<T>, Memory<T> and pipelines for frameworks that support them, to improve serialization further.

@russcam
Contributor Author

russcam commented Feb 8, 2019

An update on progress! Commits have been going into the feature/utf8json-serializer branch (https://github.com/elastic/elasticsearch-net/tree/feature/utf8json-serializer), with only a few unit tests and integration tests still failing (these are expected right now and will be fixed).

I want to summarize the changes thus far, and also to itemize the remaining items that I'm aware of:

Breaking Changes

  • Dynamic code generation with Reflection.Emit

    utf8json uses Reflection.Emit to generate formatters for types. Reflection.Emit is not supported on all platforms, e.g. UWP, Xamarin.iOS, Xamarin.Android.

  • DynamicResponse deserializes JSON arrays to List<object>

    SimpleJson deserializes JSON arrays to object[], but utf8json, through PrimitiveObjectFormatter, deserializes them to List<object>. Whilst this could be changed to call .ToArray() on List<object>, current thinking is that the change to returning List<object> would be preferred for allocation and performance reasons.

  • LazyDocument holds byte array

    LazyDocument holds the value in a byte array copied from the response bytes, as opposed to the current implementation that holds an internal JObject deserialized from bytes. This means that the bytes may represent indented JSON; care needs to be taken if serializing LazyDocument for an API that expects newline delimited JSON, such as the bulk API.

Outstanding

  • Support PrettyJson in serialized requests when Formatting.Indented specified

    utf8json is optimized to generate formatters that write no indentation; prettifying JSON is left to reading the resulting serialized bytes and writing them with indentation to another buffer.

  • Support extending NEST types by deriving and implementing properties

    Most formatters are using known interfaces to retrieve type/property information for the purposes of serialization. This approach will not support extending NEST types. Using the fallback IJsonFormatter<object> formatter will work, but does mean that a formatter is generated for each type.

  • Low level client exception structure should match SimpleJson

  • Double values should always be serialized with a decimal point

    Floating point and decimal numbers that do not have fractional values are not serialized with a decimal point by utf8json. This saves a few bytes, but would break Elasticsearch type mapping inference.

  • Source include utf8json in Elasticsearch.Net, and mark InternalsVisibleTo() for NEST and dynamic assemblies

    Currently added as a git submodule. Whilst source including makes pulling in upstream changes trickier, specific client customizations should be easier to make than if the serializer were in a separate assembly.

  • Dynamic assemblies should be strongly named

    This allows formatters within NEST to be internal, with InternalsVisibleTo() applied for the dynamic assemblies.
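
To illustrate the item above about always serializing double values with a decimal point, here is a minimal sketch of the intended output shape, written against the base class library only. DoubleText and ToJsonNumber are hypothetical names; the actual fix would live in the serializer's number writing code and would write bytes rather than strings.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not a utf8json or NEST type.
public static class DoubleText
{
    // Formats a double so that whole values keep a fractional part,
    // e.g. 1.0 becomes "1.0" rather than "1".
    public static string ToJsonNumber(double value)
    {
        if (double.IsNaN(value) || double.IsInfinity(value))
            throw new ArgumentOutOfRangeException(nameof(value), "JSON cannot represent NaN or Infinity.");

        var text = value.ToString("R", CultureInfo.InvariantCulture);

        // Append ".0" unless the text already has a fraction or an exponent.
        if (text.IndexOf('.') < 0 && text.IndexOf('E') < 0)
            text += ".0";
        return text;
    }
}
```

With this, a double field holding 1 is written as `1.0`, so a dynamically mapped field is inferred as a floating point type rather than as an integer type.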

@russcam
Contributor Author

russcam commented Apr 10, 2019

I've moved the outstanding items into separate issues to keep their scope focused. I'm going to close this issue now as utf8json has been implemented in 7.x.
