Passing a Stream Utf8JsonWriter to a JsonSerializer.Serialize method results in Pending bytes being misreported #66102

Closed
Tracked by #77018
mattchidley opened this issue Mar 2, 2022 · 20 comments · Fixed by #102541
Labels
area-System.Text.Json bug in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@mattchidley

mattchidley commented Mar 2, 2022

Description

When using JsonSerializer to serialize to a stream, if more than one custom converter is in play we are seeing that the internal buffer cannot be written to the stream until the entire object has been serialized into it. This can result in significant memory usage and high allocations in ArrayPool.Rent and Resize called from PooledByteBufferWriter. It seems like the serialization state/context gets trapped when a second custom converter is in play, both sync and async, and we lose the ability to write to the stream and clear the buffer until the entire object has been serialized. This is extremely problematic when serializing large objects with custom converters: if we had a 5 GB object in memory, we'd effectively need a similarly large buffer just to hold its serialized representation.

I am aware of this issue here -- and understand that a workaround can be to return IAsyncEnumerable in an async context -- but not much info is provided for the sync case.

My question is: in a sync context, is there a recommended approach to buffering/streaming outside of what is posted in the docs? The docs recommend creating a writer for the stream and then passing the writer to the serializer -- this is also not ideal and can cause extremely high CPU if flushing after serializing with a custom converter, especially if there is a large result with many properties requiring custom conversion. I'm currently playing around with wrapping the stream in the BCL BufferedStream, but would really appreciate any recommendations from your side to remedy these perf issues.
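For reference, here is roughly what that docs-style approach looks like as a sketch -- the per-element flush cadence is arbitrary, and CreateManyTestObjects/ParentConverter/ChildConverter refer to the sample below:

using System.Text.Json;

var options = new JsonSerializerOptions { Converters = { new ParentConverter(), new ChildConverter() } };

using var fileStream = File.Create("testFile.json");
using var writer = new Utf8JsonWriter(fileStream);

writer.WriteStartArray();
foreach (var item in CreateManyTestObjects())
{
    // Each element still goes through the converters; flushing between elements
    // pushes the writer's buffered bytes to the stream instead of letting them accumulate.
    JsonSerializer.Serialize(writer, item, options);
    writer.Flush();
}
writer.WriteEndArray();
writer.Flush();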

Configuration

Here is a sample where you can observe the buffer being filled and not cleared/flushed until the entire payload is serialized to the buffer:

using System.Text.Json;
using System.Text.Json.Serialization;

const string fileName = "testFile.json";
const int numTestObjects = 1000;
const int stringLength = 64;
Random random = new Random();

using (var fileStream = File.Create(fileName))
{
    JsonSerializer.Serialize(fileStream, CreateManyTestObjects(), new JsonSerializerOptions { Converters = { new ParentConverter(), new ChildConverter() }});
}
File.Delete(fileName);

string RandomString(int length)
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}

IEnumerable<Parent> CreateManyTestObjects()
{
    for (var i = 0; i < numTestObjects; i++)
    {
        yield return new Parent(new Child(RandomString(stringLength)));
    }
}

class Parent
{
    public Child Child { get; }
    public Parent(Child child)
    {
        Child = child;
    }
}

class Child
{
    public string Prop { get; }
    public Child(string prop)
    {
        Prop = prop;
    }
}

class ParentConverter : JsonConverter<Parent>
{
    public override Parent Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options) => 
        throw new NotImplementedException();

    public override void Write(Utf8JsonWriter writer, Parent value, JsonSerializerOptions options)
    {
        writer.WriteStartObject();
        writer.WritePropertyName(nameof(value.Child));
        
        // this will call the converter
        JsonSerializer.Serialize(writer, value.Child, options);
        
        writer.WriteEndObject();
    }
}

class ChildConverter : JsonConverter<Child>
{
    public override Child Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options) => 
        throw new NotImplementedException();

    public override void Write(Utf8JsonWriter writer, Child value, JsonSerializerOptions options)
    {
        writer.WriteStartObject();
        writer.WritePropertyName(nameof(value.Prop));
        writer.WriteStringValue(value.Prop);
        writer.WriteEndObject();
    }
}
@mattchidley mattchidley added the tenet-performance Performance related issue label Mar 2, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Mar 2, 2022
@ghost

ghost commented Mar 2, 2022

Tagging subscribers to this area: @dotnet/area-system-text-json
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

Actually this seems like a bug with Utf8JsonWriter when used with the relevant JsonSerializer.Serialize overload. Minimal reproduction:

using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

var options = new JsonSerializerOptions { DefaultBufferSize = 512 };
options.Converters.Add(new MyStringConverter());

using var stream = Console.OpenStandardOutput();
JsonSerializer.Serialize(stream, CreateManyTestObjects(), options);
//await JsonSerializer.SerializeAsync(stream, CreateManyTestObjects(), options); // issue also impacts the async method

IEnumerable<string> CreateManyTestObjects()
{
    int i = 0;
    while (true)
    {
        if (++i % 10_000 == 0)
        {
            System.Threading.Thread.Sleep(500);
        }
        yield return "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    }
}

class MyStringConverter : JsonConverter<string>
{
    public override string Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options) =>
        throw new NotImplementedException();

    public override void Write(Utf8JsonWriter writer, string value, JsonSerializerOptions options)
    {
        writer.WriteStringValue(value);
        // JsonSerializer.Serialize(writer, value); // replace with this line to trigger bug
    }
}

When passing the Utf8JsonWriter instance to the nested Serialize method it seems like it's misreporting the pending bytes as committed:

[screenshot: debugger showing the writer's bytes reported as committed after the nested Serialize call]
which causes the ShouldFlush check in the enumerable converter to not fire. Compare with a regular converter:

[screenshot: debugger showing the same bytes reported as pending when using a regular converter]
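For anyone without a debugger handy, the same counters can be observed roughly like this -- a sketch of the MyStringConverter.Write above with illustrative logging added (logging to stderr, since stdout is the target stream):

public override void Write(Utf8JsonWriter writer, string value, JsonSerializerOptions options)
{
    long pendingBefore = writer.BytesPending;
    long committedBefore = writer.BytesCommitted;

    JsonSerializer.Serialize(writer, value); // the nested call that triggers the bug

    // With the bug, bytes produced here show up as committed rather than pending,
    // so the enumerable converter's ShouldFlush check never trips.
    Console.Error.WriteLine(
        $"pending {pendingBefore} -> {writer.BytesPending}, committed {committedBefore} -> {writer.BytesCommitted}");
}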

cc @layomia @steveharter

@eiriktsarpalis eiriktsarpalis added bug and removed untriaged New issue has not been triaged by the area owner labels Mar 3, 2022
@eiriktsarpalis eiriktsarpalis added this to the 7.0.0 milestone Mar 3, 2022
@mattchidley
Author

Just to add for anyone who might follow this -- a large enough property can cause an OutOfMemoryException when trying to resize the buffer from Utf8JsonWriter.Grow.

As noted in my initial report, creating a BufferedStream-based setup to flush the writer could be a workaround to prevent OOM. So instead of serializing directly to the stream, we would construct a Utf8JsonWriter over the BufferedStream and pass the writer to the serializer instead of the stream.
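Roughly, the shape of that setup as a sketch -- the buffer size is arbitrary, and the per-element flush is one reading of how the writer's own buffer gets drained:

using var fileStream = File.Create("testFile.json");
using var buffered = new BufferedStream(fileStream, bufferSize: 64 * 1024);
using var writer = new Utf8JsonWriter(buffered);

foreach (var item in CreateManyTestObjects())
{
    JsonSerializer.Serialize(writer, item, options);
    writer.Flush(); // frequent small flushes are coalesced by the BufferedStream
}
writer.Flush();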

@eiriktsarpalis -- we're looking to Microsoft here to recommend a workaround or possibly move up the bug fix timeline?

@eiriktsarpalis
Member

possibly move up the bug fix timeline?

We will try to get this fixed for .NET 7, but it doesn't meet the bar for .NET 6 servicing.

@mattchidley
Author

mattchidley commented Apr 6, 2022

@eiriktsarpalis would this issue also be present in an async context? We noticed your comment here and are hoping that these are two separate issues with separate fixes. It's important to stress that in both sync and async contexts we can balloon a buffer enough to cause an out-of-memory crash.

@eiriktsarpalis
Member

Hi @mattchidley, I believe the two issues are unrelated.

@krwq krwq modified the milestones: 7.0.0, 8.0.0 Jul 6, 2022
@krwq
Member

krwq commented Jul 6, 2022

A custom contract resolver should help, since you should be able to avoid many of the custom converter usages.

@mattchidley
Author

I'm a little disgruntled that this case was pushed out to 8.0: #63795

@krwq can you advise on how you think a custom contract resolver would help us avoid the memory-buffering issues of custom conversion?

Say we have an IAsyncEnumerable of MyType, which has a string property and requires custom conversion: if we want to serialize a large string, we will balloon a buffer to do so.

I'm surprised this isn't higher priority considering the performance implications and the risk of application crashes.

@eiriktsarpalis
Member

can you advise on how you think a custom contract resolver would help us avoid the memory-buffering issues of custom conversion?

It depends. What use case made you write a custom converter in the first place? It might be the case that you could change the contract instead (which would still use the streaming built-in converters).

@mattchidley
Author

@eiriktsarpalis thanks, just reading a bit more... I might be able to replace our troublesome custom converter with a contract resolver.

@eiriktsarpalis
Member

Here are the official docs in case they haven't been shared in this thread already:

https://learn.microsoft.com/en-us/dotnet/standard/serialization/system-text-json/custom-contracts
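A minimal sketch of what that can look like, assuming (hypothetically) that the goal is just to tweak how Child from the earlier repro is serialized without a JsonConverter<Child>, so the built-in, streaming-friendly converters stay in play:

using System.Text.Json;
using System.Text.Json.Serialization.Metadata;

var options = new JsonSerializerOptions
{
    TypeInfoResolver = new DefaultJsonTypeInfoResolver
    {
        Modifiers =
        {
            static typeInfo =>
            {
                if (typeInfo.Type != typeof(Child))
                    return;

                // Illustrative contract tweak: rename the property here instead of
                // hand-writing the JSON inside a custom converter.
                foreach (JsonPropertyInfo property in typeInfo.Properties)
                {
                    if (property.Name == nameof(Child.Prop))
                        property.Name = "prop";
                }
            }
        }
    }
};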

@mattchidley
Author

mattchidley commented Dec 21, 2022

@eiriktsarpalis I don't think you understand the problem fully. I took another look at your repro above and it does not wholly represent the problem... I think there is another issue here:

One large property must be entirely buffered before it can be flushed. Here is your repro but adapted:

using System.Text.Json;

var options = new JsonSerializerOptions { };
Random random = new Random();

using var stream = Console.OpenStandardOutput();
await JsonSerializer.SerializeAsync(stream, CreateManyTestObjectsAsync(), options);

async IAsyncEnumerable<Child> CreateManyTestObjectsAsync()
{
    int i = 0;
    while (true)
    {
        yield return new Child { Prop = RandomString(50_000_000) };
    }
}

string RandomString(int length)
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}

class Child
{
    public string Prop { get; set; }
}

In this case we fill and resize the buffer in JsonSerializer.Write.Stream.cs WriteStreamAsync with as many characters as we can possibly generate, all the way up to OutOfMemoryException.

Please note there are no custom converters or JsonSerializerOptions in play. Am I missing something here? We came up with a custom solution that works well for us in a sync context, but in async we cannot clear the buffer and flush the stream because of how your internal buffer management works. WriteCore will never call a flush because it's writing to the internal buffer.

This is causing some significant issues. IMO async serialization in System.Text.Json is completely broken because of this -- how can you expect people to use it?

@Shamus03

Shamus03 commented Dec 21, 2022

@mattchidley I ran your latest repro code (with a tiny tweak to write to a null stream instead of stdout) and got an out of memory exception, so +1 to that.

But after one additional simple change (a call to await Task.Yield()), I was able to run this code for 20+ minutes without running out of memory:

using System.Text.Json;

var options = new JsonSerializerOptions { };
Random random = new Random();

using var stream = Stream.Null;
await JsonSerializer.SerializeAsync(stream, CreateManyTestObjectsAsync(), options);

async IAsyncEnumerable<Child> CreateManyTestObjectsAsync()
{
    while (true)
    {
        await Task.Yield();
        yield return new Child { Prop = RandomString(50_000_000) };
    }
}

string RandomString(int length)
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}

class Child
{
    public string Prop { get; set; }
}

I'm not claiming this is a fix or that you should use Task.Yield(), but it does change the behavior in an interesting way. I think without that yield, the framework never gets the chance to take the main thread and clean up its buffers.

@eiriktsarpalis
Member

eiriktsarpalis commented Dec 21, 2022

@mattchidley I think what you're seeing is a result of Utf8JsonWriter not being able to stream large string values. There's a conversation about adding support for partial writes in #67337.

IMO async serialization in System.Text.Json is completely broken because of this

Surely that's an exaggeration. I get that it can cause issues when you need to serialize strings whose lengths are on the order of tens of millions of characters, but if anything that's a niche concern. In any case, #67337 is tracking a fix for that.

@mattchidley
Author

I appreciate the prompt responses, especially considering the time of year -- thanks for that.

Based on the exchange we've had so far it seems like Microsoft doesn't think this is a problem and is not treating it like a priority. My case as well as the one you noted above were both submitted in Mar 2022.

Some questions:

  • Will this even be addressed in 8.0?
  • Is your official stance that the issues I reported with asynchronously serializing large objects are a niche concern?
  • Are there any notes in the official docs that outline the current limitations surrounding large-object serialization?
  • What is your suggestion? Is it to serialize large objects synchronously with our custom stream implementation? What about IAsyncEnumerable?

I'm not exaggerating. There are some serious implications to these issues that are going to cost a lot of time and effort to work around or back out... we were sold this grand idea of System.Text.Json, but the more we use it, the more problems we encounter.

@eiriktsarpalis
Member

Based on the exchange we've had so far it seems like Microsoft doesn't think this is a problem and is not treating it like a priority. My case as well as the one you noted above were both submitted in Mar 2022.

I'm sorry if you think that's the case. We do take all reported issues seriously, but please do keep in mind that our backlog is substantial, so we need to prioritize our efforts according to impact and business needs. This means that certain issues might remain unresolved for longer than otherwise desired.

Is your official stance that the issues I reported with asynchronously serializing large objects are a niche concern?

The buffering issue that you reported concerns large strings specifically, not large objects in general. My personal opinion is far from what somebody might consider "official", but yes, I do think that serializing strings containing tens of millions of characters isn't what one might call a common scenario. That doesn't mean we're not acknowledging the problem though, which is why we're still tracking #67337.

What is your suggestion?

I would recommend avoiding the use of very large strings until the issue is resolved. This can be done either by chunking your data or by extracting the raw data into a separate resource.
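For example, one possible reading of the chunking idea, as a sketch (the DTO shape and the 1 MB chunk size are assumptions, not an official pattern):

IEnumerable<string> Chunk(string value, int chunkSize = 1024 * 1024)
{
    for (int i = 0; i < value.Length; i += chunkSize)
        yield return value.Substring(i, Math.Min(chunkSize, value.Length - i));
}

class ChildDto
{
    // The large string is exposed as a sequence of smaller segments, so no single
    // huge string value has to sit in the writer's buffer; the reader reassembles them.
    public IEnumerable<string> PropChunks { get; set; } = Enumerable.Empty<string>();
}

// Usage (illustrative): new ChildDto { PropChunks = Chunk(child.Prop) }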

we were sold this grand idea of System.Text.Json but the more we use of it the more problems we encounter.

Rome wasn't built in a day... STJ is a young library, and Microsoft is invested in continually improving it.

@olgolo

olgolo commented Jan 5, 2023

@mattchidley What workaround did you use in your sync code?

@mattchidley
Author

@olgolo I described it briefly above -- but you could implement a Stream that flushes during Write after a certain number of bytes have been written. You could then pass this new Stream to the Utf8JsonWriter that gets used for serialization. This gives you the opportunity to flush and clear while writing.
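Roughly, the shape of that wrapper as a sketch -- the name AutoFlushStream and the 64 KB threshold are illustrative, and this mirrors the idea described above rather than our exact implementation:

class AutoFlushStream : Stream
{
    private readonly Stream _inner;
    private readonly int _flushThreshold;
    private long _bytesSinceFlush;

    public AutoFlushStream(Stream inner, int flushThreshold = 64 * 1024)
    {
        _inner = inner;
        _flushThreshold = flushThreshold;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _inner.Write(buffer, offset, count);
        _bytesSinceFlush += count;
        if (_bytesSinceFlush >= _flushThreshold)
        {
            _inner.Flush();
            _bytesSinceFlush = 0;
        }
    }

    public override void Flush()
    {
        _inner.Flush();
        _bytesSinceFlush = 0;
    }

    // Only the members a write-only wrapper needs are implemented meaningfully.
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;
    public override long Position { get => _inner.Position; set => throw new NotSupportedException(); }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}

// Usage (illustrative): build the Utf8JsonWriter over the wrapper and pass the writer,
// not the raw stream, to the serializer, flushing the writer as you go:
// using var writer = new Utf8JsonWriter(new AutoFlushStream(File.Create("testFile.json")));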

@eiriktsarpalis eiriktsarpalis changed the title System.Text.Json using significant memory when serializing with custom converters Passing a Stream Utf8JsonWriter to a JsonSerializer.Serialize method results in Pending bytes being misreported Jan 27, 2023
@eiriktsarpalis
Member

Can confirm that the original issue as described in #66102 (comment) still occurs in .NET 7. This is a serious problem that can result in unexpected memory leaks and we should try to fix it.

For issues pertaining to large string serialization I would recommend upvoting and contributing to the discussion in #67337. For streaming support in user-defined converters I would recommend upvoting and contributing to the discussion in #63795.

@krwq krwq added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Jan 30, 2023
@eiriktsarpalis eiriktsarpalis modified the milestones: 8.0.0, Future May 31, 2023
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label May 22, 2024
@BrennanConroy BrennanConroy modified the milestones: Future, 9.0.0 May 28, 2024
@BrennanConroy
Member

Semi-accidentally fixed this in #102541
