
Reduce FileStream allocations #49539

Closed
5 tasks done
adamsitnik opened this issue Mar 12, 2021 · 13 comments
Labels
area-System.IO tenet-performance Performance related issue
Milestone

Comments

@adamsitnik
Member

adamsitnik commented Mar 12, 2021

P0

  • Profile FileStream memory allocations using the Visual Studio Memory Profiler (a minimal measurement sketch follows below)
  • Get rid of the allocations (if there are any)
  • Use the IValueTaskSource suggestion from @stephentoub

P1

The most important scenario is async IO (with buffering both enabled and disabled) for the overloads that return ValueTask. (@stephentoub please correct me if I am wrong)

Other aspects (sync read & write, seeking, setting the length and position, etc.) are already non-allocating.
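
For the first P0 item, here is a minimal measurement sketch that does not require the VS profiler, assuming GC.GetTotalAllocatedBytes is acceptable as a coarse check; the file path and sizes are illustrative and not taken from the benchmarks:

// Minimal allocation check around a batch of buffered async reads (illustrative values).
using System;
using System.IO;
using System.Threading.Tasks;

internal static class AllocationCheck
{
    public static async Task Main()
    {
        const string path = "alloc-test.bin";            // illustrative path
        File.WriteAllBytes(path, new byte[1024 * 1024]);  // 1 MiB of data to read back

        byte[] userBuffer = new byte[512];
        long before = GC.GetTotalAllocatedBytes(precise: true);

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                                       bufferSize: 4096, FileOptions.Asynchronous))
        {
            while (await fs.ReadAsync(userBuffer.AsMemory()) > 0) { }
        }

        long after = GC.GetTotalAllocatedBytes(precise: true);
        Console.WriteLine($"Allocated ~{after - before:N0} managed bytes during the reads.");
    }
}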

@adamsitnik adamsitnik added area-System.IO tenet-performance Performance related issue labels Mar 12, 2021
@adamsitnik adamsitnik added this to the 6.0.0 milestone Mar 12, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 12, 2021
@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Mar 12, 2021
@carlossanlop
Member

carlossanlop commented Mar 15, 2021

@adamsitnik @jozkee

I am collecting memory allocation information for all the methods we call in the FileStream benchmarks in the performance repo. The code for my console app can be found here: https://github.com/carlossanlop/experiments/tree/AllocFS/FileStreamAllocations

I ran the app 3 times:

  1. Targeting .NET 5.0, for a baseline comparison.
  2. Targeting .NET 6.0 with this PR on top, and with DOTNET_SYSTEM_IO_USELEGACYFILESTREAM=1, to force using the Legacy implementation. This is to verify the behavior remains largely the same as in 5.0.
  3. Same as 2, but with DOTNET_SYSTEM_IO_USELEGACYFILESTREAM=0, to force using the new code. This is where we should focus our efforts. (A launch sketch follows this list.)
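
For reference, a hedged sketch of one way runs 2 and 3 could be launched with the environment variable applied, assuming it is read once at process start; the executable path is illustrative:

using System.Diagnostics;

var startInfo = new ProcessStartInfo("FileStreamAllocations.exe"); // illustrative path to the console app
// Set to "1" or "0" depending on which FileStream implementation the run should force,
// per the run descriptions above.
startInfo.Environment["DOTNET_SYSTEM_IO_USELEGACYFILESTREAM"] = "1";
using Process? process = Process.Start(startInfo);
process?.WaitForExit();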

Here is the spreadsheet where I am storing all the comparisons (read-only, anyone can view).

I will comment on each comparison when I'm done collecting the data.

@carlossanlop
Member

@carlossanlop could you please include the new *NoBuffering benchmarks as well?

@adamsitnik Already did. Among my ReadAsync and WriteAsync argument combinations, I added a couple that were using bufferSize=1.

@carlossanlop
Member

carlossanlop commented Mar 17, 2021

Most of the new allocations in my .NET 6.0 executions (with the FileStream Legacy code enabled or disabled) are coming from System.Diagnostics.Tracing.EventSource. I'd like to find a way to disable it so I can publish cleaner profiling results.

Regardless, I was able to find some large allocations that we could potentially improve in the new strategy-based code (the Legacy code is not affected).

The largest allocations are happening inside ReadAsyncSlowPath and WriteAsyncSlowPath. They are noticeable when calling ReadAsync or WriteAsync in a loop with a FileStream buffer size of 4 KB, a user buffer size of 512 B, and FileOptions.Asynchronous.

Both ReadAsyncSlowPath and WriteAsyncSlowPath are called from their respective ReadAsync and WriteAsync methods when the semaphoreLockTask does not complete successfully or there is data that needs to be flushed.
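
A simplified, hypothetical sketch of that fast-path/slow-path split (illustrative only, not the actual runtime source): the read completes synchronously from the in-memory buffer when possible, and only the slow path compiles into an async state machine, which is what the AsyncStateMachineBox entries in the profile below correspond to.

using System;
using System.Threading;
using System.Threading.Tasks;

internal sealed class BufferedReadSketch
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1); // serializes async operations
    private readonly byte[] _buffer = new byte[4096];
    private int _readPos, _readLen, _writePos;

    public ValueTask<int> ReadAsync(Memory<byte> destination, CancellationToken cancellationToken = default)
    {
        Task semaphoreLockTask = _semaphore.WaitAsync(cancellationToken);
        if (semaphoreLockTask.IsCompletedSuccessfully && _writePos == 0)
        {
            int cached = _readLen - _readPos;
            if (cached >= destination.Length)
            {
                // Fast path: the lock was acquired synchronously, nothing needs flushing,
                // and the request fits in the buffered data, so no state machine is started.
                try
                {
                    new ReadOnlySpan<byte>(_buffer, _readPos, destination.Length).CopyTo(destination.Span);
                    _readPos += destination.Length;
                    return new ValueTask<int>(destination.Length);
                }
                finally
                {
                    _semaphore.Release();
                }
            }
        }

        // Slow path: the async method below compiles into a state machine
        // (the AsyncStateMachineBox<ReadAsyncSlowPath> entries in the profile).
        return ReadAsyncSlowPath(semaphoreLockTask, destination, cancellationToken);
    }

    private async ValueTask<int> ReadAsyncSlowPath(Task semaphoreLockTask, Memory<byte> destination, CancellationToken cancellationToken)
    {
        await semaphoreLockTask.ConfigureAwait(false); // wait for exclusive access
        try
        {
            // Flush pending write data and read from the underlying handle here;
            // elided because this sketch only shows where the allocation comes from.
            return 0;
        }
        finally
        {
            _semaphore.Release();
        }
    }
}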

When running Adam's rewrite with the legacy code, these were the top allocations:

Type Allocations
- System.IO.FileStreamCompletionSource 25,600
- System.Threading.Tasks.Task<System.Int32> 25,600
- System.SByte[] 2,483
- System.Threading.Tasks.Task.SetOnInvokeMres 2,247

When running Adam's rewrite with the refactored code, these were the top allocations:

Type Allocations
- System.Threading.Tasks.Task<System.Int32> 25,611
- System.IO.FileStreamCompletionSource 25,600
- System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Int32>.AsyncStateMachineBox<<ReadAsyncSlowPath>d__39> 25,574
- System.SByte[] 2,920

Note:

  • The SByte allocations come from System.Diagnostics.Tracing.EventSource.

Here is the callstack:

Function Name Allocations Bytes Module Name
| + System.Runtime.CompilerServices.AsyncTaskMethodBuilder<int>.GetStateMachineBox(System.Threading.Tasks.Task<!0>) 25,574 5,319,392 System.Private.CoreLib.dll
|| + System.IO.BufferedFileStreamStrategy.ReadAsyncSlowPath() 25,574 5,319,392 System.Private.CoreLib.dll
||| + System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start<T>() 25,574 5,319,392 System.Private.CoreLib.dll
|||| + System.IO.BufferedFileStreamStrategy.ReadAsyncSlowPath(System.Threading.Tasks.Task, System.Memory<byte>, System.Threading.CancellationToken) 25,574 5,319,392 System.Private.CoreLib.dll
||||| + System.IO.BufferedFileStreamStrategy.ReadAsync(System.Memory<byte>, System.Threading.CancellationToken) 25,574 5,319,392 System.Private.CoreLib.dll
|||||| + System.IO.FileStream.ReadAsync(System.Memory<byte>, System.Threading.CancellationToken) 25,574 5,319,392 System.Private.CoreLib.dll
||||||| + MyNamespace.MyClass.ReadAsync() 25,574 5,319,392 MyProject.dll
|||||||| - System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, object) 25,463 5,296,304 System.Private.CoreLib.dll
|||||||| - System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start<T>() 100 20,800 System.Private.CoreLib.dll
|||||||| - System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, object) 11 2,288 System.Private.CoreLib.dll

@stephentoub
Member

The SByte allocations come from System.Diagnostics.Tracing.EventSource

Without seeing the trace, I can't be 100% sure, but I'm 99.9% sure these are coming from the JIT, basically 1 per method for which an inlining decision is made, and it's unavoidable.

System.Threading.Tasks.Task.SetOnInvokeMres

This comes from code synchronously blocking on a Task... most likely benchmark.net itself blocking while waiting for an async Task benchmark to complete.

System.Threading.Tasks.Task<System.Int32> and System.IO.FileStreamCompletionSource

These are the two that will be avoided by implementing the "use IValueTaskSource" item from the P0 list above.
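
For context, a minimal sketch of the general IValueTaskSource pattern being suggested, built on ManualResetValueTaskSourceCore<int>; this is not the actual FileStream change, just the shape of the technique: one reusable object backs the returned ValueTask<int> for each operation, so no Task<int> or completion-source object is allocated per call (one pending operation at a time).

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical reusable completion source; supports one pending operation at a time.
internal sealed class ReusableReadCompletionSource : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core; // mutable struct, do not make readonly

    // Hands out a ValueTask<int> tied to the current version of the core.
    public ValueTask<int> GetValueTask() => new ValueTask<int>(this, _core.Version);

    // Called by the I/O completion callback when the operation finishes.
    public void SetCompleted(int bytesTransferred) => _core.SetResult(bytesTransferred);
    public void SetFailed(Exception error) => _core.SetException(error);

    // IValueTaskSource<int> plumbing delegates to the core and resets it for reuse.
    public int GetResult(short token)
    {
        try { return _core.GetResult(token); }
        finally { _core.Reset(); } // ready for the next operation
    }

    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);

    public void OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);
}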

@carlossanlop
Member

carlossanlop commented Mar 17, 2021

Without seeing the trace, I can't be 100% sure, but I'm 99.9% sure these are coming from the JIT, basically 1 per method for which an inlining decision is made, and it's unavoidable.

The top SByte allocation has this callstack:

Function Name Allocations Bytes Module Name
| + System.Diagnostics.Tracing.EventSource.CreateManifestAndDescriptors(System.Type, string, System.Diagnostics.Tracing.EventSource, System.Diagnostics.Tracing.EventManifestOptions) 141 8,152 System.Private.CoreLib.dll
|| + System.Diagnostics.Tracing.EventSource.EnsureDescriptorsInitialized() 141 8,152 System.Private.CoreLib.dll
||| + System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) 141 8,152 System.Private.CoreLib.dll
|||| + System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, string, System.String[]) 141 8,152 System.Private.CoreLib.dll
||||| + System.Diagnostics.Tracing.NativeRuntimeEventSource..ctor() 141 8,152 System.Private.CoreLib.dll
|||||| + System.Diagnostics.Tracing.NativeRuntimeEventSource..cctor() 141 8,152 System.Private.CoreLib.dll
||||||| + System.Diagnostics.Tracing.EventListener..cctor() 141 8,152 System.Private.CoreLib.dll
|||||||| + System.Diagnostics.Tracing.EventListener.get_EventListenersLock() 141 8,152 System.Private.CoreLib.dll
||||||||| + System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, string, System.String[]) 141 8,152 System.Private.CoreLib.dll
|||||||||| + System.Diagnostics.Tracing.RuntimeEventSource..ctor() 141 8,152 System.Private.CoreLib.dll
||||||||||| + System.StartupHookProvider.ProcessStartupHooks() 141 8,152 System.Private.CoreLib.dll

This comes from code synchronously blocking on a Task... most likely benchmark.net itself blocking while waiting for an async Task benchmark to complete.

I didn't use benchmark.net; I ran a console app and forced it to consume the bits I compiled from my runtime repo, with Adam's PR as the baseline.

Here's the code for the console app. Not sure what could be causing a block, since this is what I am running:

// Assumed usings, enclosing class, and constants (elided in the original snippet);
// the constant values are inferred from the identifier names and the paths are illustrative.
using System;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    private const string SourceFilePath = "source.bin";           // illustrative path
    private const string DestinationFilePath = "destination.bin"; // illustrative path
    private const int OneMibibyte = 1024 * 1024;
    private const int HalfKibibyte = 512;
    private const int FourKibibytes = 4096;

    public static async Task Main()
    {
        File.Delete(SourceFilePath);
        File.Delete(DestinationFilePath);
        File.WriteAllBytes(SourceFilePath, new byte[OneMibibyte]);
        File.WriteAllBytes(DestinationFilePath, new byte[OneMibibyte]);

        for (int i = 1; i <= 250; i++)
        {
            await ReadAsync(FourKibibytes, HalfKibibyte, FileOptions.Asynchronous);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static async Task<long> ReadAsync(
        int bufferSize,      // Use: 1, FourKibibytes
        int userBufferSize,  // Use: HalfKibibyte, FourKibibytes
        FileOptions options) // Use: None, Asynchronous
    {
        CancellationToken cancellationToken = CancellationToken.None;
        byte[] buffer = new byte[userBufferSize];
        var userBuffer = new Memory<byte>(buffer);
        long bytesRead = 0;
        using (var fileStream = new FileStream(SourceFilePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize, options))
        {
            while (bytesRead < OneMibibyte)
            {
                bytesRead += await fileStream.ReadAsync(userBuffer, cancellationToken);
            }
        }

        return bytesRead;
    }
}

These are the two that will be avoided by implementing the "use IValueTaskSource" item

Looking forward to fixing this!

@stephentoub
Member

The top SByte allocation has this callstack:

Click the "Show Native Code" button and you'll see something very different: these are the JIT-related allocations I was referring to. You can ignore them.

Not sure what could be causing a block

Ah. I just noticed you said those are from the legacy implementation. I expect they're coming from ReadAsync itself: #16341.

@adamsitnik
Member Author

With #51363, ReadAsync is allocating the minimum, but WriteAsync for buffered FileStream is still allocating something.


It would be great to get rid of this allocation as well

@stephentoub
Member

It would be great to get rid of this allocation as well

Presumably that's the byte[] inside of BufferedFileStreamStrategy. To hide that from such a trace, you'd need to either pool it or use native memory. How easy / hard that is will depend on how robust we want to be. We already have a semaphore gating access to async operations, so that same synchronization could be used to ensure a buffer isn't returned to a pool in Dispose while operations are erroneously in flight. Doing the same for the synchronous operations will require adding new synchronization to those code paths.
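
A hedged sketch of the pooling idea under those assumptions (illustrative, not the shipped implementation): the buffer is rented from ArrayPool<byte>, and DisposeAsync takes the same semaphore that already gates async operations before returning it, so the array cannot go back to the pool while an async read or write is still in flight.

using System;
using System.Buffers;
using System.Threading;
using System.Threading.Tasks;

internal sealed class PooledBufferHolder : IAsyncDisposable
{
    private readonly SemaphoreSlim _asyncGate = new SemaphoreSlim(1, 1); // gates async operations
    private byte[]? _buffer;

    public PooledBufferHolder(int bufferSize) => _buffer = ArrayPool<byte>.Shared.Rent(bufferSize);

    public async ValueTask<int> DoAsyncOperationAsync(CancellationToken cancellationToken = default)
    {
        await _asyncGate.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            byte[] buffer = _buffer ?? throw new ObjectDisposedException(nameof(PooledBufferHolder));
            // ... perform the buffered I/O using 'buffer' here (elided in this sketch) ...
            return 0;
        }
        finally
        {
            _asyncGate.Release();
        }
    }

    public async ValueTask DisposeAsync()
    {
        // Taking the gate guarantees no async operation still references the array
        // when it goes back to the pool.
        await _asyncGate.WaitAsync().ConfigureAwait(false);
        byte[]? toReturn = _buffer;
        _buffer = null;
        if (toReturn is not null)
        {
            ArrayPool<byte>.Shared.Return(toReturn);
        }
        _asyncGate.Release();
    }
}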

@adamsitnik
Member Author

Presumably that's the byte[] inside of BufferedFileStreamStrategy.

The benchmark allocates 2,257,976 bytes while the bufferSize is 4096, so we still don't know where the missing 2,253,856 comes from (2,257,976 - 4096 - 3*8).

@stephentoub
Member

so we still don't know where the missing 2,253,856 comes from

I'd guess task/state machine allocations from BufferedFileStreamStrategy.WriteAsync's async method. It should show up pretty easily in a profile.

@adamsitnik
Member Author

It was #51489

@adamsitnik
Member Author

Since we got rid of all of the allocations except for the buffer, for which we agreed that we need a safer and more universal solution than just allowing users to pass the buffer, I am going to close the issue.

@carlossanlop carlossanlop removed their assignment May 19, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 18, 2021