Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cosmos 3.X | Address memory leak in Direct TCP transport client #9211

Conversation

David-Noble-at-work
Copy link

@David-Noble-at-work David-Noble-at-work commented Mar 19, 2020

Repro

  1. Run azure-sdk-for-java/sdk/cosmos direct tests
git checkout master
mvn -f pom.service.xml -Dgpg.skip -DargLine="-Dio.netty.leakDetectionLevel=paranoid -Dio.netty.leakDetection.targetRecords=255" -Pdirect verify | tee azure-cosmos.direct.output
  1. Run spring-data-cosmosdb tests after tweakingspring-data-cosmosdb/src/test/resources/application.properties:
git checkout master
mvn -Dskip.integration.tests=false -DargLine="-Dio.netty.leakDetectionLevel=paranoid -Dio.netty.leakDetection.targetRecords=255" clean package | tee spring-data-cosmosdb.output
  1. Find each occurrence of the word "LEAK" in the output files from the test runs.
grep grep --context=15 --line-number "LEAK" /path/to/azure-cosmos.direct.output /path/to/spring-data-cosmosdb.output

Observe

All tests pass and a number of leaks were detected. Examples:

  • RntbdRequestTimer

    2020-03-19 17:02:15,531       [TestNG-method=validateConsistentLSNForDirectHttpsClient-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: HashedWheelTimer.release() was not called before it's garbage-collected. See https://netty.io/wiki/r
    counted-objects.html for more information.
    Recent access records:
    Created at:
        io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:275)
        io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:218)
        io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:197)
        io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:179)
        io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:136)
        com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdRequestTimer.<init>(RntbdRequestTimer.java:24)
        com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdServiceEndpoint$Provider.<init>(RntbdServiceEndpoint.java:336)
        com.azure.data.cosmos.internal.directconnectivity.RntbdTransportClient.<init>(RntbdTransportClient.java:67)
        com.azure.data.cosmos.internal.directconnectivity.RntbdTransportClient.<init>(RntbdTransportClient.java:73)
        com.azure.data.cosmos.internal.directconnectivity.StoreClientFactory.<init>(StoreClientFactory.java:37)
      com.azure.data.cosmos.internal.RxDocumentClientImpl.initializeDirectConnectivity(RxDocumentClientImpl.java:284)
      com.azure.data.cosmos.internal.RxDocumentClientImpl.init(RxDocumentClientImpl.java:276)
    
  • RntbdResponse.release

    2020-03-19 17:03:04,131       [cosmos-rntbd-nio-14-5] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
    Recent access records:
    #1:
         com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdResponse.lambda$release$0(RntbdResponse.java:198)
         java.util.concurrent.atomic.AtomicInteger.accumulateAndGet(AtomicInteger.java:289)
         com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdResponse.release(RntbdResponse.java:184)
         com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdResponse.release(RntbdResponse.java:172)
         com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdRequestManager.channelRead(RntbdRequestManager.java:196)
    

Fix

This is a port from v2. See Azure/azure-cosmosdb-java#326 for a description of the change.

Test results

You will see in the before/after test results that this fix eliminates a memory leak in the Direct TCP stack. Performance numbers are provided for comparison with v4 master. There's some evidence in the comparison between v4 fix and v4 master that performance has improved some.

You will also see that for two test profiles (direct and emulator) there are some lingering leak detections. This is different than on the 2.6.X and 4.X branches where these leaks are not detected. I will get to the bottom of these issues in a second PR:

  • Path 1 (direct)

    com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdRequestTimer.<init>(RntbdRequestTimer.java:24)	
    com.azure.data.cosmos.internal.directconnectivity.rntbd.RntbdServiceEndpoint$Provider.<init>(RntbdServiceEndpoint.java:329)
    com.azure.data.cosmos.internal.directconnectivity.RntbdTransportClient.<init>(RntbdTransportClient.java:67)
    com.azure.data.cosmos.internal.directconnectivity.RntbdTransportClient.<init>(RntbdTransportClient.java:73)
    com.azure.data.cosmos.internal.directconnectivity.StoreClientFactory.<init>(StoreClientFactory.java:37)
    com.azure.data.cosmos.internal.RxDocumentClientImpl.initializeDirectConnectivity(RxDocumentClientImpl.java:284)
    
  • Path 2 (emulator)

      io.netty.buffer.AdvancedLeakAwareByteBuf.getByte(AdvancedLeakAwareByteBuf.java:154)
      io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:351)
      io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:202)
      io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:493)
      io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
      io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:271)
    
  • azure-cosmos before/after test results

    Client

    • East US
    • Standard F16s_v2 (16 vcpus, 32 GiB memory)
    • Linux appserver-lin-3 5.0.0-1032-azure 34-Ubuntu SMP Mon Feb 10 19:37:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

    Cosmos

    Test Script

    set -o errexit -o nounset
    mvn clean install
    mkdir -p azure-cosmos.results
    
    for profile in direct e2e emulator examples fast long non-emulator; do
        timeout 55m mvn -P$profile -DargLine="-Dio.netty.leakDetectionLevel=paranoid -Dio.netty.leakDetection.targetRecords=255" verify | tee azure-cosmos.results/Verify-P$profile.output
    done
  • spring-data-cosmosdb before/after test results

    Client

    • Dev machine on 100 gb/s WiFi link
    • MacBook Pro 2019 8-core 64 GB RAM
    • Darwin Danoble-MBP-6.thenobles.us 19.3.0 Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64 x86_64 i386 MacBookPro16,1 Darwin

    Cosmos

    Test Script

    mkdir ~/spring-data-cosmosdb.results
    git checkout master
    mvn clean install -Dskip.integration.tests=false -DargLine="-Dio.netty.leakDetectionLevel=paranoid -Dio.netty.leakDetection.targetRecords=255" 2>&1 | tee ~/spring-data-cosmosdb.results/azure-cosmos-3.7.0.output
    git checkout issue/azure-cosmos/memory-leak
    mvn clean install -Dskip.integration.tests=false -DargLine="-Dio.netty.leakDetectionLevel=paranoid -Dio.netty.leakDetection.targetRecords=255" 2>&1 | tee ~/spring-data-cosmosdb.results/azure-cosmos-3.8.0-beta.1.output

    ** Instructions for viewing before/after test results **

    • Download tar ball, spring-data-cosmosdb.results.tar.gz, and then

    • From the directory containing the downloaded tar ball:

       mkdir spring-data-cosmosdb.results
       cd spring-data-cosmosdb.results
       tar -xf ../spring-data-cosmosdb.results.results.tar.gz
       grep -r --files-with-matches "LEAK" .
      
    • Notice this grep output (from the before test runs)

      ./azure-cosmos.results.before/Verify-Pdirect.output
      ./azure-cosmos.results.before/Verify-Pnon-emulator.output
      ./azure-cosmos.results.before/Verify-Pfast.output
      ./azure-cosmos.results.before/Verify-Pexamples.output
      ./azure-cosmos.results/azure.cosmos-3.7.0.output
      
    • Also notice this grep output (from the after test runs)

      ./azure-cosmos.results.after/Verify-Pdirect.output
      ./azure-cosmos.results.after/Verify-Pemulator.output
      
  • Performance numbers

    Client

    • East US
    • Standard F16s_v2 (16 vcpus, 32 GiB memory)
    • Linux appserver-lin-3 5.0.0-1032-azure 34-Ubuntu SMP Mon Feb 10 19:37:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

    Cosmos

image

image

@David-Noble-at-work David-Noble-at-work self-assigned this Mar 19, 2020
@David-Noble-at-work David-Noble-at-work added Cosmos cosmos:v3-item cosmos:v4-item Indicates this feature will be shipped as part of V4 release train labels Mar 19, 2020
@David-Noble-at-work
Copy link
Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@David-Noble-at-work David-Noble-at-work changed the title DO NOT MERGE | Address memory leak in Direct TCP transport client DO NOT MERGE | Cosmos 3.X | Address memory leak in Direct TCP transport client Mar 19, 2020
@David-Noble-at-work David-Noble-at-work changed the title DO NOT MERGE | Cosmos 3.X | Address memory leak in Direct TCP transport client Cosmos 3.X | Address memory leak in Direct TCP transport client Mar 23, 2020
@David-Noble-at-work David-Noble-at-work marked this pull request as ready for review March 24, 2020 00:26
Copy link
Contributor

@moderakh moderakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kushagraThapar kushagraThapar merged commit 72d66ac into Azure:cosmos_v3_3.7.0 Mar 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cosmos:v4-item Indicates this feature will be shipped as part of V4 release train Cosmos
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants