Improved "GetPublishedAtUtcAsync" method efficiency #436

morended · 2023-06-28T17:09:24Z

The current implementation of "GetPublishedAtUtcAsync" fetches all the metadata for a given package version, which causes a significant delay. To reduce latency, this PR changes "GetPublishedAtUtcAsync" to fetch only the upload time.
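For readers unfamiliar with the idea, here is a minimal sketch of reading only the publish time from an already-fetched registry document. It is written in Python for brevity and is not the project's C# code. It assumes an NPM-style document whose top-level `time` object maps version strings to ISO 8601 timestamps; the function name `get_published_at_utc` and the sample document are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def get_published_at_utc(registry_json: str, version: str) -> Optional[datetime]:
    """Return the publish time of `version`, or None if it is not listed.

    Assumption: the document has a top-level "time" object mapping
    version strings to ISO 8601 timestamps.
    """
    doc = json.loads(registry_json)
    raw = doc.get("time", {}).get(version)
    if raw is None:
        return None
    # Older Python versions do not accept a trailing "Z", so normalize it.
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)


# Hypothetical sample document, trimmed to the fields this sketch reads.
sample = (
    '{"name": "example", "time": {'
    '"created": "2020-01-01T00:00:00.000Z", '
    '"1.0.0": "2020-01-02T03:04:05.678Z"}}'
)
published = get_published_at_utc(sample, "1.0.0")
```

Only one key is looked up per call, so no per-version metadata objects need to be built.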

morended · 2023-06-28T17:19:14Z

/azp run

azure-pipelines · 2023-06-28T17:19:19Z

Commenter does not have sufficient privileges for PR 436 in repo microsoft/OSSGadget

gfs · 2023-06-28T17:21:10Z

/azp run

azure-pipelines · 2023-06-28T17:21:20Z

Azure Pipelines successfully started running 1 pipeline(s).

morended · 2023-06-28T17:56:51Z

/azp run

azure-pipelines · 2023-06-28T17:56:55Z

Commenter does not have sufficient privileges for PR 436 in repo microsoft/OSSGadget

gfs · 2023-06-28T18:15:25Z

/azp run

azure-pipelines · 2023-06-28T18:15:35Z

Azure Pipelines successfully started running 1 pipeline(s).

pmalmsten

This is generally straightforward - it gets straight to the point, and should run very fast. I'm curious whether you have measured the speed of this change before and after for a package having many thousands of versions, like say https://www.npmjs.com/package/@graphql-codegen/cli?

As a side note, I would point out that the implementations of GetPackageMetadataAsync are left without any improvement with this approach. That's not necessarily a problem, but there is a certain elegance in the prior implementation's approach of leaning on GetPackageMetadataAsync as the one thing that parses package metadata (whereas this approach duplicates some of that parsing logic). That's not a dealbreaker - the approach in this PR simplifies things - but it's a tradeoff that might need to be revisited later.

However, there are other more significant points that we must address when we start thinking about caching:

- The memory cache object employed by GetJsonCache has a size limit of about 8 MB, whereas the size of the JSON for graphql-codegen/cli is ~25 MB. So we likely need to increase the size limit of the cache, perhaps to 100 MB. To go with that, we would likely need to bump the memory allocation for Terrapin containers by an additional 50 to 100 MB.
- When adding entries to the cache, we do not set an expiration time. As a result, objects may be cached indefinitely. This does not matter much for OSS Gadget CLIs that terminate after a few moments, but it will matter a lot for Terrapin processes, which keep running for hours or days at a time. We should probably specify a sliding expiration of something like 30 minutes and an absolute expiration of something like 6 hours to ensure that data in Terrapin stays fresh. It would be ideal to do this everywhere we set values in the cache.
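To make the two points above concrete, here is a rough sketch of a size-limited cache with both a sliding and an absolute expiration. It is written in Python for illustration only and is not the project's C# cache. The class `ExpiringCache`, its parameters, and the choice to refuse entries that do not fit are all assumptions of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Entry:
    value: Any
    size: int           # caller-supplied size estimate, in bytes
    created: float      # when the entry was stored
    last_access: float  # when the entry was last read


class ExpiringCache:
    def __init__(self, size_limit: int, sliding: float, absolute: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.size_limit = size_limit  # total bytes allowed
        self.sliding = sliding        # seconds since last read
        self.absolute = absolute      # seconds since insertion
        self.clock = clock
        self.entries: Dict[str, Entry] = {}
        self.used = 0

    def _expired(self, entry: Entry, now: float) -> bool:
        return (now - entry.created >= self.absolute
                or now - entry.last_access >= self.sliding)

    def _remove(self, key: str) -> None:
        entry = self.entries.pop(key, None)
        if entry is not None:
            self.used -= entry.size

    def set(self, key: str, value: Any, size: int) -> bool:
        now = self.clock()
        self._remove(key)
        # Drop stale entries first so they do not count against the limit.
        for stale in [k for k, e in self.entries.items() if self._expired(e, now)]:
            self._remove(stale)
        if self.used + size > self.size_limit:
            return False  # this sketch simply refuses entries that do not fit
        self.entries[key] = Entry(value, size, now, now)
        self.used += size
        return True

    def get(self, key: str) -> Optional[Any]:
        entry = self.entries.get(key)
        if entry is None:
            return None
        now = self.clock()
        if self._expired(entry, now):
            self._remove(key)
            return None
        entry.last_access = now  # a read resets only the sliding window
        return entry.value


MB = 1024 * 1024
cache = ExpiringCache(size_limit=100 * MB, sliding=30 * 60, absolute=6 * 3600)
stored = cache.set("graphql-codegen/cli", "<json document>", 25 * MB)
```

With an 8 MB limit the same `set` call would return `False`, which is the situation described in the first point. The absolute limit matters for entries that are read often: the sliding window alone would keep them alive indefinitely.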

Those are what come to mind first - getting caching right can be tricky, so I'll let you know if I think of anything else later. Let me know if you have any questions on the above.

src/Shared/PackageManagers/NPMProjectManager.cs

pmalmsten · 2023-06-30T18:33:49Z

We also discussed offline that NPM does not set the Content-Length header on HTTP responses (which causes GetJsonCache to significantly underestimate the size of cached objects), and that we should fix the response size estimation in OSSGadget to count the bytes in the response body instead of relying on an optional header.
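A small sketch of the difference between the two estimates, in Python for illustration only. The helper names `size_from_header` and `size_from_body` and the sample response are hypothetical and are not OSSGadget functions.

```python
from typing import Mapping, Optional


def size_from_header(headers: Mapping[str, str]) -> Optional[int]:
    """Header-based estimate: unavailable when the server omits Content-Length."""
    # A plain dict lookup is case-sensitive; real HTTP clients normalize names.
    raw = headers.get("Content-Length")
    return int(raw) if raw is not None and raw.isdigit() else None


def size_from_body(body: bytes) -> int:
    """Body-based estimate: always available once the body has been read."""
    return len(body)


# Hypothetical response without a Content-Length header.
headers = {"Content-Type": "application/json"}
body = b'{"name": "example"}'

header_estimate = size_from_header(headers)
body_estimate = size_from_body(body)
```

Here `header_estimate` is `None` while `body_estimate` is the real byte count, so a cache that sizes entries from the header alone has nothing reliable to work with.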

morended · 2023-06-30T20:29:16Z

> This is generally straightforward - it gets straight to the point, and should run very fast. I'm curious whether you have measured the speed of this change before and after for a package having many thousands of versions, like say https://www.npmjs.com/package/@graphql-codegen/cli?
>
> As a side note, I would point out that the implementations of GetPackageMetadataAsync are left without any improvement with this approach. That's not necessarily a problem, but there is a certain elegance in the prior implementation's approach of leaning on GetPackageMetadataAsync as the one thing that parses package metadata (whereas this approach duplicates some of that parsing logic). That's not a dealbreaker - the approach in this PR simplifies things - but it's a tradeoff that might need to be revisited later.
>
> However, there are other more significant points that we must address when we start thinking about caching:
>
> - The memory cache object employed by GetJsonCache has a size limit of about 8 MB, whereas the size of the JSON for graphql-codegen/cli is ~25 MB. So we likely need to increase the size limit of the cache, perhaps to 100 MB. To go with that, we would likely need to bump the memory allocation for Terrapin containers by an additional 50 to 100 MB.
> - When adding entries to the cache, we do not set an expiration time. As a result, objects may be cached indefinitely. This does not matter much for OSS Gadget CLIs that terminate after a few moments, but it will matter a lot for Terrapin processes, which keep running for hours or days at a time. We should probably specify a sliding expiration of something like 30 minutes and an absolute expiration of something like 6 hours to ensure that data in Terrapin stays fresh. It would be ideal to do this everywhere we set values in the cache.
>
> Those are what come to mind first - getting caching right can be tricky, so I'll let you know if I think of anything else later. Let me know if you have any questions on the above.

As discussed offline, I increased the cache limit and added cache expiration.

morended · 2023-06-30T20:35:06Z

> We also discussed offline that NPM does not set the Content-Length header on HTTP responses (which causes GetJsonCache to significantly underestimate the size of cached objects), and that we should fix the response size estimation in OSSGadget to count the bytes in the response body instead of relying on an optional header.

I have changed this to compute the content length from the HTTP response body.

pmalmsten

Good progress, but a few things could still use tweaking.

src/Shared/PackageManagers/BaseProjectManager.cs

src/Shared/PackageManagers/PyPIProjectManager.cs

src/Shared/PackageManagers/BaseProjectManager.cs

pmalmsten

Latest changes look great! Thanks Mounika.

jpinz · 2023-07-06T16:54:19Z

/azp run

azure-pipelines · 2023-07-06T16:54:29Z

Azure Pipelines successfully started running 1 pipeline(s).

Updated GetPublishedTimeStamp Api to fetch the published time stamp a…

4ab7c56

…lone from the cached json doc.

fix tests.

3ce7ae3

gfs approved these changes Jun 28, 2023

morended changed the title ~~Improved "GetPublishedAtUtcAsync" method's efficiency~~ Improved "GetPublishedAtUtcAsync" method efficiency Jun 28, 2023

pmalmsten requested changes Jun 29, 2023

pmalmsten reviewed Jun 30, 2023

Addressed PR comments.

d508c52

Increased cache size. Added CacheInvalidation and CacheExpiration.

pmalmsten requested changes Jun 30, 2023

morended force-pushed the morended/cache-jsondoc branch from 1821588 to ee22d66 Compare June 30, 2023 23:18

Addressed PR comments.

2e7609a

morended force-pushed the morended/cache-jsondoc branch from ee22d66 to 2e7609a Compare July 5, 2023 16:39

pmalmsten approved these changes Jul 6, 2023

jpinz merged commit 8da1fcc into microsoft:main Jul 6, 2023
