-
Notifications
You must be signed in to change notification settings - Fork 544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache: key mismatch makes cache inefficient #950
Comments
It sounds like the issue is that the layers are compressed in the registry, so they're pulled and cached as compressed (named by digest), then accessed by diffID, so they're always pulled and decompressed each time. A smarter cache would pull and cache the compressed layer by digest, then when it was accessed by diffID, lazily decompress and cache that by diffID. Then each time diffID is requested later, we would already have a decompressed version cached. The trick there is mapping an association between digest<->diffID, so that the (de)compressed version can be returned when only the other one is available. Unfortunately, simply symlinking between them won't do the right thing I think, because the layer contents by digest and diffID are different -- one's compressed and the other isn't. I do think we can be a lot smarter here, though. The cache package isn't the most-loved package in the repo by a longshot -- the fact that we weren't lazily caching until yesterday (#951) is a sign of that 😅 . We need to invest more time -- and possibly break the API -- to do smarter things in the future. |
One thing I've considered in the past (and would be a ~breaking change, maybe) would be to maintain the cache as an OCI image layout. We can store descriptors pointing to the blobs (but really, anything) in
{
"schemaVersion": 2,
"manifests": [
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"size": 28565893,
"digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa",
"annotations": {
"dev.ggcr.cache.diffid": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009"
}
},{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar",
"size": 75271168,
"digest": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009",
"annotations": {
"dev.ggcr.cache.digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa"
}
}
]
} These both reference the same logical thing, but the first one is gzipped. For access efficiency, maybe we want to maintain a separate mapping as well so that you don't have to parse JSON and scan (O(N)) |
One advantage of this would be that we could more easily ship the cache up to a registry, where it could be pulled/shared with others (not sure if that's compelling, but it's possible). A downside would be that every |
This issue is stale because it has been open for 90 days with no |
When using cache .Layers(), the cache is never used.
This is because the layers are retrieved by a different hash then they are stored.
I created a proof of concept/draft with a symlink from diffid -> digest, so Get (and Delete) accept both hashes, but the changes were larger than I expected, I'm not sure if this is the right approach.
https://github.com/google/go-containerregistry/compare/main...chlunde:cache-digest-and-diffid?expand=1
The text was updated successfully, but these errors were encountered: