cache: key mismatch makes cache inefficient #950

Open
chlunde opened this issue Feb 24, 2021 · 4 comments

chlunde commented Feb 24, 2021

When using the cache with .Layers(), the cached layers are never reused.

This is because the layers are retrieved under a different hash than the one they are stored under.


I created a proof-of-concept draft with a symlink from diffID -> digest, so that Get (and Delete) accept both hashes, but the changes were larger than I expected; I'm not sure if this is the right approach.

https://github.com/google/go-containerregistry/compare/main...chlunde:cache-digest-and-diffid?expand=1
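
For illustration, here is a minimal toy sketch of the mismatch (invented types and names, not the real cache package API): the blob is stored under its compressed digest, but later looked up by its uncompressed diffID, so the lookup always misses.

// Toy sketch of the key mismatch; types and names are illustrative only.
package main

import "fmt"

type hash = string

type layer struct {
	digest hash // hash of the compressed (gzipped) blob
	diffID hash // hash of the uncompressed tarball
}

type toyCache map[hash][]byte

// put stores the blob keyed by the compressed digest.
func (c toyCache) put(l layer, blob []byte) { c[l.digest] = blob }

// get looks a blob up by whichever hash the caller happens to have.
func (c toyCache) get(h hash) ([]byte, bool) {
	b, ok := c[h]
	return b, ok
}

func main() {
	c := toyCache{}
	l := layer{digest: "sha256:digest-of-gzip", diffID: "sha256:diffid-of-tar"}
	c.put(l, []byte("gzipped layer bytes"))

	// The cached image ends up asking for the layer by diffID, not digest:
	if _, ok := c.get(l.diffID); !ok {
		fmt.Println("cache miss: stored under digest, requested by diffID")
	}
}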

@imjasonh (Collaborator)

It sounds like the issue is that the layers are compressed in the registry, so they're pulled and cached as compressed blobs (named by digest) but then accessed by diffID, which means they're pulled and decompressed again every time.

A smarter cache would pull and cache the compressed layer by digest, then, when it is accessed by diffID, lazily decompress it and cache the result by diffID. Each later request for that diffID would then find the decompressed version already cached.

The trick there is maintaining an association between digest <-> diffID, so that the compressed or decompressed version can be returned when only the other hash is available.

Unfortunately, simply symlinking between them won't do the right thing, I think, because the layer contents by digest and by diffID are different -- one is compressed and the other isn't. I do think we can be a lot smarter here, though.
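
As a rough sketch of that idea (again toy types, not the real cache package API), one option is to store both representations rather than symlinking, keeping a diffID -> digest mapping and decompressing lazily on the first diffID access:

// Toy sketch: remember which digest belongs to which diffID, store the
// compressed blob on put, and materialize the uncompressed copy lazily.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

type hash = string

type smartCache struct {
	blobs    map[hash][]byte // compressed keyed by digest, uncompressed keyed by diffID
	toDigest map[hash]hash   // diffID -> digest
}

func newSmartCache() *smartCache {
	return &smartCache{blobs: map[hash][]byte{}, toDigest: map[hash]hash{}}
}

// put stores the compressed blob under its digest and records the association.
func (c *smartCache) put(digest, diffID hash, compressed []byte) {
	c.blobs[digest] = compressed
	c.toDigest[diffID] = digest
}

// get answers by digest directly, or by diffID after lazily decompressing
// (and caching) the corresponding compressed blob.
func (c *smartCache) get(h hash) ([]byte, error) {
	if b, ok := c.blobs[h]; ok {
		return b, nil
	}
	digest, ok := c.toDigest[h]
	if !ok {
		return nil, fmt.Errorf("not found: %s", h)
	}
	zr, err := gzip.NewReader(bytes.NewReader(c.blobs[digest]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	uncompressed, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	c.blobs[h] = uncompressed // cache the decompressed copy by diffID
	return uncompressed, nil
}

func main() {
	// Compress some fake layer contents.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("layer tarball contents"))
	zw.Close()

	c := newSmartCache()
	c.put("sha256:digest-of-gzip", "sha256:diffid-of-tar", buf.Bytes())

	got, err := c.get("sha256:diffid-of-tar") // first access: lazy decompression
	fmt.Println(string(got), err)             // "layer tarball contents <nil>"
}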

The cache package isn't the most-loved package in the repo by a long shot -- the fact that we weren't caching lazily until yesterday (#951) is a sign of that 😅. We need to invest more time -- and possibly break the API -- to do smarter things in the future.

@jonjohnsonjr (Collaborator)

One thing I've considered in the past (which would maybe be a ~breaking change) would be to maintain the cache as an OCI image layout. We can store descriptors pointing to the blobs (but really, anything) in index.json and use annotations to help with things like this, e.g.:

index.json

{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 28565893,
      "digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa",
      "annotations": {
        "dev.ggcr.cache.diffid": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009"
      }
    },{
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar",
      "size": 75271168,
      "digest": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009",
      "annotations": {
        "dev.ggcr.cache.digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa"
      }
    }
  ]
}

These both reference the same logical thing, but the first one is gzipped. For access efficiency, maybe we'd want to maintain a separate mapping as well, so that you don't have to parse index.json and scan it (O(N)) to determine whether an equivalent key exists, but this would be a good place to start (and it's portable).
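
As a rough sketch of how a lookup against such an index.json could work (field names follow the OCI image spec, and the dev.ggcr.cache.* annotation keys are just the hypothetical ones from the example above):

// Toy sketch: resolve a hash against index.json by matching either the
// descriptor's digest or its annotated counterpart hash.
package main

import (
	"encoding/json"
	"fmt"
)

type descriptor struct {
	MediaType   string            `json:"mediaType"`
	Size        int64             `json:"size"`
	Digest      string            `json:"digest"`
	Annotations map[string]string `json:"annotations"`
}

type index struct {
	SchemaVersion int          `json:"schemaVersion"`
	Manifests     []descriptor `json:"manifests"`
}

// lookup is the O(N) scan mentioned above; a real implementation might keep
// a side map from hash -> descriptor to avoid rescanning on every Get.
func lookup(idx index, h string) (descriptor, bool) {
	for _, d := range idx.Manifests {
		if d.Digest == h ||
			d.Annotations["dev.ggcr.cache.digest"] == h ||
			d.Annotations["dev.ggcr.cache.diffid"] == h {
			return d, true
		}
	}
	return descriptor{}, false
}

func main() {
	raw := []byte(`{
	  "schemaVersion": 2,
	  "manifests": [{
	    "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
	    "size": 28565893,
	    "digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa",
	    "annotations": {
	      "dev.ggcr.cache.diffid": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009"
	    }
	  }]
	}`)

	var idx index
	if err := json.Unmarshal(raw, &idx); err != nil {
		panic(err)
	}

	// Asking by diffID still finds the gzipped entry via its annotation.
	d, ok := lookup(idx, "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009")
	fmt.Println(ok, d.MediaType)
}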

@imjasonh (Collaborator)

> One thing I've considered in the past (which would maybe be a ~breaking change) would be to maintain the cache as an OCI image layout. We can store descriptors pointing to the blobs (but really, anything) in index.json and use annotations to help with things like this

One advantage of this would be that we could more easily ship the cache up to a registry, where it could be pulled/shared with others (not sure if that's compelling, but it's possible).

A downside would be that every cache.Put would have to open index.json, decode the JSON, and possibly write back a modified index.json in addition to the blob contents, and every cache.Get would at least have to open and decode index.json -- compared to an os.Stat and os.Create today. There are also concurrent read/write concerns to consider in both cases. Plain files probably aren't your best backing store if that matters, but the Cache interface should let you do anything you want if you care.
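
To make that cost comparison concrete, here is a toy sketch (file layout and helper names invented for illustration, not the real cache package): a file-per-hash cache answers a Get with a single open, while an index.json-backed cache has to read and decode the index on every call unless it also keeps an in-memory copy (and some locking for concurrent writers).

// Toy sketch comparing per-Get work for two on-disk layouts.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// fsGet: one file per hash, so a Get is a single open (or stat).
func fsGet(dir, hash string) (*os.File, error) {
	return os.Open(filepath.Join(dir, hash))
}

// indexGet: read and decode index.json, scan it for the hash, then open the
// blob from an OCI-style blobs/<algorithm>/<hex> directory.
func indexGet(dir, hash string) (*os.File, error) {
	raw, err := os.ReadFile(filepath.Join(dir, "index.json"))
	if err != nil {
		return nil, err
	}
	var idx struct {
		Manifests []struct {
			Digest string `json:"digest"`
		} `json:"manifests"`
	}
	if err := json.Unmarshal(raw, &idx); err != nil {
		return nil, err
	}
	for _, m := range idx.Manifests {
		if m.Digest == hash {
			algo, hex, _ := strings.Cut(hash, ":")
			return os.Open(filepath.Join(dir, "blobs", algo, hex))
		}
	}
	return nil, os.ErrNotExist
}

func main() {
	dir, err := os.MkdirTemp("", "cachesketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Populate both layouts with the same blob (the ":" in the file name is
	// fine on Unix; a real layout might escape it).
	const h = "sha256:0123abcd"
	os.WriteFile(filepath.Join(dir, h), []byte("blob"), 0o644)
	os.MkdirAll(filepath.Join(dir, "blobs", "sha256"), 0o755)
	os.WriteFile(filepath.Join(dir, "blobs", "sha256", "0123abcd"), []byte("blob"), 0o644)
	os.WriteFile(filepath.Join(dir, "index.json"),
		[]byte(`{"manifests":[{"digest":"sha256:0123abcd"}]}`), 0o644)

	for name, get := range map[string]func(string, string) (*os.File, error){
		"fsGet": fsGet, "indexGet": indexGet,
	} {
		f, err := get(dir, h)
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		fmt.Println(name, "opened", filepath.Base(f.Name()))
		f.Close()
	}
}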

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Keep fresh with the 'lifecycle/frozen' label.
