cache: key mismatch makes cache inefficient #950

Open
chlunde opened this issue Feb 24, 2021 · 4 comments

chlunde commented Feb 24, 2021

When using the cache with .Layers(), the cached layers are never reused.

This is because the layers are retrieved under a different hash than the one they are stored under.


I created a proof-of-concept draft with a symlink from diffID -> digest, so that Get (and Delete) accept both hashes, but the changes were larger than I expected; I'm not sure if this is the right approach.

https://github.com/google/go-containerregistry/compare/main...chlunde:cache-digest-and-diffid?expand=1
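
For illustration, here is a minimal toy sketch of the mismatch (invented types and names, not the real cache package API): the blob is stored under its compressed digest, but later looked up by its uncompressed diffID, so the lookup always misses.

// Toy sketch of the key mismatch; types and names are illustrative only.
package main

import "fmt"

type hash = string

type layer struct {
	digest hash // hash of the compressed (gzipped) blob
	diffID hash // hash of the uncompressed tarball
}

type toyCache map[hash][]byte

// put stores the blob keyed by the compressed digest.
func (c toyCache) put(l layer, blob []byte) { c[l.digest] = blob }

// get looks a blob up by whichever hash the caller happens to have.
func (c toyCache) get(h hash) ([]byte, bool) {
	b, ok := c[h]
	return b, ok
}

func main() {
	c := toyCache{}
	l := layer{digest: "sha256:digest-of-gzip", diffID: "sha256:diffid-of-tar"}
	c.put(l, []byte("gzipped layer bytes"))

	// The cached image ends up asking for the layer by diffID, not digest:
	if _, ok := c.get(l.diffID); !ok {
		fmt.Println("cache miss: stored under digest, requested by diffID")
	}
}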

@imjasonh (Collaborator)

It sounds like the issue is that the layers are compressed in the registry, so they're pulled and cached as compressed blobs (named by digest) but then accessed by diffID, which means they're pulled and decompressed again every time.

A smarter cache would pull and cache the compressed layer by digest, then, when it is accessed by diffID, lazily decompress it and cache the result by diffID. Each later request for that diffID would then find the decompressed version already cached.

The trick there is maintaining an association between digest <-> diffID, so that the compressed or decompressed version can be returned when only the other hash is available.

Unfortunately, simply symlinking between them won't do the right thing, I think, because the layer contents by digest and by diffID are different -- one is compressed and the other isn't. I do think we can be a lot smarter here, though.
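
As a rough sketch of that idea (again toy types, not the real cache package API), one option is to store both representations rather than symlinking, keeping a diffID -> digest mapping and decompressing lazily on the first diffID access:

// Toy sketch: remember which digest belongs to which diffID, store the
// compressed blob on put, and materialize the uncompressed copy lazily.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

type hash = string

type smartCache struct {
	blobs    map[hash][]byte // compressed keyed by digest, uncompressed keyed by diffID
	toDigest map[hash]hash   // diffID -> digest
}

func newSmartCache() *smartCache {
	return &smartCache{blobs: map[hash][]byte{}, toDigest: map[hash]hash{}}
}

// put stores the compressed blob under its digest and records the association.
func (c *smartCache) put(digest, diffID hash, compressed []byte) {
	c.blobs[digest] = compressed
	c.toDigest[diffID] = digest
}

// get answers by digest directly, or by diffID after lazily decompressing
// (and caching) the corresponding compressed blob.
func (c *smartCache) get(h hash) ([]byte, error) {
	if b, ok := c.blobs[h]; ok {
		return b, nil
	}
	digest, ok := c.toDigest[h]
	if !ok {
		return nil, fmt.Errorf("not found: %s", h)
	}
	zr, err := gzip.NewReader(bytes.NewReader(c.blobs[digest]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	uncompressed, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	c.blobs[h] = uncompressed // cache the decompressed copy by diffID
	return uncompressed, nil
}

func main() {
	// Compress some fake layer contents.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("layer tarball contents"))
	zw.Close()

	c := newSmartCache()
	c.put("sha256:digest-of-gzip", "sha256:diffid-of-tar", buf.Bytes())

	got, err := c.get("sha256:diffid-of-tar") // first access: lazy decompression
	fmt.Println(string(got), err)             // "layer tarball contents <nil>"
}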

The cache package isn't the most-loved package in the repo by a long shot -- the fact that we weren't caching lazily until yesterday (#951) is a sign of that 😅. We need to invest more time -- and possibly break the API -- to do smarter things in the future.

@jonjohnsonjr (Collaborator)

One thing I've considered in the past (which would maybe be a ~breaking change) would be to maintain the cache as an OCI image layout. We can store descriptors pointing to the blobs (but really, anything) in index.json and use annotations to help with things like this, e.g.:

index.json

{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 28565893,
      "digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa",
      "annotations": {
        "dev.ggcr.cache.diffid": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009"
      }
    },{
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar",
      "size": 75271168,
      "digest": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009",
      "annotations": {
        "dev.ggcr.cache.digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa"
      }
    }
  ]
}

These both reference the same logical thing, but the first one is gzipped. For access efficiency, maybe we'd want to maintain a separate mapping as well, so that you don't have to parse index.json and scan it (O(N)) to determine whether an equivalent key exists, but this would be a good place to start (and it's portable).
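
As a rough sketch of how a lookup against such an index.json could work (field names follow the OCI image spec, and the dev.ggcr.cache.* annotation keys are just the hypothetical ones from the example above):

// Toy sketch: resolve a hash against index.json by matching either the
// descriptor's digest or its annotated counterpart hash.
package main

import (
	"encoding/json"
	"fmt"
)

type descriptor struct {
	MediaType   string            `json:"mediaType"`
	Size        int64             `json:"size"`
	Digest      string            `json:"digest"`
	Annotations map[string]string `json:"annotations"`
}

type index struct {
	SchemaVersion int          `json:"schemaVersion"`
	Manifests     []descriptor `json:"manifests"`
}

// lookup is the O(N) scan mentioned above; a real implementation might keep
// a side map from hash -> descriptor to avoid rescanning on every Get.
func lookup(idx index, h string) (descriptor, bool) {
	for _, d := range idx.Manifests {
		if d.Digest == h ||
			d.Annotations["dev.ggcr.cache.digest"] == h ||
			d.Annotations["dev.ggcr.cache.diffid"] == h {
			return d, true
		}
	}
	return descriptor{}, false
}

func main() {
	raw := []byte(`{
	  "schemaVersion": 2,
	  "manifests": [{
	    "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
	    "size": 28565893,
	    "digest": "sha256:83ee3a23efb7c75849515a6d46551c608b255d8402a4d3753752b88e0dc188fa",
	    "annotations": {
	      "dev.ggcr.cache.diffid": "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009"
	    }
	  }]
	}`)

	var idx index
	if err := json.Unmarshal(raw, &idx); err != nil {
		panic(err)
	}

	// Asking by diffID still finds the gzipped entry via its annotation.
	d, ok := lookup(idx, "sha256:9f32931c9d28f10104a8eb1330954ba90e76d92b02c5256521ba864feec14009")
	fmt.Println(ok, d.MediaType)
}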

@imjasonh (Collaborator)

> One thing I've considered in the past (which would maybe be a ~breaking change) would be to maintain the cache as an OCI image layout. We can store descriptors pointing to the blobs (but really, anything) in index.json and use annotations to help with things like this

One advantage of this would be that we could more easily ship the cache up to a registry, where it could be pulled/shared with others (not sure if that's compelling, but it's possible).

A downside would be that every cache.Put would have to open index.json, decode the JSON, and possibly write back a modified index.json in addition to the blob contents, and every cache.Get would at least have to open and decode index.json -- compared to an os.Stat and os.Create today. There are also concurrent read/write concerns to consider in both cases. Plain files probably aren't your best backing store if that matters, but the Cache interface should let you do anything you want if you care.
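
To make that cost comparison concrete, here is a toy sketch (file layout and helper names invented for illustration, not the real cache package): a file-per-hash cache answers a Get with a single open, while an index.json-backed cache has to read and decode the index on every call unless it also keeps an in-memory copy (and some locking for concurrent writers).

// Toy sketch comparing per-Get work for two on-disk layouts.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// fsGet: one file per hash, so a Get is a single open (or stat).
func fsGet(dir, hash string) (*os.File, error) {
	return os.Open(filepath.Join(dir, hash))
}

// indexGet: read and decode index.json, scan it for the hash, then open the
// blob from an OCI-style blobs/<algorithm>/<hex> directory.
func indexGet(dir, hash string) (*os.File, error) {
	raw, err := os.ReadFile(filepath.Join(dir, "index.json"))
	if err != nil {
		return nil, err
	}
	var idx struct {
		Manifests []struct {
			Digest string `json:"digest"`
		} `json:"manifests"`
	}
	if err := json.Unmarshal(raw, &idx); err != nil {
		return nil, err
	}
	for _, m := range idx.Manifests {
		if m.Digest == hash {
			algo, hex, _ := strings.Cut(hash, ":")
			return os.Open(filepath.Join(dir, "blobs", algo, hex))
		}
	}
	return nil, os.ErrNotExist
}

func main() {
	dir, err := os.MkdirTemp("", "cachesketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Populate both layouts with the same blob (the ":" in the file name is
	// fine on Unix; a real layout might escape it).
	const h = "sha256:0123abcd"
	os.WriteFile(filepath.Join(dir, h), []byte("blob"), 0o644)
	os.MkdirAll(filepath.Join(dir, "blobs", "sha256"), 0o755)
	os.WriteFile(filepath.Join(dir, "blobs", "sha256", "0123abcd"), []byte("blob"), 0o644)
	os.WriteFile(filepath.Join(dir, "index.json"),
		[]byte(`{"manifests":[{"digest":"sha256:0123abcd"}]}`), 0o644)

	for name, get := range map[string]func(string, string) (*os.File, error){
		"fsGet": fsGet, "indexGet": indexGet,
	} {
		f, err := get(dir, h)
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		fmt.Println(name, "opened", filepath.Base(f.Name()))
		f.Close()
	}
}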

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Keep fresh with the 'lifecycle/frozen' label.
