Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: io/fs, net/http: define interface for automatic ETag serving #43223

Closed
rsc opened this issue Dec 16, 2020 · 48 comments
Closed

proposal: io/fs, net/http: define interface for automatic ETag serving #43223

rsc opened this issue Dec 16, 2020 · 48 comments

Comments

@rsc
Copy link
Contributor

rsc commented Dec 16, 2020

In the discussion of io/fs and embed, a few people asked for automatic serving of ETag headers for static content, using content hashes. We ran out of time for that in Go 1.16, in part because it's not entirely clear what we should do.

Here is a proposal that may be full of holes.

First, in io/fs, define

package fs

// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File
	
	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	// The returned hash should be text, such as hexadecimal or base64.
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a content hash is not already available, ContentHash should
	// return an error rather than take the time to compute one.
	ContentHash() (string, error)
}

Second, in net/http, when serving a File, if it implements ContentHashFile and the ContentHash method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.

@johnjaylward

This comment has been minimized.

@Merovius
Copy link
Contributor

I feel "the returned hash should be text, such as hexadecimal or base64" is either not restrictive enough, or too restrictive. I think it would be reasonable to use the identifier in a URL, but not all base64 encodings are usable as such. So, IMO, we should either

  1. Be more restrictive and specify what usages need to be fulfilled: "usable as an ETag header", "usable as a URL fragment" (I don't know if one of these implies the other), …
  2. Be more restrictive and specify what encoding should be used: "hex", "base64.RawURLEncoding"…
  3. Be less restrictive and specify that no encoding is used and the caller needs to re-encode based on intended usage: "the returned string may contain arbitrary bytes and must be encoded by the caller based on usage" (we might also use []byte as a signal, though string allows the interface to be allocation-free)

I think the last would be the least convenient in actual usage, but the most flexible.

#43076 is similar to this issue in the proposed solution and the addressed problem (so one or the other should be closed as duplicate, I think). It contains some discussion of other use-cases and how the API would need to be more complicated to serve them - notably, it was suggested that the hash should be able to be of arbitrary length and that it should be possible to specify the required hash-algorithm (useful if used e.g. in subresource integrity). I'm not fully on board with that being needed, but I wanted to bring it up.

@johnjaylward I think O(length of file) is clearer. Note, that if you want to be nitpicky, it is oxymoronic to "run in time O(1) not O(n)", because O(1)⊆O(n). So it'd need to be "run in time O(1) not ω(1)" (or something). Sometimes, clarity takes precedence over technical correctness.

@johnjaylward
Copy link

@johnjaylward I think O(length of file) is clearer. Note, that if you want to be nitpicky, it is oxymoronic to "run in time O(1) not O(n)", because O(1)⊆O(n). So it'd need to be "run in time O(1) not ω(1)" (or something). Sometimes, clarity takes precedence over technical correctness.

  • O(1) is constant time regardless of input size.
  • O(n) is linear time dependent on input size of n. In this case n would be the size of the file.

A concern that the function execution time is relative to the size of the file is not O(1) or ω(1). It would be some function of n. If n is non-linear, then you'd need to specify that as the class of function being compared (n log n for instance.)

Big O notation should be used correctly, or it should be avoided in favor of plain English. Using a mix of BigO and something else just feels wrong.

Since what I believe @rsc is trying convey is that Constant time should be guaranteed, I think plain English should be preferred.

Something like:

That is, it must run in constant time.

Optionally maybe specifying what isn't expected:

Execution time should not be relative to file size.

I'm also concerned about what this is expected to happen if the ContentHash returns an error before transport. Does the transport fail, or continue with no ETAG?

@rsc

This comment has been minimized.

@rsc
Copy link
Contributor Author

rsc commented Dec 16, 2020

I'm also concerned about what this is expected to happen if the ContentHash returns an error before transport. Does the transport fail, or continue with no ETAG?

It continues with no ETag.

@johnjaylward
Copy link

I am explicitly avoiding saying "constant time" because the cryptographers have redefined that term to mean something much more subtle that I do not intend here. If I had written "constant time" in a paragraph that already contains "content hash", we'd be dying on a different pointless hill.

Understood, it is more of a nitpick than anything worth dying over. I'd prefer to drop improper use of O notation from official language docs. I've not read all of Go's docs by any means, but I've not seen this use anywhere in my reading. If it is used elsewhere, then this is just one more.

@ianlancetaylor
Copy link
Contributor

Will it ever be necessary to know what kind of content hash function is being used by the underlying file system? For example, perhaps a security sensitive application would be willing to accept a SHA-256 hash but not an MD5 hash.

@rsc
Copy link
Contributor Author

rsc commented Dec 16, 2020

I don't know, and that's part of why we didn't propose anything for Go 1.16. ContentHash seems too vague at some level, but calling it ETag is too specific. We could call it something else - ContentID? - but it doesn't change the fact that the FS is in charge of which hash gets used, and the caller is expected not to be picky.

@tv42
Copy link

tv42 commented Dec 17, 2020

Either the acceptable character set of string has to be defined very clearly, resulting in either a one-size-fits-nobody solution or double encoding for most uses, or encoding should be left to the party that knows what encoding is needed (my preference).

ETag content (inside the quotes) character set is actually pretty different from most other uses:

etagc = "!" / %x23-7E / obs-text
obs-text = %x80-FF

@komuw
Copy link
Contributor

komuw commented Dec 23, 2020

maybe relevant; https://github.com/benbjohnson/hashfs

@dfinkel
Copy link
Contributor

dfinkel commented Dec 30, 2020

What if ContentHash() takes an argument indicating which digest/hash-algorithms it accepts?

Obviously a strawman, but how about making the interface definition something like (clearly modifying the example in the OP):

package fs

import "hash"

// HashAlgorithm implementations
type HashAlgorithm interface {
	NewHash() hash.Hash
	String() string
}

// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File

	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	// The returned hash should be text, such as hexadecimal or base64.
	//
	// The HashAlgorithm-typed argument is a prioritized list of
	// hash-algorithm, most-proffered first. (a nil argument may return an
	// arbitrary, but consistent, implementation-defined content-hash)
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a matching content hash is not already available, ContentHash
	// should return an error rather than take the time to compute one.
	ContentHash([]HashAlgorithm) (string, error)
}

Then, individual packages (crypto/md5, hash/crc32, golang.org/x/crypto/sha3) can have specific package-level instances of this interface, and implementations can do a quick equality check against the package-level variables.

I figure making the HashAlgorithm interface have a method that returns a hash.Hash makes it actually useful, and could make it easier to implement a more flexible interface that allows for calculating the hash of a file's contents at runtime, but uses a pre-computed value if it's available. (perhaps ContentHash should be called PrecomputedContentHash?)

It might make sense to define the HashAlgorithm interface hash package

Note: an enum-ish type would be another option, but then individual implementations would be limited to just the types defined here.
I almost started writing an enum and the interface to provide flexibility, but then the fs package would have some method implementations that pull in every type that implements hash.Hash. It seems better to define an interface in the packages.

@Merovius
Copy link
Contributor

@dfinkel I think as #43076 proves, designing an API to do anything we could reasonably want from it is a very deep rabbit hole. I feel like we should at least first agree if this is a rabbit hole we want to go down, before we get ourselves stuck in it. That is, what are the use-cases we would consider hard requirements?

So far, this issue is titled "define interface for automatic ETag serving", so a priori that should be considered the only requirement. And for that, the interface you are suggesting is not just unnecessarily complex, but I would argue even counterproductive - it suggests that a ContentHashFile should use the passed in HashAlgorithms to compute the hash, when that's intentionally documented as forbidden in the original suggestion.

Of course ETag serving doesn't have to be the only requirement. But importantly, we should make a case for other use-cases to be required first, before trying to design an API for them.

@tv42
Copy link

tv42 commented Dec 30, 2020

@Merovius If this use case truly were ETags, then the function discussed here should do all the proper ETag formatting, including the double quotes.

Personally, I think that's a silly use case, by itself.

@sylr
Copy link

sylr commented Dec 31, 2020

HTTP client side caching is not a silly use case ...

@Merovius
Copy link
Contributor

@tv42 I tend to agree. And, to repeat, I'm not even saying we should limit ourselves to ETag serving. I don't know where, or how to draw the line. But I think a line needs to be drawn, because most of the APIs mentioned in #43076 seem unacceptably complex for the stdlib to me (though that might just be a personal taste).

FWIW, you mentioned Subresource-integrity as another use-case in that issue and I think that would be an interesting case to at least think about.

@dfinkel
Copy link
Contributor

dfinkel commented Dec 31, 2020

@Merovius thanks for the pointer to #43076, I missed it before.

I'll be honest, I don't actually care that much about ETags, but I do see end-to-end checksum protection as critical for storage systems at scale. The cryptographic hashes are great for this goal, but only part of the story. I have a strong preference for crc32c in many places because it's cheap to compute the CRC of the concatenation of two blobs that you already have the CRC for (and most x86 and ARM chips have instructions to accelerate CRC32c computations explicitly).

IMO, creating an interface within io/fs that only serves ETags is a wasted opportunity (and likely confusing going forward). It's a trivial use-case for much more powerful primitives, where ETag only cares about "a content identifier", most other interesting use-cases (e.g. end-to-end checksum protection) require knowing which one, and sometimes requesting a specific one. In many cases (e.g. GCS) there are multiple digests/hashes computed, and it's useful to have both (without having to construct a new fs.FS handle).

The reason I included the NewHash() hash.Hash method in my interface proposal was too make it a more useful interface that isn't just a named interface{} or yet another name for fmt.Stringer. Given that we'd want many packages to support it outside the stdlib, and all of them have implementations of hash.Hash, it seemed appropriate to make this interface a "factory". (I'm definitely open to discussing alternatives)

One thing to keep in mind is that availability of specific hashes for some filesystems is history-dependent. e.g. GCS (and S3) don't store an MD5 if you've used the "Compose" API to concatenate several objects. (although GCS still has a CRC, and S3 has some other value in its ETag)

@tv42
Copy link

tv42 commented Dec 31, 2020

@sylr Locking into doing nothing else but ETags would, in my opinion, be silly.

Embracing Subresource Integrity (https://w3c.github.io/webappsec-subresource-integrity/) would mean the return value should be (a set of) 1) hash algo 2) digest 3) optional options. If those are kept separate, then the ETag use case can just use the truncated digest as the verifier.

package fs

type Integrity struct {
        // Algorithm as defined by Content Security Policy Level 2, section 4.2
	Algorithm string
        Digest []byte
        // SRI options are not supported at this time, but may be added later?
}

// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File
	
	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a content hash is not already available, ContentHash should
	// return an error rather than take the time to compute one.
        //
        // Callers must not mutate the returned values.
	ContentHash() ([]Integrity, error)
}

@ianlancetaylor
Copy link
Contributor

As far as I know, the file systems that make a hash code available only make a single hash code available. If that is true, then I don't see a good argument for requesting a specific hash algorithm, or a set of acceptable hash algorithms. It would suffice to have some way for the file system to report which hash algorithm it is using.

I don't know of any current listing of hash algorithms. So I would tentatively recommend using package paths.

// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File
	
	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	// The returned hash should be text, such as hexadecimal or base64.
	// The returned algorithm should be a package path, preferably in the
	// standard library, that identifies the hash algorithm being used; for example:
	// "hash/crc32" or "crypto/sha256".
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a content hash is not already available, ContentHash should
	// return an error rather than take the time to compute one.
	ContentHash() (hash, algorithm string, error)
}

@tv42
Copy link

tv42 commented Dec 31, 2020

@ianlancetaylor If you force everything to only provide just one hash digest, it's impossible to gradually migrate to another hash algorithm. You can't switch everything over to the new one when the clients don't have the support, but you can't provide the new hash along with the old hash if the API is forcing just one answer.

GCS seems to have CRC32C and MD5 at this time, and has a mechanism for adding more if they wanted to: https://cloud.google.com/storage/docs/hashes-etags https://cloud.google.com/storage/docs/xml-api/reference-headers#xgooghash

Here's AWS S3 talking about MD5 and a tree hash based on SHA256: https://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html

Min.IO is S3-compatible so they store MD5s, but it also uses the faster HighwayHash for integrity protection: https://min.io/product/overview

The world will need to move beyond MD5 at some point.

@ianlancetaylor
Copy link
Contributor

Thanks. If file systems are already storing and providing multiple different kinds of hashes, then I was mistaken.

I guess one question would be whether it is ever important for users of this interface to ask for multiple hashes. When would you not want the most secure version?

And another question would be whether package paths are a reasonable way to describe the hash.

@Merovius
Copy link
Contributor

Merovius commented Jan 1, 2021

@ianlancetaylor Personally, I don't like the idea of using package paths to describe the hash. They are easy to typo and can't be compiler-verified. I have suggested to use package-scoped values instead. That is, you'd have e.g. crypto/sha256.Handle (no idea what a good name would be) be a singleton value that can be passed to uniquely identify sha256. Third party packages can easily define their own handle that can be passed around.

Of course, the downside is that this requires an import to actually use the handle. I'm not sure how big of a problem that is though - IMO an extra import only has an added cost, if you link an otherwise unused package into the binary and I would argue that if you want to use sha256 or if you are able to handle/compute sha256, you already use it. But I might be overlooking something.

@thomasf
Copy link

thomasf commented Jan 1, 2021

In the case of serving content from a AWS S3 you probably just want to straight up use their premade ETag which differs depend on how the files are stored:

AWS DOCS:

The entity tag represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:

  • Objects created through the AWS Management Console or by the PUT Object, POST Object, or Copy operation:
    • Objects encrypted by SSE-S3 or plaintext have ETags that are an MD5 digest of their data.
    • Objects encrypted by SSE-C or SSE-KMS have ETags that are not an MD5 digest of their object data.
  • Objects created by either the Multipart Upload or Part Copy operation have ETags that are not MD5 digests, regardless of the method of encryption.

@thomasf
Copy link

thomasf commented Jan 1, 2021

I don't know but maybe an etag specific interface specifically indended for use by http.FS in the standard library wouldn't be super bad idea.

@dfinkel
Copy link
Contributor

dfinkel commented Jan 1, 2021

@thomasf I think an ETag-specific interface within io/fs would be confusingly (and frustratingly) narrow.

As some here have pointed out, ETag is a relatively trivial use-case, since it only needs a hash, but doesn't care which one, as long as it's consistent. (strictly speaking it doesn't even need to be a content-hash, so long as it changes with the content -- a version number or commit-hash might suffice in some cases, but that's beside the point because there's already mtime handling in net/http)

There are important use-cases involving specific content-hashes, and it seems much better to have a function in net/http to create an ETag from any fs.File implementation that implements fs.ContentHashFile.

If a fs.ContentHashFile implementation wants to return a specific value for ETag, we can have a net/http.EtagHandle for the hash-type (if we're following a variant of what @Merovius and I have proposed). -- then our net/http.FileETag() could pick something appropriate and do correct encoding, but all other use-cases that care which digest/hash is used (or need a specific one) can use the fs.ContentHashFile interface, and don't need to think about ETags at all. (I imagine that in the absence of an etag-specific hash, FileETag() would pick the strongest digest available (and skip over known-collision-prone hashes like CRCs, FNVs, etc.)

Note: I don't have a use-case for exporting net/http.FileETag() so it might just be part of one of the methods of net/http.FileServer.

@tv42
Copy link

tv42 commented Jan 3, 2021

@ianlancetaylor

I guess one question would be whether it is ever important for users of this interface to ask for multiple hashes. When would you not want the most secure version?

When clients verify the hash, when it's not used as just an opaque identifier.
Subresource Integrity is an example.

Consider: Most of current clients implement hash algo A, which is starting to show weaknesses. The crypto community is starting to feel good about hash algo B. You cannot just start serving only B digests, as most current clients wouldn't know what to do with them. Hence you provide A and B, and wait for clients to implement B before dropping A.

@rsc
Copy link
Contributor Author

rsc commented Jan 6, 2021

It sounds like there are two concerns.

  1. The API as proposed does not let the caller request a particular implementation.
  2. The API as proposed does not let the implementation say which hash it used. Callers may care so they can double-check the result or confirm that it's one that they will accept (although that latter case bleeds into 1 a little).

For (1), I think that's a different proposal. There will be callers who don't care about the implementation. They just want any hash the file system has available. Like HTTP does for ETags. That's the focus of this proposal. I don't believe we have to engineer it more than that.

For (2), it seems like maybe at least parts of the world are converging on the syntax used by Subresource Integrity
so maybe we should strongly encourage implementations to return a string in the form algo-base64, where algo is encouraged to be "sha256", "sha384", or "sha512". (I don't really want to link to the W3C spec from the fs package but the format is simple enough.)

That makes the result self-identifying and it's still useful as an ETag (arguably more useful). So the interface would be:

// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File
	
	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	// The returned hash should be of the form algorithm-base64,
	// and implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithm, for interoperability with other systems.
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a content hash is not already available, ContentHash should
	// return an error rather than take the time to compute one.
	ContentHash() (string, error)
}

Thoughts?

@tv42
Copy link

tv42 commented Jan 7, 2021

  1. The API as proposed does not let the implementation return multiple hashes. That's needed for e.g. SRI transitions to new hash algorithms, or to e.g. let clients prefer hashes that are cheaper to verify (say, blake3).

Also, just "looks to similar to SRI format" doesn't allow one to pass through the returned value to SRI. If the algo strings aren't what the SRI registry wants, there's going to be a remapping/standard enforcement layer anyway, for anyone who's a careful programmer.

Combined with the "one winner only" concept, SRI's limited hash selection means you might not end up with a SRI-compatible hash digest at all.

@patrickmcnamara
Copy link

patrickmcnamara commented Jan 10, 2021

Is there a reason the proposed interface is very focused on hashes of file content?

ETags can be generated by any means as long as they uniquely identify a specific version of a file, as per RFC7232. The RFC also specifically states that there is no need for the client to know how the ETag is constructed.

I don’t think the interface should be limited to content hashes as some other types of ETag are common, simple and useful, such as timestamps or even just revision numbers. ETags need not be unique across different files. To cater for ETags specifically, I think the interface should allow more than just content hashes. For example, VersionedFile with a Version() (string, error) method (these are bad names but get the idea across).

If this interface is going to be used more generally and for purposes other than ETags, I don’t think that should necessarily limit its originally proposed purpose for ETags.

As well as that, this proposal only allows for ETags when serving Files but having something that works for io.Reader could be useful too although that might be outside of the scope of this proposal.

@rsc
Copy link
Contributor Author

rsc commented Jan 13, 2021

@patrickmcnamara It's hard to produce unique identifiers across multiple servers and multiple builds of a program. The one way we know that works very well is content hashes. Also we are trying to design something that programs might want for reasons other than ETag generation.

@rsc
Copy link
Contributor Author

rsc commented Jan 13, 2021

@tv42 Your (3) I was including as part of (1), at least as far as the answer: that's a more complex, different interface. This is the simple interface meant to be used for detecting file changes and uniquely identifying content with some reasonable hash. The format change gives a way for implementations and consumers to understand which one is being used.

So as far as "it's impossible to gradually migrate to another hash algorithm", that's out of scope here. And for the record, CRC32C should not be a valid implementation of ContentHash.

@rsc
Copy link
Contributor Author

rsc commented Jan 13, 2021

Are there any other objections to #43223 (comment)?
Is this important enough that we should do something at all?

Would people object to dropping this entirely and not setting ETag automatically for embedded content served over net/http?

@rsc
Copy link
Contributor Author

rsc commented Jan 13, 2021

A third-party library can provide a wrapper around http.FS that adds ETags by walking the fs.FS and pre-computing all the hashes and then using them to set ETags. The amount of embedded data is unlikely to be large enough that the pre-computation will take much time at all. (SHA256 runs at 350 MB/s on my laptop.) That would provide content-hash ETags still without any explicit "go generate" step.

Given that, here is a counter-proposal for thumbs up/thumbs down reactions:

Drop this proposal and let third-party code fill this need.

Thumbs up this comment if you think we should do nothing and abandon this issue. Thumbs down if you think it's important to do something here in the standard library. Thanks!

@Merovius
Copy link
Contributor

ETags where brought up before the embed-proposal was accepted and the response was "we will make sure that http caching will work". It seems unfortunate to walk that back now, after the proposal has been accepted.

It also feels inelegant to me, to simply discard the hash information the compiler already adds (though TBF, I don't know why it adds it, it doesn't seem to be used, so maybe that is just an artifact of the assumption that we'd want to add ETags).

@mkilpatrick
Copy link

Not directly related to Go, but is tangential -

http.FileServer will not honor an ETag set via a wrapping http.Handler if the FileInfo's ModTime is not the 0 time. If you compile your Go code with Bazel the ModTime will get set to 1/1/2000 due to how Bazel deals with reproducible builds. bazelbuild/bazel#1299

@hherman1
Copy link

I don't think its super important to do in the standard library, but I think it would be a nice-to-have. I think if the price of this feature is a new public interface in the standard library "ContentHashFile" then it is not worth it. If this can be added without any new API, then I think its almost a pure win.

@rsc
Copy link
Contributor Author

rsc commented Jan 20, 2021

@Merovius The hash was there just for this issue. It is unused by package embed. And it would be easy to remove and reclaim those 16 bytes.

@rsc
Copy link
Contributor Author

rsc commented Jan 20, 2021

FWIW, I did say that we would figure out how to make ETags work, but part of figuring it out is determining the design, through discussions like this one. And it is starting to sound like the right way to make it work is to suggest using an ETag-adding layer as described in #43223 (comment).

That comment has slightly more thumbs up (leave for third-party packages) than thumbs down (do something in std library) right now.

Will leave this open, but we don't seem to be converging on any plan forward in the standard library.

@rsc
Copy link
Contributor Author

rsc commented Jan 27, 2021

I see the thumbs downs on my previous comment, and I apologize that we don't know what to do.

But it seems clear that we don't know what to do. So we don't have a consensus about what to do, which usually means decline. That's probably what we should do - decline - and use embed for a while and see what ideas people come up with.

@tv42
Copy link

tv42 commented Jan 28, 2021

All I have to say is #41191 (comment) #35950 (comment)

@rsc
Copy link
Contributor Author

rsc commented Feb 3, 2021

@tv42, acknowledged (including in my comment above). We wanted to do this but do not have a good design. That's how it goes sometimes. The right design will probably come with time.

@rsc
Copy link
Contributor Author

rsc commented Feb 3, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Copy link
Contributor Author

rsc commented Feb 10, 2021

No change in consensus, so declined.
— rsc for the proposal review group

@oliverpool
Copy link

I had an idea after reading this discussion, but I am not sure if it is good enough to open a new proposal.
I would greatly appreciate any feedback


Maybe the signature of ContentHash could be changed to: ContentHash() []string.

  1. this allows to return multiple hashes (so that we have possibility to gradually change the hash - for instance to consider client support)
  2. the order of the returned hash can reflect a preference (the first one should be preferred if the caller does not care about the algorithm - for instance for ETag)
  3. since this slice should be pre-computed, there is no need to return an error : simply return nil if no hash has been pre-computed.

I would suggest simply letting embed.FS expose the already computed hash for now and move the topic of controlling which hashes are computed to another discussion (centered on embed.FS) - except if there is an obvious way, which I don't know.

I think this addresses "it's impossible to gradually migrate to another hash algorithm", which I understood was the main blocker.

// A ContentHashFile is a file that can return its own content hashes efficiently.
type ContentHashFile interface {
	File
	
	// ContentHash returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64,
	// and implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms, for interoperability with other systems.
	//
	// ContentHash must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHash should
	// return nil rather than take the time to compute one.
	ContentHash() []string
}

@oliverpool
Copy link

About the concerns raised in #43223 (comment) regarding the new ContentHash() []string proposition:

  1. The API as proposed does not let the caller request a particular implementation.

Since it seems to be agreed that the hash must be pre-computed, I think that giving multiple hashes for the caller to choose from would be flexible enough, while not putting any complex logic in the FS.

  1. The API as proposed does not let the implementation say which hash it used. Callers may care so they can double-check the result or confirm that it's one that they will accept (although that latter case bleeds into 1 a little)

The comment of @rsc to suggest the implementers to use "algo-base64" applies here even more.

  1. The API as proposed does not let the implementation return multiple hashes. That's needed for e.g. SRI transitions to new hash algorithms, or to e.g. let clients prefer hashes that are cheaper to verify (say, blake3).

This new adaptation would address #43223 (comment), raised by @tv42

Is there any concern before I open a new proposal with this adapted interface?

@golang golang locked and limited conversation to collaborators Dec 18, 2022
@mauri870
Copy link
Member

mauri870 commented Nov 2, 2023

Proposal reconsidered in #60940

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests