proposal: io/fs, net/http: define interface for automatic ETag serving #43223
I feel "the returned hash should be text, such as hexadecimal or base64" is either not restrictive enough, or too restrictive. I think it would be reasonable to use the identifier in a URL, but not all base64 encodings are usable as such. So, IMO, we should either
I think the last would be the least convenient in actual usage, but the most flexible. #43076 is similar to this issue in the proposed solution and the addressed problem (so one or the other should be closed as a duplicate, I think). It contains some discussion of other use-cases and how the API would need to be more complicated to serve them - notably, it was suggested that the hash should be able to be of arbitrary length and that it should be possible to specify the required hash algorithm (useful if used e.g. in subresource integrity). I'm not fully on board with that being needed, but I wanted to bring it up. @johnjaylward I think …
A concern: the function execution time is relative to the size of the file, so it is not O(1) or ω(1); it would be some function of the file size. Big O notation should be used correctly, or it should be avoided in favor of plain English. Using a mix of Big O and something else just feels wrong. Since what I believe @rsc is trying to convey is that constant time should be guaranteed, I think plain English should be preferred. Something like:
Optionally maybe specifying what isn't expected:
I'm also concerned about what is expected to happen if the …
It continues with no ETag.
Understood, it is more of a nitpick than anything worth dying over. I'd prefer to drop the improper use of Big O notation.
Will it ever be necessary to know what kind of content hash function is being used by the underlying file system? For example, perhaps a security sensitive application would be willing to accept a SHA-256 hash but not an MD5 hash.
I don't know, and that's part of why we didn't propose anything for Go 1.16. ContentHash seems too vague at some level, but calling it ETag is too specific. We could call it something else - ContentID? - but it doesn't change the fact that the FS is in charge of which hash gets used, and the caller is expected not to be picky.
Either the acceptable character set of the string has to be defined very clearly, resulting in either a one-size-fits-nobody solution or double encoding for most uses, or encoding should be left to the party that knows what encoding is needed (my preference). The ETag content (inside the quotes) character set is actually pretty different from most other uses:
Maybe relevant: https://github.com/benbjohnson/hashfs
What if …? Obviously a strawman, but how about making the interface definition something like (clearly modifying the example in the OP):

```go
package fs

import "hash"

// HashAlgorithm implementations supply a hash constructor and a name.
type HashAlgorithm interface {
	NewHash() hash.Hash
	String() string
}

// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File

	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	// The returned hash should be text, such as hexadecimal or base64.
	//
	// The []HashAlgorithm argument is a prioritized list of hash
	// algorithms, most-preferred first. (A nil argument may return an
	// arbitrary, but consistent, implementation-defined content hash.)
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a matching content hash is not already available, ContentHash
	// should return an error rather than take the time to compute one.
	ContentHash([]HashAlgorithm) (string, error)
}
```

Then, individual packages (…) could provide `HashAlgorithm` implementations. I figure making the … It might make sense to define the …

Note: an enum-ish type would be another option, but then individual implementations would be limited to just the types defined here.
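For illustration, here is a hedged sketch of how individual hash packages could expose `HashAlgorithm` values under this strawman. The concrete types (`sha256Algo`, `crc32cAlgo`) are hypothetical names of mine; no such API exists in the standard library.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"hash/crc32"
)

// HashAlgorithm mirrors the interface sketched above.
type HashAlgorithm interface {
	NewHash() hash.Hash
	String() string
}

// sha256Algo is a hypothetical handle that crypto/sha256 could export.
type sha256Algo struct{}

func (sha256Algo) NewHash() hash.Hash { return sha256.New() }
func (sha256Algo) String() string     { return "crypto/sha256" }

// crc32cAlgo is a hypothetical handle for the Castagnoli CRC, the
// variant @dfinkel prefers for cheap concatenation.
type crc32cAlgo struct{}

func (crc32cAlgo) NewHash() hash.Hash {
	return crc32.New(crc32.MakeTable(crc32.Castagnoli))
}
func (crc32cAlgo) String() string { return "hash/crc32 (Castagnoli)" }

func main() {
	// A caller would pass this prioritized list to ContentHash.
	prefs := []HashAlgorithm{sha256Algo{}, crc32cAlgo{}}
	for _, a := range prefs {
		h := a.NewHash()
		h.Write([]byte("hello"))
		fmt.Printf("%s: %d-byte digest\n", a.String(), len(h.Sum(nil)))
	}
}
```

Because the values are package-scoped, a typo becomes a compile error rather than an unmatched string, which is the argument made later in the thread against package paths.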
@dfinkel I think, as #43076 proves, designing an API to do anything we could reasonably want from it is a very deep rabbit hole. I feel like we should at least first agree whether this is a rabbit hole we want to go down, before we get ourselves stuck in it. That is, what are the use-cases we would consider hard requirements? So far, this issue is titled "define interface for automatic ETag serving", so a priori that should be considered the only requirement. And for that, the interface you are suggesting is not just unnecessarily complex, but I would argue even counterproductive - it suggests that a …

Of course ETag serving doesn't have to be the only requirement. But importantly, we should make a case for other use-cases to be required first, before trying to design an API for them.
@Merovius If this use case truly were ETags, then the function discussed here should do all the proper ETag formatting, including the double quotes. Personally, I think that's a silly use case, by itself.
HTTP client side caching is not a silly use case ...
@tv42 I tend to agree. And, to repeat, I'm not even saying we should limit ourselves to ETag serving. I don't know where, or how, to draw the line. But I think a line needs to be drawn, because most of the APIs mentioned in #43076 seem unacceptably complex for the stdlib to me (though that might just be a personal taste). FWIW, you mentioned Subresource Integrity as another use-case in that issue and I think that would be an interesting case to at least think about.
@Merovius thanks for the pointer to #43076, I missed it before. I'll be honest, I don't actually care that much about ETags, but I do see end-to-end checksum protection as critical for storage systems at scale. The cryptographic hashes are great for this goal, but only part of the story. I have a strong preference for CRC32C in many places because it's cheap to compute the CRC of the concatenation of two blobs that you already have the CRC for (and most x86 and ARM chips have instructions to accelerate CRC32C computations explicitly).

IMO, creating an interface within … The reason I included the …

One thing to keep in mind is that the availability of specific hashes for some filesystems is history-dependent. E.g. GCS (and S3) don't store an MD5 if you've used the "Compose" API to concatenate several objects (although GCS still has a CRC, and S3 has some other value in its ETag).
@sylr Locking into doing nothing else but ETags would, in my opinion, be silly. Embracing Subresource Integrity (https://w3c.github.io/webappsec-subresource-integrity/) would mean the return value should be (a set of) 1) hash algo 2) digest 3) optional options. If those are kept separate, then the ETag use case can just use the truncated digest as the verifier.
As far as I know, the file systems that make a hash code available only make a single hash code available. If that is true, then I don't see a good argument for requesting a specific hash algorithm, or a set of acceptable hash algorithms. It would suffice to have some way for the file system to report which hash algorithm it is using. I don't know of any current listing of hash algorithms, so I would tentatively recommend using package paths.

```go
// A ContentHashFile is a file that can return its own content hash efficiently.
type ContentHashFile interface {
	File

	// ContentHash returns a content hash of the file that uniquely
	// identifies the file contents, suitable for use as a cache key.
	// The returned hash should be text, such as hexadecimal or base64.
	// The returned algorithm should be a package path, preferably in the
	// standard library, that identifies the hash algorithm being used;
	// for example, "hash/crc32" or "crypto/sha256".
	//
	// ContentHash must NOT compute the hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If a content hash is not already available, ContentHash should
	// return an error rather than take the time to compute one.
	ContentHash() (hash, algorithm string, err error)
}
```
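As an illustration of this variant (hypothetical types of mine, not stdlib API), a file wrapper carrying a digest computed ahead of time can satisfy the O(1) requirement by only ever returning stored values:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// hashedFile wraps an fs.File together with a digest computed ahead of
// time (e.g. at build or startup). ContentHash never hashes on demand:
// it only returns the stored value, matching the "must NOT compute the
// hash during the call" rule.
type hashedFile struct {
	fs.File
	hash string // hex or base64 text, precomputed
	algo string // package path, e.g. "crypto/sha256"
}

func (f hashedFile) ContentHash() (hash, algorithm string, err error) {
	if f.hash == "" {
		return "", "", errors.New("no content hash available")
	}
	return f.hash, f.algo, nil
}

func main() {
	fsys := fstest.MapFS{"index.html": &fstest.MapFile{Data: []byte("<html>")}}
	f, _ := fsys.Open("index.html")
	hf := hashedFile{File: f, hash: "deadbeef", algo: "hash/crc32"}
	h, a, err := hf.ContentHash()
	fmt.Println(h, a, err)
}
```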
@ianlancetaylor If you force everything to only provide just one hash digest, it's impossible to gradually migrate to another hash algorithm. You can't switch everything over to the new one when the clients don't have the support, but you can't provide the new hash along with the old hash if the API is forcing just one answer.

GCS seems to have CRC32C and MD5 at this time, and has a mechanism for adding more if they wanted to: https://cloud.google.com/storage/docs/hashes-etags https://cloud.google.com/storage/docs/xml-api/reference-headers#xgooghash

Here's AWS S3 talking about MD5 and a tree hash based on SHA256: https://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html

MinIO is S3-compatible so they store MD5s, but it also uses the faster HighwayHash for integrity protection: https://min.io/product/overview

The world will need to move beyond MD5 at some point.
Thanks. If file systems are already storing and providing multiple different kinds of hashes, then I was mistaken. I guess one question would be whether it is ever important for users of this interface to ask for multiple hashes. When would you not want the most secure version? And another question would be whether package paths are a reasonable way to describe the hash.
@ianlancetaylor Personally, I don't like the idea of using package paths to describe the hash. They are easy to typo and can't be compiler-verified. I have suggested using package-scoped values instead. That is, you'd have e.g. …

Of course, the downside is that this requires an import to actually use the handle. I'm not sure how big of a problem that is, though - IMO an extra import only has an added cost if you link an otherwise unused package into the binary, and I would argue that if you want to use …
In the case of serving content from AWS S3, you probably just want to straight up use their premade ETag, which differs depending on how the files are stored. From the AWS docs: The entity tag represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted, as described below:
I don't know, but maybe an ETag-specific interface specifically intended for use by http.FS in the standard library wouldn't be a super bad idea.
@thomasf I think an ETag-specific interface within …

As some here have pointed out, ETag is a relatively trivial use-case, since it only needs a hash, but doesn't care which one, as long as it's consistent. (Strictly speaking it doesn't even need to be a content-hash, so long as it changes with the content - a version number or commit-hash might suffice in some cases, but that's beside the point, because there's already mtime handling in …)

There are important use-cases involving specific content-hashes, and it seems much better to have a function in … If a …

Note: I don't have a use-case for exporting …
When clients verify the hash, when it's not used as just an opaque identifier. Consider: most current clients implement hash algo A, which is starting to show weaknesses. The crypto community is starting to feel good about hash algo B. You cannot just start serving only B digests, as most current clients wouldn't know what to do with them. Hence you provide A and B, and wait for clients to implement B before dropping A.
It sounds like there are two concerns.
For (1), I think that's a different proposal. There will be callers who don't care about the implementation. They just want any hash the file system has available, like HTTP does for ETags. That's the focus of this proposal. I don't believe we have to engineer it more than that.

For (2), it seems like maybe at least parts of the world are converging on the syntax used by Subresource Integrity. That makes the result self-identifying, and it's still useful as an ETag (arguably more useful). So the interface would be:
Thoughts?
Also, just "looks similar to the SRI format" doesn't allow one to pass the returned value through to SRI. If the algo strings aren't what the SRI registry wants, there's going to be a remapping/standard-enforcement layer anyway, for anyone who's a careful programmer. Combined with the "one winner only" concept, SRI's limited hash selection means you might not end up with an SRI-compatible hash digest at all.
Is there a reason the proposed interface is so focused on hashes of file content? ETags can be generated by any means as long as they uniquely identify a specific version of a file, as per RFC 7232. The RFC also specifically states that there is no need for the client to know how the ETag is constructed. I don't think the interface should be limited to content hashes, as some other types of ETag are common, simple and useful, such as timestamps or even just revision numbers. ETags need not be unique across different files. To cater for ETags specifically, I think the interface should allow more than just content hashes. For example, …

If this interface is going to be used more generally and for purposes other than ETags, I don't think that should necessarily limit its originally proposed purpose for ETags. As well as that, this proposal only allows for ETags when serving …
@patrickmcnamara It's hard to produce unique identifiers across multiple servers and multiple builds of a program. The one way we know that works very well is content hashes. Also, we are trying to design something that programs might want for reasons other than ETag generation.
@tv42 Your (3) I was including as part of (1), at least as far as the answer: that's a more complex, different interface. This is the simple interface meant to be used for detecting file changes and uniquely identifying content with some reasonable hash. The format change gives a way for implementations and consumers to understand which one is being used. So as far as "it's impossible to gradually migrate to another hash algorithm", that's out of scope here. And for the record, CRC32C should not be a valid implementation of ContentHash.
Are there any other objections to #43223 (comment)? Would people object to dropping this entirely and not setting ETag automatically for embedded content served over net/http?
A third-party library can provide a wrapper around http.FS that adds ETags by walking the fs.FS, pre-computing all the hashes, and then using them to set ETags. The amount of embedded data is unlikely to be large enough that the pre-computation will take much time at all. (SHA256 runs at 350 MB/s on my laptop.) That would still provide content-hash ETags without any explicit "go generate" step.

Given that, here is a counter-proposal for thumbs up/thumbs down reactions: drop this proposal and let third-party code fill this need. Thumbs up this comment if you think we should do nothing and abandon this issue. Thumbs down if you think it's important to do something here in the standard library. Thanks!
It also feels inelegant to me to simply discard the hash information the compiler already adds (though TBF, I don't know why it adds it; it doesn't seem to be used, so maybe that is just an artifact of the assumption that we'd want to add ETags).
Not directly related to Go, but tangential: http.FileServer will not honor an ETag set via a wrapping http.Handler if the FileInfo's ModTime is not the zero time. If you compile your Go code with Bazel, the ModTime will get set to 1/1/2000 due to how Bazel deals with reproducible builds. bazelbuild/bazel#1299
I don't think it's super important to do in the standard library, but I think it would be a nice-to-have. If the price of this feature is a new public interface in the standard library, "ContentHashFile", then it is not worth it. If this can be added without any new API, then I think it's almost a pure win.
@Merovius The hash was there just for this issue. It is unused by package embed. And it would be easy to remove and reclaim those 16 bytes.
FWIW, I did say that we would figure out how to make ETags work, but part of figuring it out is determining the design, through discussions like this one. And it is starting to sound like the right way to make it work is to suggest using an ETag-adding layer as described in #43223 (comment). That comment has slightly more thumbs up (leave for third-party packages) than thumbs down (do something in std library) right now. Will leave this open, but we don't seem to be converging on any plan forward in the standard library.
I see the thumbs-downs on my previous comment, and I apologize that we don't know what to do. But it seems clear that we don't know what to do. So we don't have a consensus about what to do, which usually means decline. That's probably what we should do - decline - and use embed for a while and see what ideas people come up with.
All I have to say is #41191 (comment) #35950 (comment)
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
I had an idea after reading this discussion, but I am not sure if it is good enough to open a new proposal. Maybe the signature of …
I would suggest simply letting … I think this addresses "it's impossible to gradually migrate to another hash algorithm", which I understood was the main blocker.
About the concerns raised in #43223 (comment) regarding the new …

Since it seems to be agreed that the hash must be pre-computed, I think that giving multiple hashes for the caller to choose from would be flexible enough, while not putting any complex logic in the …

@rsc's comment suggesting that implementers use "algo-base64" applies here even more.

This new adaptation would address #43223 (comment), raised by @tv42. Is there any concern before I open a new proposal with this adapted interface?
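A hedged sketch of such a multi-hash adaptation (all names hypothetical, not a proposal's actual text): the implementation returns every precomputed digest it has, keyed by algorithm, and a caller-supplied preference order picks one, which is what makes gradual migration possible.

```go
package main

import (
	"errors"
	"fmt"
)

// MultiHashFile returns all precomputed digests keyed by algorithm
// name. Hypothetical interface, sketched for illustration only.
type MultiHashFile interface {
	ContentHashes() (map[string]string, error)
}

// pickHash returns the first digest matching the caller's preference
// order. A server can keep serving an old algorithm alongside a new
// one until clients are ready to drop the old.
func pickHash(f MultiHashFile, prefs ...string) (algo, digest string, err error) {
	sums, err := f.ContentHashes()
	if err != nil {
		return "", "", err
	}
	for _, p := range prefs {
		if d, ok := sums[p]; ok {
			return p, d, nil
		}
	}
	return "", "", errors.New("no acceptable hash available")
}

// fakeFile is a stand-in implementation backed by a fixed map.
type fakeFile map[string]string

func (f fakeFile) ContentHashes() (map[string]string, error) { return f, nil }

func main() {
	f := fakeFile{"crc32c": "cafef00d", "sha256": "abc123"}
	algo, digest, err := pickHash(f, "sha512", "sha256", "crc32c")
	fmt.Println(algo, digest, err)
}
```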
Proposal reconsidered in #60940 |
In the discussion of io/fs and embed, a few people asked for automatic serving of ETag headers for static content, using content hashes. We ran out of time for that in Go 1.16, in part because it's not entirely clear what we should do.
Here is a proposal that may be full of holes.
First, in io/fs, define
Second, in net/http, when serving a File, if it implements ContentHashFile and the ContentHash method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.
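Neither piece shipped. The following is a hedged reconstruction from this discussion - my guess at the shape of the io/fs interface and the net/http alphanumeric check, not the proposal's actual wording:

```go
package main

import (
	"errors"
	"fmt"
)

// ContentHashFile approximates the interface the proposal describes:
// the file returns a precomputed content hash as text, never hashing
// on demand.
type ContentHashFile interface {
	ContentHash() (string, error)
}

// isSafeETagPayload mirrors the proposal's restriction: only use the
// hash as a default ETag if it is plain alphanumeric ASCII (no spaces,
// no Unicode, no symbols), avoiding any kind of header problems.
func isSafeETagPayload(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if !('0' <= c && c <= '9' || 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z') {
			return false
		}
	}
	return true
}

// fixedHashFile is a hypothetical implementation for illustration.
type fixedHashFile string

func (f fixedHashFile) ContentHash() (string, error) {
	if f == "" {
		return "", errors.New("no hash available")
	}
	return string(f), nil
}

// defaultETag returns the quoted ETag the server would set, or ""
// when the hash is unavailable or unsafe.
func defaultETag(f ContentHashFile) string {
	h, err := f.ContentHash()
	if err != nil || !isSafeETagPayload(h) {
		return ""
	}
	return `"` + h + `"`
}

func main() {
	fmt.Println(defaultETag(fixedHashFile("a1b2c3")))
	fmt.Println(defaultETag(fixedHashFile("sha256-abc")) == "")
}
```

Note the tension visible even in this sketch: the alphanumeric restriction would reject the `algo-base64` self-identifying format discussed above, since `-` is not alphanumeric.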