-
-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
caddyhttp: Smarter path matching and rewriting #4948
Conversation
Expected for path matching and rewriting. Fixes #4923.
This allows matching spans of raw/URI-escaped portions of the path.
(Thanks for the great review, @RussellLuo!)
Very well reasoned, but I think this is something that you'll never regret not merging, and that you may very well regret putting into the codebase for a use case that is much more art (abusing constraints) than software engineering. |
@coolaj86 Hiya! (I know I just replied on Twitter, but) I disagree with the part about abusing constraints, but I do think you could be right. This is not my favorite change, but it doesn't violate any standards (and in fact it is perhaps more conforming in some ways than other implementations) and it provides an elegant solution to otherwise hacky or non-compliant alternatives. Unless I have overlooked some significant problem with this approach, I do think it's well-engineered. In some ways it gives Caddy another competitive technical advantage. But yes, I do hope I don't regret this. 😅 |
YOLO! |
// CleanPath cleans path p according to path.Clean(), but only | ||
// merges repeated slashes if collapseSlashes is true, and always | ||
// preserves trailing slashes. | ||
func CleanPath(p string, collapseSlashes bool) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this another thought.
Typically I guess dot segments ("." and "..") are rarely used in URLs, but I'm afraid the behavior of CleanPath
is hard to reason about if dot segments are used along with multiple slashes.
For example, I'm not sure if /foo//./bar
is the result most users expect 🤔
p | CleanPath | path.Clean |
---|---|---|
/foo//./..//./bar | /foo//./bar | /bar |
/foo/./.././bar | /bar | /bar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! I'll look into this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we need to decide if /foo//..
is /
or /foo
, i.e. whether //
counts as a path element (an empty one) or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is an important point!
Obviously, path.Clean chooses /
, but personally I think /foo
is a better choice because URI Empty Path Segments Matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same author made an interesting point about path matching in Application Content URI Rules wildcard syntax:
For the path, an asterisk matches within the path segment. For example, http://example.com/a/*/c will match http://example.com/a/b/c and http://example.com/a//c but not http://example.com/a/b/b/c or http://example.com/a/c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellLuo Thanks for those resources. I think I tend to agree, if we're not merging slashes, then treat empty path components as significant. I pushed a commit just now which makes the new test case pass.
Co-authored-by: RussellLuo <luopeng.he@gmail.com>
This branch resolves several inconsistencies across Caddy's HTTP facilities regarding URI encodings in paths.
I am not entirely sure, but I suppose breaking changes might be possible if users relied on buggy behavior that has only just been determined and is being remedied here.
This PR mainly affects the
path
matcher and therewrite
middleware (including both therewrite
anduri
Caddyfile directives). These are extremely commonly-used Caddy features.Background
URIs (essentially the part of the URL after the scheme and authority/host, e.g.
/foo/bar?a=b#frag
-- though servers don't really deal with#fragment
components) are famous for being inconsistently encoded and parsed. Differences in parsing/handling between servers, proxies, and applications often lead to bugs and security vulnerabilities. For example, a path of//foo/bar
might be considered equivalent to/foo/bar
by one piece of infrastructure, and different to another. Similarly,/foo%2Fbar
might or might not be the same as/foo/bar
. To a router, they could be different. To an application, they could be the same.A web server like Caddy is between a rock and a hard place, because it finds itself between untrusted clients who send all manner of inconsistent requests, and other servers or applications who expect the request URI to be just right. Caddy is often expected to route requests of all varieties and rewrite/transform them into something the backend application (even if that's just the built-in static file server) can use without confusion. The problem is the requirements and expectations vary widely!
Caddy has had several issues over the years where some users expect a URI like
/foo%2Fbar
to be transformed into/foo/bar
before being proxied. Some want/foo/bar
to match/foo%2Fbar
, while others don't. Some want a matcher like/secret/*
to match URIs like//secret/*
or/secret//*
because they put it behind authentication, and if it doesn't match, auth could be bypassed! Windows treats/file.php . ..
the same as/file.php
-- even though they technically have different suffixes and file extensions, causing routing blunders. Then imagine a path prefix like/bands/*/*/
that should match/bands/Pink/Try/
as well as/bands/AC%2FDC/T.N.T
-- but if the path matcher normalizes (decodes) URIs before matching, the first URI would work but the second would become/bands/AC/DC/T.N.T
which doesn't match the pattern anymore. To make matters worse, any given URI has multiple valid encoded forms.%2F%66%6F%6F%2F%62%61%72
can be decoded to/foo/bar
just as well as/foo%2Fbar
can, and everything in between can, too. If routers matched on non-normalized URIs, there would be plenty more security bugs to deal with: a pattern of/foo/*
, which is expected to be authenticated, would no longer match/foo%2Fbar
even though they are, according to ratified RFCs, equivalent.In other words, encodings are significant to applications, but normalizing URIs to a consistent form is critical for maintaining security.
Let me restate here what I wrote for the Laravel community when I started working on this (with minor changes to make sense out of context):
RFC 9110, "HTTP Semantics," has a section on HTTP URI normalization, which says:
In other words,
/foo%2Fbar
and/foo/bar
are equivalent after normalizing, and thus they SHOULD NOT be used for distinct resources. So if you are encoding application data into the path, and that data could possibly have reserved characters / delimiters (like/
), consider redesigning your API: it is not robust in the harsh HTTP environment.Note that several RFCs, notably RFCs 3986 and 9110, continually repeat that URI parsing is dependent upon scheme. That's one other problem: we all use the
http://
orhttps://
scheme and yet expect applications to handle URIs differently. So of course there's going to be head-butting: we're fighting the design.To clarify, it is definitely possible for a URI path such as
/band/AC%2fDC/T.N.T
to be "properly" handled by a server application. For this case, simply write a server that decodes everything after/band/
except%2f
. 🤷♂️ The problem is that this is difficult in general. Depending on what situations you do this, you may be opening yourself to bugs and security holes. This is why Caddy currently handles URIs solely in the unescaped space: it's the "one true" representation of a URI, and normalized HTTP URIs are more or less clearly defined nowadays.Others might propose a solution to double-encode application data in the path; in other words, have the client send a URI with a path of
/bands/AC%252fDC/T.N.T
. This will probably work, but it's a hack and it violates spec:Beware of the non-conforming behavior and highlight it very prominently in documentation so you can avoid bugs.
Laravel user @alcaitiff made a comment that some of you may be thinking:
I can't speak for Laravel or what it's doing, but the Go standard library (what Caddy uses), for example, does do URL parsing correctly and still has this problem. Go does exactly what you and the spec recommend: it splits the URI into its components and then decodes reserved characters after parsing. It preserves the original, "raw" path in the
RawPath
field and offers the decoded path in thePath
field. ItsEscapedPath()
method usesRawPath
if it is a valid encoding ofPath
, which is interesting because any given path has multiple valid encodings as I noted above. So if I want to truly "normalize" the URI in Go, I have to callurl.PathEscape(req.URL.Path)
myself and ignoreRawPath
entirely (AFAIK). And guess what, this converts/foo/bar
to.../foo/bar
. In other words, decoding/foo%2Fbar
is not reversible without loss of precision. (Unless the HTTP server knows your business logic, more on that in a moment.)We can write our own logic, though, that uses
RawPath
as a "hint" (as the Go docs say) to maybe replace/
with%2f
, but if we've manipulated/rewritten the URI at all, this becomes infeasible because we don't know where or if that instance still exists in the string.RFC 3986 section 2 states:
The
/
is in the reserved set. Thus it is up to the implementation to determine whether it is data or a delimiter. I guess Laravel doesn't know, and it's frankly safer to assume it's a delimiter and treat it in its normalized form.So yes, this issue is frustrating. As a web server author, I feel like I need to write software that can read people's minds: is this slash data or is it a delimiter? The router needs more information, because both are very valid ways of interpreting a URI!
The solution
I think the key to this problem is trying to read the developer's mind: is this character supposed to be a delimiter (part of the path) or data? Should we collapse repeated slashes or no?
The answers depend on the context. For routing / path matching, the answer may be one way, for rewriting it may be another, and for proxying it may be yet another depending on the applications being proxied to.
Nginx, Apache, and Caddy all merge slashes by default when matching. However, Nginx and Apache have options to disable that behavior and preserve the slashes, which can lead to security vulnerabilities. All three do path matching (or routing) in the normalized space to mitigate bugs but, like we saw with Laravel, makes it difficult or impossible to route requests with application data that decode as path-significant characters like
%2F
(/
), leaving many developers frustrated.This PR introduces a somewhat novel solution that allows the developer to convey their intent to the server when doing matching and rewriting.
Simply put, our solution is to interpret encoded characters and multiple slashes in the configuration as a literal conveyance of the developer's intent. In other words, we don't blanket-unescape the whole URI every time. We do it byte-for-byte in lock-step with the configured pattern to match, and only unescape if the match pattern is not escaped at that position. Similarly, if a configured path has double slashes
//
in it, we do not merge slashes when comparing paths, because we infer the user's intent is to match repeated slashes.Path matching
Path matching (aka routing) is still done in the normalized space. That means if you configure a path matcher of
/foo/bar
, it will match/foo/bar
,/foo%2Fbar
and even%2F%66%6F%6F%2F%62%61%72
because we normalize the URI. This is unchanged from before.But now if you have a path matcher of
/foo%2Fbar
, it will match/foo%2Fbar
exactly (the escape sequence is case-insensitive), whereas previously it would have only matched/foo%252Fbar
(i.e.%
as data). Now,/foo%2Fbar
will NOT match/foo/bar
or%2Ffoo/bar
because we infer intent from seeing escape sequences in the match pattern as application data, not path delimiters.This logic handily extends to wildcards, too. Referring to the previous example from our Laravel discussion, if you want to use
/bands/*/*
it is impossible to match a URI of/bands/AC%2fDC/T.N.T
(in Laravel, too). But with this change, you can use special "escape-wildcard" characters:/bands/%*/%*
to indicate that the span matched by the wildcard should not be URI-decoded and should be kept in the escaped/raw space.So now, if you want to allow band names to have a
/
in them, you can simply write/bands/%*/%*
.Double slashes
Similar to escape sequences, we now disable slash merging automatically if the configured pattern has repeated slashes. Previously, it was impossible to match
//foo
because all URIs were normalized. Now, a path matcher of//foo
will preserve multiple slashes. (A matcher of/foo
will still match//foo
.)Rewriting
A common task of rewriting is to strip path prefix and path suffix. The logic explained above has also been implemented for these operations, allowing you to use escaped characters and multiple slashes in your prefix and suffix patterns, and now Caddy will rewrite more intuitively and correctly.
For example, if you want to strip a prefix of
//prefix
from//prefix/foo
, it will work, whereas before it wouldn't find the prefix because it would look at a fully-normalized URI.Similarly, you can strip prefixes or suffixes with encoded characters. For example, a prefix of
/foo%2Fbar
will rewrite a URI of/foo%2Fbar/asdf
into/asdf
, whereas before it wouldn't find the prefix.Is it perfect?
Probably not. Are there bugs? Probably. Have I overlooked things? Almost certainly yes. I'm pretty sure there might be nooks and crannies within Caddy that I missed implementing this. Please file a bug report if you need it to work but doesn't work like you expect.
I'm pretty happy with this approach though. I think it's very useful and I don't know of other mainstream servers or frameworks that implement this behavior. In true Caddy fashion, this should just work.
path_regexp
matching whitespace breaksuri strip_prefix
#4801