Handle changing Content-Encoding for inconsistent backends? #3169

Closed
dridi opened this issue Dec 19, 2019 · 32 comments

@dridi
Member

dridi commented Dec 19, 2019

Test case by @daghf:

varnishtest "Ensure we don't override Content-Encoding on a cond fetch"

server s1 {
    rxreq
    txresp -hdr {ETag: "foo"} -body {****}

    rxreq
    expect req.http.If-None-Match == {"foo"}
    txresp -status 304 -hdr {ETag: W/"foo"} -hdr "Content-Encoding: gzip"

    rxreq
    expect req.url == "/1"
    txresp -hdr {Etag: "bar"} -gzipbody {asdf}

    rxreq
    expect req.url == "/1"
    expect req.http.If-None-Match == {"bar"}
    txresp -status 304

} -start

varnish v1 -vcl+backend {
    sub vcl_backend_response {
        set beresp.ttl = 1s;
        set beresp.grace = 0s;
        set beresp.keep = 1d;
    }
} -start

client c1 {
    txreq -hdr "Accept-Encoding: gzip"
    rxresp
    expect resp.status == 200
    expect resp.http.ETag == {"foo"}
    expect resp.http.Content-Length == "4"
    expect resp.http.Content-Encoding == <undef>

    delay 1.5

    txreq -hdr "Accept-Encoding: gzip"
    rxresp
    expect resp.status == 200
    expect resp.http.ETag == {W/"foo"}
    expect resp.http.Content-Length == "4"
    expect resp.http.Content-Encoding == <undef>

    # Also check that we're still OK in the general case
    txreq -url "/1" -hdr "Accept-Encoding: gzip"
    rxresp
    expect resp.status == 200
    expect resp.http.ETag == {"bar"}
    expect resp.http.Content-Encoding == "gzip"

    delay 1.5

    txreq -url "/1" -hdr "Accept-Encoding: gzip"
    rxresp
    expect resp.status == 200
    expect resp.http.ETag == {"bar"}
    expect resp.http.Content-Encoding == "gzip"
} -run
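
To reproduce, the test above can be saved as, say, r3169.vtc (any file name works) and run with varnishtest -v r3169.vtc.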
@nigoroll
Member

nigoroll commented Dec 20, 2019

I am not sure the test case is valid.

A well-behaved backend would

  • not send Content-Encoding: gzip with a 304 matching the ETag if the original 200 with that ETag was not compressed, and
  • have to send Vary: Accept-Encoding if it produced different results based on Accept-Encoding.

VCL can already work around such cases, and I do not think we should special-case the built-in 304 code for such misbehaving backends.
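
One hedged illustration of such a VCL workaround (not spelled out in this thread; the Content-Type filter below is an assumption) is to normalize compression at the Varnish layer, so that whichever backend answers, the cached object ends up gzip-encoded and a later 304 carrying Content-Encoding: gzip no longer contradicts the stored body:

sub vcl_backend_response {
    # Sketch only: gzip compressible responses in Varnish itself so the
    # stored object is gzip-encoded no matter which backend produced it.
    # If the backend already sent gzip, do_gzip is effectively a no-op.
    if (beresp.http.Content-Type ~ "^(text/|application/(json|javascript))") {
        set beresp.do_gzip = true;
    }
}

Since http_gzip_support is on by default, Varnish can still gunzip such objects on delivery for clients that do not send Accept-Encoding: gzip.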

@dridi
Member Author

dridi commented Dec 20, 2019

I believe @daghf worked around the problem for the support case that brought this up, and left me a note saying "warrants more research". I should probably have mentioned that in the first place; I opened this issue so Someone™ could do that research.

@nigoroll
Member

All well and good, but IMHO the result of that research would be that nothing is wrong in varnish-cache.
Please convince me if your result differs.

@carlosabalde

I think the following test shows the problem from a different point of view. In this case it is Varnish that generates the 'wrong' 304 response.

varnishtest "..."

server s_1 {
    rxreq
    txresp -hdr {ETag: "foo"} -body {****}

    rxreq
    expect req.http.If-None-Match == {"foo"}
    txresp -hdr {ETag: W/"foo"} -gzipbody {****}
} -start

varnish v_1 -vcl {
    backend default {
        .host = "${s_1_addr}";
        .port = "${s_1_port}";
    }

    sub vcl_recv {
        return (pass);
    }
} -start

varnish v_2 -vcl {
    backend default {
        .host = "${v_1_addr}";
        .port = "${v_1_port}";
    }

    sub vcl_backend_response {
        # During second client request, at this point
        # 'beresp.http.Content-Encoding' is set to 'gzip' due to 304 response
        # coming from 'v_1', but cached object is not gzipped. Chaos!

        set beresp.ttl = 1s;
        set beresp.grace = 0s;
        set beresp.keep = 1d;
    }
} -start

client c1 -connect ${v_2_sock} {
    txreq -hdr "Accept-Encoding: gzip"
    rxresp
    expect resp.status == 200
    expect resp.http.ETag == {"foo"}
    expect resp.http.Content-Length == "4"
    expect resp.http.Content-Encoding == <undef>

    delay 1.5

    # Broken response: body is uncompressed but tagged as gzipped!
    txreq -hdr "Accept-Encoding: gzip"
    rxresp
    expect resp.status == 200
    expect resp.http.ETag == {W/"foo"}
    expect resp.http.Content-Length == "4"
    expect resp.http.Content-Encoding == "gzip"
} -run

@nigoroll
Member

@carlosabalde no. v_1 is still forwarding a wrong response from s_1, one which it can and should do nothing about. If you chose to configure v_1 to serve a cached response instead of the pass, you should see correct behavior.

@carlosabalde

carlosabalde commented Dec 20, 2019

@nigoroll, just a few comments:

  • Not sure we can say s_1 is generating a wrong response. You can think of it as a round-robin director balancing two servers, one with compression disabled and the other with it enabled. In fact, that's the real scenario that motivated the original support case.

  • I don't think this is correct: 'If you chose to configure v_1 for a cached response instead of the pass, you should see correct behavior'. Setting ttl=0.5s, grace=0s, keep=0s in v_1 won't change anything here. The test is just a toy scenario.

  • Even assuming s_1 is generating a wrong response, I don't think there is anything you can do in VCL (short of forcing uncompressed responses from the backend side) to work around the situation in a multi-layer Varnish setup. In order to do that you'll need something similar to the ideas described in https://varnish-cache.org/lists/pipermail/varnish-dev/2019-October/009469.html.

In any case, I'll leave this one to @daghf. I'm just trying to share some details about the original support case.

@nigoroll
Member

@carlosabalde

  • Backends in a round-robin director need to behave consistently. I know they do not always do so, which is one of the reasons the shard director exists in the first place. If you used shard and always sent the same URL to the same backend, you should not see this issue.
  • By "for a cached response" I mean one where the cache is actually hit.
  • In VCL, you can add the missing Vary (a sketch follows below).
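
A minimal sketch of that last point, assuming the non-compressing backend is declared in VCL under the placeholder name b_nogzip (and keeping in mind the later remark in this thread that adding Vary alone turned out not to be sufficient):

sub vcl_backend_response {
    # Sketch only: add the Vary the non-compressing backend forgot to send,
    # so encoding negotiation is reflected in cache lookups.
    # "b_nogzip" is a placeholder for that backend's VCL name.
    if (beresp.backend == b_nogzip &&
        beresp.http.Vary !~ "(?i)accept-encoding") {
        if (beresp.http.Vary) {
            set beresp.http.Vary = beresp.http.Vary + ", Accept-Encoding";
        } else {
            set beresp.http.Vary = "Accept-Encoding";
        }
    }
}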

@rezan
Member

rezan commented Dec 20, 2019

The scenario described by Carlos is pretty legit. Think about a fallback director where we fall back to a different origin and get different gzip behavior. The fact that I have seen this twice tells me this is a real issue.

@carlosabalde

Again, trying to provide some insight about the original support case & focusing on the workaround part:

  • In fact the shard director is already being used, but the shard criterion is user-based and cannot be changed (i.e. local sessions, etc.). Therefore, it is not an option here.

  • I don't see how injecting a Vary: Accept-Encoding header in VCL would change anything. You can assume a fixed Accept-Encoding: gzip and this would still be an issue.

@nigoroll
Member

nigoroll commented Dec 20, 2019

@rezan Still, backends need to behave consistently.

@carlosabalde You would add the Vary for the backend that does not send a gzip response.

And there are probably many more fixups possible, like adding a server id to the ETag.
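
A rough, entirely hypothetical sketch of the server-id idea (the backend names b1 and b2 and the "+b1"/"+b2" suffix scheme are invented here): tag each backend's ETag on the way in, and only strip the tag when revalidating against the backend that issued it, so a foreign backend can never answer 304 for an object it did not produce.

sub vcl_backend_response {
    # Sketch only: append a per-backend tag inside the quoted ETag value,
    # e.g. "foo" from backend b1 becomes "foo+b1".
    if (beresp.http.ETag) {
        if (beresp.backend == b1) {
            set beresp.http.ETag = regsub(beresp.http.ETag, {""$"}, {"+b1""});
        } elsif (beresp.backend == b2) {
            set beresp.http.ETag = regsub(beresp.http.ETag, {""$"}, {"+b2""});
        }
    }
}

sub vcl_backend_fetch {
    # Sketch only: strip the tag belonging to the backend we are about to
    # revalidate against; a tag from the other backend stays in place, so
    # its ETag can never match and it has to send a full 200.
    if (bereq.http.If-None-Match) {
        if (bereq.backend == b1) {
            set bereq.http.If-None-Match =
                regsub(bereq.http.If-None-Match, {"\+b1""}, {"""});
        } elsif (bereq.backend == b2) {
            set bereq.http.If-None-Match =
                regsub(bereq.http.If-None-Match, {"\+b2""}, {"""});
        }
    }
}

Whether that complexity is preferable to simply keeping the backends consistent is exactly the trade-off being argued here.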

@nigoroll
Member

@carlosabalde also, I do not understand how, if you shard by session id, you would get a 304 from another server for a personalized cache object.

@carlosabalde

@nigoroll the problem is that both fixups are not VCL-based. But I get your point.

About the sharding thing: for example, user U1 (session S1, mapped to server B1) fetches a URL (think of non-personalized content like a CSS file) from B1 (obviously not all URLs are personalized, but you cannot know that in advance just by looking at the URL; you can also assume personalized URLs won't be cached). Some time later the object's TTL expires. User U2 (session S2, mapped to server B2) fetches the same URL, this time from B2. That's it.

@rezan
Member

rezan commented Dec 20, 2019

If you fall back to an alternate origin, I'm not sure why we would expect consistent behavior. And again, I've seen this twice, so I'm not sure why @nigoroll is digging in so hard here. Let's fix the problem so others don't fall into the same trap and blame Varnish for corrupted responses.

@dridi
Member Author

dridi commented Dec 20, 2019

@nigoroll by insisting on backends behaving consistently you close the door to practices like A/B testing or rolling upgrades in a cluster. And for that matter, even with a single backend, enabling compression may be done at run time while there are objects in the cache.

I mean, it's expected to have poor HTTP support from HTTP servers with such a complicated protocol. Even varnishd will trip on its own feetures when it comes to that:

varnishtest "fantom client accept-encoding"

server s1 -repeat 2 {
	rxreq
	txresp -gzipbody hello
} -start

varnish v1 -vcl+backend "" -start

client c1 {
	txreq
	rxresp
	expect resp.http.content-encoding == <undef>
	expect resp.http.vary != <undef>
} -run

varnish v1 -cliok "param.set http_gzip_support off"

client c2 {
	txreq
	rxresp
	expect resp.http.content-encoding == gzip
	expect resp.http.vary != <undef>
} -run

# not getting a hit despite two identical client requests, dafuq?
varnish v1 -expect cache_hit == 0
varnish v1 -expect cache_miss == 2

I understand that this is something we can work around in VCL, but I find it a bit harsh to close the issue before the "warranted research" was conducted. I opened this issue to make sure we keep track of this and give time to @daghf or someone else to do it.

Can we agree to reopen?

@nigoroll
Member

First of all, I want to apologize for having provided at least incomplete advice on the VCL workaround. Adding Vary: Accept-Encoding alone is not sufficient.

I closed this issue because I fundamentally oppose changing the straightforward header merge in core for this special case.

@nigoroll nigoroll reopened this Dec 21, 2019
@nigoroll
Member

And by the way, @dridi, I do not see how changing http_gzip_support would be related, or what would be wrong about it. The "fantom" test just demonstrates how gzip support works to begin with.

@nigoroll nigoroll changed the title Ensure we don't override Content-Encoding on a cond fetch Handle changing Content-Encoding for inconsistent backends? Dec 21, 2019
@nigoroll
Member

nigoroll commented Dec 21, 2019

Let me summarize everything that is going wrong here:

  • The backends are configured inconsistently
  • The backends provide identical Etags for different bodies
  • The backends should not send Content-Encoding with the 304. From RFC 7232:

Since the goal of a 304 response is to minimize information transfer when the recipient already has one or more cached representations, a sender SHOULD NOT generate (any other headers but Cache-Control, Content-Location, Date, ETag, Expires, and Vary) unless said metadata exists for the purpose of guiding cache updates (e.g., Last-Modified might be useful if the response does not have an ETag field).

@nigoroll
Member

bugwash: sub vcl_backend_refresh from #3102 would also provide a convenient way to fix this in VCL.
@dridi and I seem to agree that we want less policy in core, not more.

@rezan
Member

rezan commented Dec 23, 2019

Isn't this still a bug? We are corrupting response bodies. Adding a new VCL state and telling users to fix this themselves isn't really a fix, is it?

@nigoroll
Member

nigoroll commented Dec 23, 2019

@rezan As explained before, I do not think this is a Varnish bug. But making VCL workarounds/fixes easier would be a good thing, and I would even consider adding unset beresp.http.Content-Encoding to sub vcl_backend_refresh of the builtin.vcl.
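
For reference, that could look roughly like the following (a sketch only: vcl_backend_refresh is the subroutine proposed in #3102 and does not exist in any released Varnish at this point):

# Hypothetical builtin.vcl addition, assuming the vcl_backend_refresh
# subroutine from #3102 runs on the 304 response before its headers are
# merged into the stale object.
sub vcl_backend_refresh {
    # A 304 should not be updating representation metadata such as
    # Content-Encoding (see the RFC 7232 quote earlier in this thread).
    unset beresp.http.Content-Encoding;
}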

@rezan
Member

rezan commented Dec 23, 2019

Ok, then I think we just need consensus that this isn't a bug.

Just to confirm, you are saying that Varnish now requires consistent backends? If you have inconsistent or different backends, there is a risk Varnish will corrupt responses?

@rezan rezan closed this as completed Dec 23, 2019
@rezan rezan reopened this Dec 23, 2019
@rezan
Member

rezan commented Dec 23, 2019

Sorry, my mistake, I hit the wrong button.

@nigoroll
Member

@rezan please try to avoid putting words into my mouth. I do not think I said any of the things you are asking whether I said.

now requires consistent backends?

In my mind, the Varnish approach has always been not to handle evil or bad backends, but rather work under the assumption that varnish administrators know their backends. In #3169 (comment) I am giving reasons why this case is about misbehaving backends.

If you have inconsistent or different backends, there is a risk Varnish will corrupt responses?

Sure, this has always been the case. Take, for instance, backend A (momentarily) redirecting URL-A to URL-B and backend B redirecting URL-B to URL-A. Now if you happen to use the shard director, sending URL-A to backend A and URL-B to backend B, you get a 301 loop. That is just as real as the case discussed here, and still it is not Varnish's business to fix it. VCL provides all the tools to avoid the example I gave, just as it provides everything necessary to avoid the issue documented here.

I also agree that we lack options to handle 304s and just wrote how I think this issue could be properly handled from VCL once we get sub vcl_backend_refresh.

@rezan
Member

rezan commented Dec 23, 2019

A 301 redirect loop is not a response corruption.

@dridi
Member Author

dridi commented Dec 23, 2019

Can we stop the discussion, let Someone™ brave enough go through the relevant RFCs (warranted research, anyone?) and come back with concrete conclusions before we engage in further disagreement?

@nigoroll
Member

nigoroll commented Dec 23, 2019

@rezan I think I gave a comprehensive answer, illustrated with an example. I also gave a list of reasons why I think this is not a varnish bug and rather a backend issue. I do not think your response picking on my example is helpful, in particular in light of the constructive suggestions I have made going forward.

@dridi please do. I have already shared the result from my own research.

@bsdphk
Contributor

bsdphk commented Jan 3, 2020

So...

@nigoroll is right that backends using the same ETag for different objects are just plain wrong, no matter what the difference between the objects is.

However, that is exactly the kind of real-world f**kup VCL is supposed to be able to paint over.

It follows that this is a situation which should be dealt with in VCL, not in C-code.

If it is not currently possible to handle this situation in VCL, we need to look at that.

Varnish's gzip support is also a "deal with" facility for backends not doing gzip properly, or at all.

It can be argued that such backends barely exist any more and that Varnish's gzip support should be retired, but that is probably still premature.

@rezan
Member

rezan commented Jan 5, 2020

In both cases where I have seen this, Varnish is the bad backend that sends a 304 response with gzip and a matching weakened ETag. This was demonstrated by Carlos in this comment:

#3169 (comment)

So the problem description here is:

When using Varnish in a multi-tier setup and the origin serves a gzip variation (weak ETag), Varnish can corrupt responses.

@nigoroll
Member

nigoroll commented Jan 6, 2020

Re @bsdphk: vcl_backend_refresh as suggested in #3102 is so far the best candidate to solve this and other issues in VCL. gzip is a different story which has more aspects to it, and we already have ways to control it from VCL, so I think we should keep it out of the game.
Re @rezan: incomplete handling in VCL is still not a Varnish bug. We have already discussed #3169 (comment) and I do not see any new points in your comment.

@nigoroll
Member

Actually, I have changed my mind 180° on this one. An HTTP core draft now contains:

Caches MUST-NOT update the following header fields:
Content-Encoding, Content-Length, Content-MD5, Content-Range, ETag.

So while we should still have vcl_backend_refresh, I do now think that we should actually enforce the above in core code.

@nigoroll
Member

Consensus between @bsdphk, @dridi, and myself: @dridi will merge #3281 to master, then close this ticket and continue in #3102.

@nigoroll
Member

Specific fix in d921b1f, to be generalized in #3102.
