-
Notifications
You must be signed in to change notification settings - Fork 2
feat: Make the partner timeouts "softer" during initial cache loading. #337
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+ w/ a comment.
NOTE: this changes the behavior of the `CONTILE_TEST_MODE` flag to be one of several states: **TestFakeResponse** = return a fake response from the ADM server (previous "true") **TestTimeout** = "timeout" a request to the ADM server.
* Added "test_mode" to heartbeat to verify test mode
With f173d25, I expanded the I updated the old, crappy one to illustrate how to check for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really sorry, I only glanced at this PR last month and thought it was doing something else.
The source of the 503 errors on startup is contile itself, the handler returns them when the cache contains a TilesState::Populating
entry. Requests set that state when they have a request to adM inflight, preventing subsequent requests from needlessly making the same redundant adM request until the original finishes (#248).
With that change we've ended only getting a pretty number of AdmServerError
s (pretty much all are timeouts) -- they're all in sentry.
So I think we can just have that line in the handler return 204s instead.
setup_module(test_mode="TestTimeout") | ||
url = "{root}/v1/tiles".format(root=settings.get("test_url")) | ||
resp = requests.get(url, headers=default_headers(test="")) | ||
assert resp.status_code == 204 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With AdmLoadError
going through the error handling path we might as well also assert not resp.content
. I think actix-web
doesn't render anything for 204s despite our error handler setting a JSON response body but let's test it.
@@ -192,3 +200,12 @@ def test_bad_click_host(self, settings): | |||
assert len(tiles) == 2 | |||
names = map(lambda tile: tile.get("name").lower(), tiles) | |||
assert list(names) == ["acme", "dunder mifflin"] | |||
|
|||
def test_aatimeout(self, settings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we don't actually run this test suite in CI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we don't. There's a comment above noting that I don't know how to best test this using the current integration tests, but created this local test to verify that it's working.
.unwrap_or_else(|| Duration::from_secs(0)) | ||
<= Duration::from_secs(state.settings.adm_timeout) | ||
{ | ||
HandlerErrorKind::AdmLoadError().into() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally convinced we need all this because we don't seem to generate too many AdmServerError
s AFAICT but I'm ok keeping it and seeing its actual usage.
Though I'm thinking AdmLoadError
should emit a metric instead of going to sentry (just needs a line added to HandlerErrorKind::metric_label
) since we know exactly what causes it. I'd say the the same for AdmServerError
(at least when e.is_timeout()
which is the majority of its cases) but let's worry about that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I think AdmLoadError
should be a sentry event is because it's both significant and actionable, it's just that during the initial startup, there can be a lot of false events because we're swamping the sources. If we make it a metric, we're less inclined to take action. I am not going to make this a strong stand, though, so if we see different behaviors in the future, I'm find converting this to a metric.
Closes #336