Skip to content
This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

feat: Make the partner timeouts "softer" during initial cache loading. #337

Merged
merged 9 commits into from
Jan 11, 2022

Conversation

jrconlin
Copy link
Member

Closes #336

@jrconlin jrconlin requested a review from a team December 14, 2021 23:29
src/error.rs Outdated Show resolved Hide resolved
ncloudioj
ncloudioj previously approved these changes Dec 16, 2021
Copy link
Collaborator

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ w/ a comment.

NOTE:
this changes the behavior of the `CONTILE_TEST_MODE` flag to be one of
several states:

**TestFakeResponse** = return a fake response from the ADM server
(previous "true")
**TestTimeout** = "timeout" a request to the ADM server.
* Added "test_mode" to heartbeat to verify test mode
@jrconlin
Copy link
Member Author

With f173d25, I expanded the CONTILE_TEST_MODE flag to indicate states to run the server. I'm not sure how to best specify that using the existing integration tests.

I updated the old, crappy one to illustrate how to check for it.

Copy link
Member

@pjenvey pjenvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry, I only glanced at this PR last month and thought it was doing something else.

The source of the 503 errors on startup is contile itself, the handler returns them when the cache contains a TilesState::Populating entry. Requests set that state when they have a request to adM inflight, preventing subsequent requests from needlessly making the same redundant adM request until the original finishes (#248).

With that change we've ended only getting a pretty number of AdmServerErrors (pretty much all are timeouts) -- they're all in sentry.

So I think we can just have that line in the handler return 204s instead.

@jrconlin jrconlin requested a review from pjenvey January 11, 2022 17:53
setup_module(test_mode="TestTimeout")
url = "{root}/v1/tiles".format(root=settings.get("test_url"))
resp = requests.get(url, headers=default_headers(test=""))
assert resp.status_code == 204
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With AdmLoadError going through the error handling path we might as well also assert not resp.content. I think actix-web doesn't render anything for 204s despite our error handler setting a JSON response body but let's test it.

@@ -192,3 +200,12 @@ def test_bad_click_host(self, settings):
assert len(tiles) == 2
names = map(lambda tile: tile.get("name").lower(), tiles)
assert list(names) == ["acme", "dunder mifflin"]

def test_aatimeout(self, settings):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we don't actually run this test suite in CI?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we don't. There's a comment above noting that I don't know how to best test this using the current integration tests, but created this local test to verify that it's working.

.unwrap_or_else(|| Duration::from_secs(0))
<= Duration::from_secs(state.settings.adm_timeout)
{
HandlerErrorKind::AdmLoadError().into()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally convinced we need all this because we don't seem to generate too many AdmServerErrors AFAICT but I'm ok keeping it and seeing its actual usage.

Though I'm thinking AdmLoadError should emit a metric instead of going to sentry (just needs a line added to HandlerErrorKind::metric_label) since we know exactly what causes it. I'd say the the same for AdmServerError (at least when e.is_timeout() which is the majority of its cases) but let's worry about that later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I think AdmLoadError should be a sentry event is because it's both significant and actionable, it's just that during the initial startup, there can be a lot of false events because we're swamping the sources. If we make it a metric, we're less inclined to take action. I am not going to make this a strong stand, though, so if we see different behaviors in the future, I'm find converting this to a metric.

@jrconlin jrconlin merged commit 99cacad into main Jan 11, 2022
@jrconlin jrconlin deleted the feat/336-no-500 branch January 11, 2022 23:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not return 500 error when filling cache
4 participants