feat: Make the partner timeouts "softer" during initial cache loading. #337

jrconlin · 2021-12-14T23:29:03Z

Closes #336

src/error.rs

ncloudioj

r+ w/ a comment.

NOTE: this changes the behavior of the `CONTILE_TEST_MODE` flag to be one of several states: **TestFakeResponse** returns a fake response from the ADM server (the previous "true" behavior); **TestTimeout** simulates a timed-out request to the ADM server.

* Added "test_mode" to heartbeat to verify test mode
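As a rough illustration of the flag change, the sketch below models `CONTILE_TEST_MODE` as a small set of named states. It is written in Python (the language of the integration tests) rather than contile's Rust, and the `NoTest` state name and the `parse_test_mode` helper are assumptions made for the example, not the server's actual parsing code.

```python
from enum import Enum


class TestMode(Enum):
    """States for CONTILE_TEST_MODE as described in the note above."""

    NO_TEST = "NoTest"  # hypothetical name for "flag not set"
    TEST_FAKE_RESPONSE = "TestFakeResponse"  # return a fake ADM response
    TEST_TIMEOUT = "TestTimeout"  # simulate a timed-out ADM request


def parse_test_mode(raw):
    """Map a raw flag value onto a TestMode (illustrative only)."""
    if not raw:
        # Unset or empty flag: run normally.
        return TestMode.NO_TEST
    # Enum lookup by value raises ValueError for an unrecognized state.
    return TestMode(raw)
```

A caller would typically pass `os.environ.get("CONTILE_TEST_MODE")` to `parse_test_mode`; rejecting unknown values loudly is a design choice of this sketch, not something the PR states.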

jrconlin · 2021-12-17T23:44:32Z

With f173d25, I expanded the CONTILE_TEST_MODE flag to indicate states to run the server. I'm not sure how to best specify that using the existing integration tests.

I updated the old, crappy one to illustrate how to check for it.

pjenvey

I'm really sorry, I only glanced at this PR last month and thought it was doing something else.

The source of the 503 errors on startup is contile itself: the handler returns them when the cache contains a TilesState::Populating entry. Requests set that state when they have a request to adM in flight, preventing subsequent requests from making the same redundant adM request until the original finishes (#248).

With that change we've ended up getting only a pretty small number of AdmServerErrors (pretty much all are timeouts); they're all in Sentry.

So I think we can just have that line in the handler return 204s instead.
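The behavior described in this comment can be sketched roughly as follows. This is a simplified, single-threaded Python model, not contile's Rust handler: only the `Populating` state name comes from the discussion, while `FRESH`, `handle_request`, `fetch_from_adm`, and the dict-based cache are hypothetical stand-ins.

```python
from enum import Enum, auto


class TilesState(Enum):
    POPULATING = auto()  # an adM request for this key is already in flight
    FRESH = auto()  # hypothetical name: tiles are cached and ready to serve


def handle_request(cache, key, fetch_from_adm):
    """Return (status, body) for a tiles request against a shared cache."""
    entry = cache.get(key)
    if entry is not None:
        state, tiles = entry
        if state is TilesState.POPULATING:
            # Another request is already fetching these tiles. Reply with
            # 204 No Content (the proposal here) instead of a 503.
            return 204, None
        return 200, tiles

    # Mark the entry so later requests do not repeat the adM call.
    cache[key] = (TilesState.POPULATING, None)
    try:
        tiles = fetch_from_adm(key)
    except Exception:
        # Do not leave a stale POPULATING marker behind on failure.
        cache.pop(key, None)
        raise
    cache[key] = (TilesState.FRESH, tiles)
    return 200, tiles
```

In the real service the marker matters because requests run concurrently; the model above only shows which status each cache state maps to.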

pjenvey · 2022-01-11T20:27:17Z

tools/test/integration_test.py

+        setup_module(test_mode="TestTimeout")
+        url = "{root}/v1/tiles".format(root=settings.get("test_url"))
+        resp = requests.get(url, headers=default_headers(test=""))
+        assert resp.status_code == 204


With AdmLoadError going through the error handling path, we might as well also `assert not resp.content`. I think actix-web doesn't render anything for 204s despite our error handler setting a JSON response body, but let's test it.
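A minimal sketch of the extra check suggested here, assuming the response object behaves like a `requests.Response` (integer `status_code`, raw body bytes in `content`). The `assert_empty_204` helper name is hypothetical and not part of the test suite.

```python
from types import SimpleNamespace


def assert_empty_204(resp):
    """Check that a response is a 204 and carries no body."""
    assert resp.status_code == 204
    # `content` holds the raw body as bytes; empty bytes are falsy.
    assert not resp.content


# Stand-in object so the helper can be exercised without a running server.
assert_empty_204(SimpleNamespace(status_code=204, content=b""))
```

In `test_aatimeout` the same helper could be called on the `resp` returned by `requests.get`.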

pjenvey · 2022-01-11T20:27:48Z

tools/test/integration_test.py

@@ -192,3 +200,12 @@ def test_bad_click_host(self, settings):
        assert len(tiles) == 2
        names = map(lambda tile: tile.get("name").lower(), tiles)
        assert list(names) == ["acme", "dunder mifflin"]
+
+    def test_aatimeout(self, settings):


It seems like we don't actually run this test suite in CI?

No, we don't. There's a comment above noting that I don't know how best to test this using the current integration tests, but I created this local test to verify that it's working.

pjenvey · 2022-01-11T20:34:23Z

src/adm/tiles.rs

+                            .unwrap_or_else(|| Duration::from_secs(0))
+                            <= Duration::from_secs(state.settings.adm_timeout)
+                    {
+                        HandlerErrorKind::AdmLoadError().into()
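The comparison in this hunk can be read as a startup grace window. The Python sketch below models it under two assumptions that the visible lines do not confirm: that the compared duration is the time elapsed since the server started, and that the fallback outside the window is `AdmServerError`. The `classify_adm_timeout` name is hypothetical.

```python
from datetime import timedelta


def classify_adm_timeout(elapsed_since_start, adm_timeout_secs):
    """Pick the error kind for a failed adM request (illustrative model)."""
    # Mirrors `unwrap_or_else(|| Duration::from_secs(0))`: an unknown
    # elapsed time is treated as zero, which counts as "still starting up".
    if elapsed_since_start is None:
        elapsed_since_start = timedelta(0)
    if elapsed_since_start <= timedelta(seconds=adm_timeout_secs):
        # Within the window: the softer error kind from this PR.
        return "AdmLoadError"
    # Assumed fallback once the window has passed.
    return "AdmServerError"
```

Note that the comparison is inclusive (`<=`), so a failure exactly at `adm_timeout` seconds still falls inside the window.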


I'm not totally convinced we need all this, because we don't seem to generate too many AdmServerErrors AFAICT, but I'm ok keeping it and seeing its actual usage.

Though I'm thinking AdmLoadError should emit a metric instead of going to sentry (just needs a line added to HandlerErrorKind::metric_label) since we know exactly what causes it. I'd say the same for AdmServerError (at least when e.is_timeout(), which is the majority of its cases) but let's worry about that later.

The reason I think AdmLoadError should be a sentry event is that it's both significant and actionable; it's just that during the initial startup there can be a lot of false events because we're swamping the sources. If we make it a metric, we're less inclined to take action. I am not going to make this a strong stand, though, so if we see different behaviors in the future, I'm fine converting this to a metric.
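For context, the trade-off discussed in this thread can be sketched as a routing decision. This Python model assumes that an error kind with a metric label is counted as a metric and anything without one goes to Sentry; the label string, the `report` function, and the callback parameters are hypothetical, and only the `metric_label` name comes from the comment above. It shows the alternative that was proposed, not the behavior that was merged.

```python
def metric_label(error_kind):
    """Return a metric label for kinds reported as metrics, else None."""
    labels = {
        # Hypothetical entry: the single line the suggestion would add.
        "AdmLoadError": "tiles.adm.load_error",
    }
    return labels.get(error_kind)


def report(error_kind, emit_metric, send_to_sentry):
    """Route an error to a metric counter or to Sentry, never both."""
    label = metric_label(error_kind)
    if label is not None:
        emit_metric(label)
    else:
        send_to_sentry(error_kind)
```

With this shape, switching an error between the two destinations is a one-line change to the label table, which is why the decision could be deferred.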

feat: Make the partner timeouts "softer" during initial cache loading.

2397506

Closes #336

jrconlin requested a review from a team December 14, 2021 23:29

ncloudioj reviewed Dec 16, 2021

ncloudioj previously approved these changes Dec 16, 2021

f return 204, brace for integration test

f173d25

jrconlin dismissed ncloudioj’s stale review via f173d25 December 17, 2021 00:58

f fixup stupid integration test to check 204 on timeout.

1bf299b

jrconlin and others added 3 commits January 4, 2022 14:10

Merge branch 'main' into feat/336-no-500

2de9e28

f ignore rustsec-2021-0131

a092ab1

Merge branch 'feat/336-no-500' of github.com:mozilla-services/contile…

94c7a2f

… into feat/336-no-500

pjenvey suggested changes Jan 10, 2022

Merge branch 'main' of github.com:mozilla-services/contile into feat/…

ef4e184

…336-no-500

jrconlin requested a review from pjenvey January 11, 2022 17:53

f missed merge conflict

92c2ace

pjenvey suggested changes Jan 11, 2022

f r's

7030fd3

pjenvey approved these changes Jan 11, 2022

jrconlin merged commit 99cacad into main Jan 11, 2022

jrconlin deleted the feat/336-no-500 branch January 11, 2022 23:06
