This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Return request ID in HTTP response headers #10854

Closed
wants to merge 6 commits

Conversation

DMRobertson
Contributor

We already have something like this when opentracing is enabled (see #10199), but that's only across federation.

When investigating test failures it's really useful to cross-reference HTTP responses with Synapse logs. I propose exposing the request ID to facilitate that. I've also tried to make the request IDs unique across all workers by including an instance name.

See also matrix-org/sytest#1144.

David Robertson added 2 commits September 20, 2021 15:04
and include this information in the HTTP response headers
Comment on lines -86 to +98
self._processing_finished_time = None
self._processing_finished_time: Optional[float] = None

# what time we finished sending the response to the client (or the connection
# dropped)
self.finish_time = None
self.finish_time: Optional[float] = None
Contributor Author

Not sure why mypy started being unhappy, but it was interpreting these as being of type NoneType throughout the program, causing problems later.
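
For context, a minimal standalone sketch of the mypy behaviour being described here (illustrative names, not Synapse's actual code):

from typing import Optional


class RequestTimings:
    def __init__(self) -> None:
        # With default settings, mypy infers this attribute as type "None"
        # from the bare assignment, so assigning a float later is an error.
        self.finish_time = None

        # With an explicit annotation, later float assignments type-check.
        self._processing_finished_time: Optional[float] = None

    def mark_finished(self, now: float) -> None:
        self.finish_time = now                   # mypy: incompatible types
        self._processing_finished_time = now     # OK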

@DMRobertson DMRobertson requested a review from a team September 20, 2021 14:23
@DMRobertson DMRobertson removed the request for review from a team September 20, 2021 14:52
David Robertson added 3 commits September 20, 2021 17:00
@DMRobertson DMRobertson requested a review from a team September 20, 2021 17:27
@clokep
Member

clokep commented Sep 20, 2021

But that's only across federation.

I don't think this is true. inject_response_headers gets called from two places:

  • BaseFederationServlet -- used by federation code as you mention.
  • trace_servlet -- which is called by _AsyncResource, which is how most of the C-S API is implemented.

It is only used if opentracing is enabled (and I'm not 100% sure what is included as the header).

@reivilibre reivilibre assigned reivilibre and unassigned reivilibre Sep 21, 2021
Contributor

@reivilibre reivilibre left a comment

I don't know if Patrick's comment above invalidates this PR or not, but I took a look.

Not entirely sure about the potential for log pollution; compare POST-424242 vs worker-event_persister18/POST-424242, especially when the worker ID is redundant/implied in the log file itself (as far as I know?).

@@ -0,0 +1 @@
Include a request id in Synapse's HTTP responses to aid debugging.
Contributor

Suggested change
Include a request id in Synapse's HTTP responses to aid debugging.
Include a request ID in Synapse's HTTP responses to aid debugging.

def get_request_id(self):
return "%s-%i" % (self.get_method(), self.request_seq)
def get_request_id(self) -> str:
return f"{self._instance_name}/{self.get_method()}-{self.request_seq}"
Contributor

Is this the same thing that winds up in the logs?
I can see this perhaps making things a bit noisier, but not sure.

Member

yes, this is what goes in the logs. I'm not enthusiastic about adding this noise to them.
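
Purely for illustration, the shape of the request IDs before and after this change (the worker name and sequence number below are made up, following the examples in this thread):

method, seq = "POST", 424242
instance_name = "worker-event_persister18"

old_style = "%s-%i" % (method, seq)            # "POST-424242"
new_style = f"{instance_name}/{method}-{seq}"  # "worker-event_persister18/POST-424242"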

Comment on lines +384 to +386
# TODO can we avoid the cast? Maybe take finish_time as an explicit float param?
response_send_time = (
cast(float, self.finish_time) - self._processing_finished_time
Contributor

I think this is slightly less evil.

Suggested change
# TODO can we avoid the cast? Maybe take finish_time as an explicit float param?
response_send_time = (
cast(float, self.finish_time) - self._processing_finished_time
assert self.finish_time is not None
response_send_time = (
self.finish_time - self._processing_finished_time

Member

+1. Or raise an explicit Exception.
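
For concreteness, a standalone sketch of the "raise an explicit Exception" alternative (the function and parameter names are illustrative, not the actual Synapse attributes):

from typing import Optional


def compute_response_send_time(
    finish_time: Optional[float], processing_finished_time: float
) -> float:
    if finish_time is None:
        # Narrow the Optional explicitly and fail loudly if the invariant
        # (finish_time is set by the time the response completes) is violated.
        raise RuntimeError("finish_time was not set before the response finished")
    return finish_time - processing_finished_time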

@@ -188,7 +188,7 @@ def test_with_request_context(self):
]
self.assertCountEqual(log.keys(), expected_log_keys)
self.assertEqual(log["log"], "Hello there, wally!")
self.assertTrue(log["request"].startswith("POST-"))
self.assertIn("POST-", log["request"])
Contributor

Or, alternatively:

Suggested change
self.assertIn("POST-", log["request"])
self.assertTrue(log["request"].startswith("worker-test/POST-"))

@reivilibre reivilibre requested a review from a team September 21, 2021 09:30
@DMRobertson
Contributor Author

DMRobertson commented Sep 21, 2021

I don't know if Patrick's comment above invalidates this PR or not, but I took a look.

It sounds like it does.

Not entirely sure about the potential for log pollution; compare POST-424242 vs worker-event_persister18/POST-424242, especially when the worker ID is redundant/implied in the log file itself (as far as I know?).

What I want here is to be able quickly backtrack from a client HTTP response to the corresponding server-side processing. I've found myself wanting this when investigating sytests. A sytest run makes thousands of requests and it's a pain to have to cross-reference timestamps---never mind working out which worker's log file contains the relevant logs. I want a unique string I can bang into grep -r or rg and just find the logs I'm after.

I think this is useful to have, even in the absence of full opentracing telemetry. It might be that

@reivilibre
Contributor

I want a unique string I can bang into grep -r or rg and just find the logs I'm after.

I think this is a fair use-case. I personally feel somewhat inclined to say that the request ID in the logs themselves shouldn't make mention of the worker ID; but the header should.

If you make the header format X-Request-Id: POST-424242 event_persister18, then a rg POST-424242 event_persister18.log doesn't seem so bad (assuming the sytest log file names follow a sensible pattern like that).

I think thoughts from someone on the team who does a bit more of that might be useful.

@richvdh
Member

richvdh commented Sep 23, 2021

I'm afraid I'm a little unconvinced that we want to be adding this header as well as the existing Synapse-Trace-Id header, which, as @clokep has mentioned, is returned for C-S requests provided opentracing is enabled (even if the opentracing span doesn't end up being sent on to the Jaeger collector).

In order for Synapse-Trace-Id to be useful for your usecase, I think two things are needed:

  • We should log the value of the Synapse-Trace-Id header in the final "Processed" line for each request. I think you can just fish it out of self.responseHeaders in _finished_processing (a minimal sketch follows this list).
  • We need to actually give requests a trace-id when running in sytest. One way to do this is to enable opentracing under sytest; another is to update synapse.logging.opentracing.trace_servlet and BaseFederationServlet to generate a trace ID internally even if opentracing isn't enabled. I think the latter would be generally useful and would favour that?
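
A minimal sketch of the first suggestion above, assuming Twisted's Headers API and the Synapse-Trace-Id header name from this discussion; the helper function itself is hypothetical, not existing Synapse code:

from typing import Optional

from twisted.web.http_headers import Headers


def extract_trace_id(response_headers: Headers) -> Optional[str]:
    # Fish the trace ID back out of the response headers so it can be
    # included in the final "Processed" log line for the request.
    values = response_headers.getRawHeaders(b"Synapse-Trace-Id")
    if not values:
        return None
    return values[0].decode("ascii")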

That still leaves you with the problem of figuring out which worker handled a particular request. A few thoughts here:

  • can you just grep across the likely logfiles for the right Trace-Id? Generally it shouldn't be too hard to find once you figure out how your Synapse instance is routing the different types of request?
  • we could include the worker name in another response header?

@@ -377,6 +377,7 @@ def _listen_http(self, listener_config: ListenerConfig):
self.version_string,
max_request_body_size=max_request_body_size(self.config),
reactor=self.get_reactor(),
instance_name=f"worker-{site_tag}",
Member

I don't think site_tag is quite what we want, is it? Typically it's just the port number.

@DMRobertson
Contributor Author

DMRobertson commented Oct 4, 2021

Many thanks all.

another is to update synapse.logging.opentracing.trace_servlet and BaseFederationServlet to generate a trace ID internally even if opentracing isn't enabled. I think the latter would be generally useful and would favour that?

Absolutely, this sounds ideal. And to be explicit, I'd include this header in the response even if opentracing was disabled.

  • can you just grep across the likely logfiles for the right Trace-Id? Generally it shouldn't be too hard to find once you figure out how your Synapse instance is routing the different types of request?

That's fine by me, assuming the trace IDs are unique across workers (guessing they're a UUID of some kind?).
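
As a hypothetical sketch only, generating a trace ID internally when opentracing is disabled could be as simple as a random 64-bit hex string (the function name and format here are assumptions, not Synapse's actual behaviour):

import secrets


def generate_fallback_trace_id() -> str:
    # 16 hex characters (64 random bits): unique across workers with
    # overwhelming probability, so it can be grepped for directly.
    return secrets.token_hex(8)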

@DMRobertson
Contributor Author

I'm going to raise a new issue to summarise the discussion and close this. Many thanks all.
