
Track when the pulled event signature fails #13815

Merged
1 change: 1 addition & 0 deletions changelog.d/13815.feature
@@ -0,0 +1 @@
+Keep track of when an event pulled over federation fails its signature check so we can intelligently back off in the future.
11 changes: 10 additions & 1 deletion synapse/federation/federation_base.py
@@ -80,7 +80,13 @@ async def _check_sigs_and_hash(
             InvalidEventSignatureError if the signature check failed. Nothing
                 will be logged in this case.
         """
-        await _check_sigs_on_pdu(self.keyring, room_version, pdu)
+        try:
+            await _check_sigs_on_pdu(self.keyring, room_version, pdu)
+        except Exception as exc:
Contributor

This except clause seems too broad; I'd worry it might swallow errors arising from bugs rather than solely invalid signatures.

Can we narrow down the interesting types of Exceptions that we care about here?

Contributor

e.g. this will catch CancelledErrors which sounds dodgy (I don't know if cancellation is enabled for backfill, but it doesn't sound like a good reason to record a failed pull attempt)

Contributor Author

It looks like we can just worry about InvalidEventSignatureError here 👍

But all of the downstream places in _check_sigs_on_pdu do the same except Exception as e: 🤷
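
For illustration, a minimal sketch of what the narrowed clause could look like here, assuming InvalidEventSignatureError is what _check_sigs_on_pdu raises for bad signatures (not necessarily the final diff):

try:
    await _check_sigs_on_pdu(self.keyring, room_version, pdu)
except InvalidEventSignatureError as exc:
    # Only genuine signature failures get recorded as failed pull attempts;
    # unexpected bugs (and CancelledError) propagate without being tracked
    # against the event.
    await self.store.record_event_failed_pull_attempt(
        pdu.room_id, pdu.event_id, str(exc)
    )
    raise exc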

Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Whatever we decide here, we should also apply to #13814.

Is there a way we can make the errors we care about opt-out instead of opt-in? It seems like it's going to suck to keep some of these error lists up to date whenever the downstream functions change and raise different errors.

Maybe it's better to be opt-in and handle just what you know, but as an example, at the following spot we probably need to look at FederationError, RuntimeError, and InvalidResponseError from a quick glance at some of the downstream code, and this list seems like it will morph under us since there is so much going on in that block. Is there a good way to get a general sense of all the exception types that a code path can raise (it doesn't need to be comprehensive; maybe info just from docstrings)?

except Exception as exc:
    await self._store.record_event_failed_pull_attempt(
        event.room_id, event_id, str(exc)
    )
    raise exc

(I created a draft PR to track this point, #13969, and will update it with what we decide here.)

Contributor

I can't imagine opt-out is easier — I don't think RuntimeError, ValueError, AssertionError, KeyError etc should be included, for example. These represent programming errors rather than the remote not acting according to the way we expect.

Tracking the set of exceptions that some code can throw is difficult in Python, unfortunately.

I would suggest trying out some cases where you want to track a failed pull attempt, perhaps even adding tests that reproduce those cases. I can think of at least these classes of problems:

  • DNS errors
  • TCP-level Connection errors
  • TLS problems such as untrusted self-signed certs
  • Homeserver not on federation whitelist (? what should we do here?)
  • Bad response code
  • Response isn't JSON
  • JSON response isn't in an expected format (e.g. doesn't include the right keys)... bearing in mind that our validation is probably rubbish and this might currently manifest itself as a KeyError (IMO this should be fixed so it sounds more like a validation error)

I realise this sounds a bit painful but we ought to figure out what we should do for these cases. Should a network problem actually be tracked against the event at all; won't the existing backoff behaviour be sufficient there? ....

Perhaps, rather than accepting that the code underneath is always going to be messy, try to improve the downstream situation: can we define superclasses for some of these interesting groups of exceptions, e.g. a TemporaryNetworkError exception class, and make the relevant exceptions derive from those or get re-raised as those?
We can then add to the docstrings of the downstream functions which exceptions they're allowed to raise for these kinds of circumstances.
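
For illustration, one hypothetical shape of that idea (the class and function names below are invented for the sketch; none of them exist in Synapse):

class TemporaryNetworkError(Exception):
    """The remote was unreachable or the connection failed part-way through."""


class InvalidRemoteResponseError(Exception):
    """The remote answered, but the response was malformed or unexpected."""


async def fetch_from_remote(do_request, destination, path):
    # `do_request` stands in for whatever low-level federation call is used.
    # The point is only that low-level failures get re-raised as the grouped
    # types, so callers can catch one superclass instead of maintaining a
    # list of concrete exceptions.
    try:
        return await do_request(destination, path)
    except (ConnectionError, TimeoutError) as exc:
        raise TemporaryNetworkError(str(exc)) from exc
    except (KeyError, ValueError) as exc:
        raise InvalidRemoteResponseError(str(exc)) from exc

Callers that want to record a failed pull attempt could then catch just these grouped types, and the downstream docstrings only need to mention the superclasses.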

Just some thoughts; I realise it may not necessarily be easy.

Contributor Author

I don't think RuntimeError, ValueError, AssertionError, KeyError etc should be included, for example. These represent programming errors rather than the remote not acting according to the way we expect.

Fair enough on RuntimeError and friends. I saw this case during my quick scan and it seemed relevant (it's not), although if it were, it would be a good candidate for throwing a more appropriate error type 👍

I would suggest trying some cases of where you want to track a failed pull attempt, perhaps even adding tests that reproduce these cases

Having these tests would be great. We'd want to apply them to /event/{eventId}, /event_auth/{roomId}/{eventId}, /state/{roomId}?event_id=$xxx, and /state_ids/{roomId}?event_id=$xxx (although that's related to, and probably covered by, the existing backoff and NotRetryingDestination discussed in the next point below). For /backfill and /get_missing_events, we can't track server/network problems against a specific event because there is no specific event we're requesting, but we can track whether the events in their responses are rejected, etc.

Should a network problem actually be tracked against the event at all; won't the existing backoff behaviour be sufficient there? ....

In general, if anything goes wrong with trying to pull an event from another server, we should back off and do it in the background next time (#13623).

The existing backoff underlying all of the federation requests (matrixfederationclient.py and NotRetryingDestination) seems like it could be sufficient for network-related errors. I haven't seen network problems be the cause of /messages being slow, so I haven't put as much thought into this area.

Homeserver not on federation whitelist (? what should we do here?)

It seems like we shouldn't be asking homeservers that aren't on our whitelist in the first place, but I guess we just let FederationDeniedError handle this and continue to the next destination, which accomplishes the same thing anyway.


Thanks for the thoughtful response on this @reivilibre! Really helping me form the right thoughts around all of this!

+            await self.store.record_event_failed_pull_attempt(
+                pdu.room_id, pdu.event_id, str(exc)
+            )
+            raise exc

         if not check_event_content_hash(pdu):
             # let's try to distinguish between failures because the event was
@@ -116,6 +122,9 @@ async def _check_sigs_and_hash(
                     "event_id": pdu.event_id,
                 }
             )
+            await self.store.record_event_failed_pull_attempt(
+                pdu.room_id, pdu.event_id, "Event content has been tampered with"
+            )
             return redacted_event
 
         spam_check = await self.spam_checker.check_event_for_spam(pdu)
Contributor Author

@MadLittleMods MadLittleMods Sep 14, 2022

In the future, we will potentially record a failed attempt when the spam checker soft-fails an event. Part of #13700
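
For illustration, a rough sketch of what that could look like, following the pattern this PR adds for the other failure cases (the NOT_SPAM comparison is approximate and this is not part of this PR):

spam_check = await self.spam_checker.check_event_for_spam(pdu)
if spam_check != self.spam_checker.NOT_SPAM:
    # Hypothetical: also record the soft-failure so the back-off logic can
    # avoid repeatedly re-pulling an event we would just soft-fail again.
    await self.store.record_event_failed_pull_attempt(
        pdu.room_id, pdu.event_id, "Event was soft-failed by the spam checker"
    )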
