Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Only try to backfill event if we haven't tried before recently #13622

Closed
Tracked by #14284
MadLittleMods opened this issue Aug 25, 2022 · 0 comments · Fixed by #13635
Closed
Tracked by #14284

Only try to backfill event if we haven't tried before recently #13622

MadLittleMods opened this issue Aug 25, 2022 · 0 comments · Fixed by #13635
Assignees
Labels
A-Federation A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@MadLittleMods
Copy link
Contributor

MadLittleMods commented Aug 25, 2022

Only try to backfill event if we haven't tried before recently. No need to keep trying the same backfill point that fails over and over. Perhaps an exponential back-off strategy.

Follow-up to #13621

Part of making /messages faster: #13356

Related to:

@MadLittleMods MadLittleMods added A-Federation A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) labels Aug 25, 2022
@DMRobertson DMRobertson added S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. A-Performance A-Performance Performance, both client-facing and admin-facing and removed A-Performance labels Aug 25, 2022
@MadLittleMods MadLittleMods self-assigned this Aug 26, 2022
MadLittleMods added a commit that referenced this issue Sep 14, 2022
We can follow-up this PR with:

 1. Only try to backfill from an event if we haven't tried recently -> #13622
 1. When we decide to backfill that event again, process it in the background so it doesn't block and make `/messages` slow when we know it will probably fail again -> #13623
 1. Generally track failures everywhere we try and fail to pull an event over federation -> #13700

Fix #13621

Part of #13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.qv7cj51sv9i5)
MadLittleMods added a commit that referenced this issue Sep 15, 2022
…e is invalid

Related to
 - #13622
    - #13635
 - #13676

Follow-up to #13815
which tracks event signature failures.

This PR aims to stop us from trying to
`_get_state_ids_after_missing_prev_event` after
we already know that the prev_event will fail
from a previous attempt

To explain an exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check

With this PR, we no longer call the burdensome `_get_state_ids_after_missing_prev_event`
because the signature failure will count as an attempt before we try to run this.
MadLittleMods added a commit that referenced this issue Sep 23, 2022
Only try to backfill event if we haven't tried before recently (exponential backoff). No need to keep trying the same backfill point that fails over and over.

Fix #13622
Fix #8451

Follow-up to #13589

Part of #13356
MadLittleMods added a commit that referenced this issue Oct 15, 2022
…ure is invalid (#13816)

While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to
 - #13622
    - #13635
 - #13676

Part of #13356

Follow-up to #13815 which tracks event signature failures.

With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now.

For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
MadLittleMods added a commit that referenced this issue May 12, 2023
Fix #13623

Follow-up to #13621 and #13622

Part of making `/messages` faster: #13356
MadLittleMods added a commit that referenced this issue May 25, 2023
Process previously failed backfill events in the background because they are bound to fail again and we don't need to waste time holding up the request for something that is bound to fail again.

Fix #13623

Follow-up to #13621 and #13622

Part of making `/messages` faster: #13356
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
2 participants