Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Aug 3, 2022

Part of #13356

Combine:

So that I can run against Complement federation tests and see if there is more to add @trace to in the federation stack of things when /messages happens.

Optimization ideas

We load a lot of state (from 2. in #13356)

In #matrixhq there are 40k current members and I assume get_current_state is the root cause why we Loaded 79277 events (seems like that took 17s too). We only call get_current_state in order to get a list of likely domains to backfill from.

We could optimize this by:

  1. Cache get_domains_from_state so we don't have to get_current_state as much
  2. Use a smarter heuristic to get the first couple domains to try to fetch from and do the get_domains_from_state in the background so it's ready by the time we fail with the first couple of domains.

Skip backfill

Skip backfill or kick it off in the background if it's not our first time and we have enough events.

We don't want to get stuck on the same unfetchable event over and over.

Why is /state_ids slow to respond?

We can't control every bad network effect but maybe Synapse is slow to assemble a /state_ids reponse 🤔 Need to investigate

FederationStateIdsServlet

We should only care about auth_event_ids

We should only care about getting the event_id and auth_event_ids in _get_state_ids_after_missing_prev_event(...)

We shouldn't factor state_event_ids into whether

Dev notes

TEST_ONLY_IGNORE_POETRY_LOCKFILE=1 TEST_ONLY_SKIP_DEP_HASH_VERIFICATION=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestMessagesOverFederation -p 1
TEST_ONLY_IGNORE_POETRY_LOCKFILE=1 TEST_ONLY_SKIP_DEP_HASH_VERIFICATION=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestImportHistoricalMessages/parallel/Federation/Backfill_still_works_after_many_batches_are_imported -p 1
poetry run synapse_homeserver --config-path homeserver.yaml

Jaeger max duration spans

213503982d 8h, see #13440 (comment)

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@MadLittleMods MadLittleMods added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Aug 3, 2022
Comment on lines +201 to +203
# It does not seem like the agent can keep up with the massive UDP load
# (1065 spans in one trace) so lets just use the HTTP collector endpoint
# instead which seems to work.
Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this is the case? I was seeing this same behavior with the Jaeger opentracing stuff. Is the UDP connection being over saturated? Can the Jaeger agent in Docker not keep up? We see some spans come over but never the main servlet overarching one that is probably the last to be exported.

But using the HTTP Jaeger collector endpoint seems to work fine for getting the whole trace.

…ittlemods/13356-messages-investigation-scratch-v1

Conflicts:
	pyproject.toml
	synapse/logging/tracing.py
MadLittleMods added a commit that referenced this pull request Aug 16, 2022
Instrument the federation/backfill part of `/messages` so it's easier to follow what's going on in Jaeger when viewing a trace.

Split out from #13440

Follow-up from #13368

Part of #13356
…ittlemods/13356-messages-investigation-scratch-v1

Conflicts:
	synapse/federation/federation_client.py
	synapse/handlers/federation.py
	synapse/handlers/federation_event.py
	synapse/logging/tracing.py
	synapse/storage/controllers/persist_events.py
	synapse/storage/controllers/state.py
	synapse/storage/databases/main/events_worker.py
	synapse/util/ratelimitutils.py
…ittlemods/13356-messages-investigation-scratch-v1

Conflicts:
	poetry.lock
	synapse/handlers/federation.py
…ittlemods/13356-messages-investigation-scratch-v1
@clokep
Copy link
Member

clokep commented Oct 19, 2022

@MadLittleMods Is this useful or have you gleaned everything you can from it?

@MadLittleMods
Copy link
Contributor Author

@clokep It's not as useful for its original purpose. But it's being depended on by #13864 which I have used recently

This PR is useful because it is based on #13400 which has the OpenTelemetry changes. And this PR adds the Docker Complement changes to see traces from the tests

…ittlemods/13356-messages-investigation-scratch-v1

Conflicts:
	synapse/handlers/federation.py
	synapse/handlers/relations.py
@MadLittleMods MadLittleMods added the A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) label Apr 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants