Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Limit the number of EDUs in transactions to 100 as expected by receiver #5138

Merged
merged 7 commits into from
May 9, 2019

Conversation

superboum
Copy link
Contributor

@superboum superboum commented May 4, 2019

Goal

I have observed this problem on my server:

2019-05-04 06:16:57,274 - synapse.federation.sender.transaction_manager - 87 - INFO - federation_transaction_transmission_loop-678370- TX [matrix.livda.fr] {1556878459543} Sending transaction [1556878459543], (PDUs: 0, EDUs: 103)
2019-05-04 06:16:57,338 - synapse.http.matrixfederationclient - 356 - INFO - federation_transaction_transmission_loop-678370- {PUT-O-448448} [matrix.livda.fr] Sending request: PUT matrix://matrix.livda.fr/_matrix/federation/v1/send/1556878459543; timeout 60.000000s
2019-05-04 06:16:57,339 - synapse.http.federation.matrix_federation_agent - 373 - INFO - federation_transaction_transmission_loop-678370- Connecting to matrix.livda.fr:443
[...]
2019-05-04 06:17:08,409 - synapse.http.matrixfederationclient - 388 - INFO - federation_transaction_transmission_loop-678370- {PUT-O-448448} [matrix.livda.fr] Got response headers: 400 Bad Request
2019-05-04 06:17:08,410 - synapse.http.matrixfederationclient - 472 - WARNING - federation_transaction_transmission_loop-678370- {PUT-O-448448} [matrix.livda.fr] Request failed: PUT matrix://matrix.livda.fr/_matrix/federation/v1/send/1556878459543: HttpResponseException("400: b'Bad Request'",)
2019-05-04 06:17:08,411 - synapse.federation.sender.transaction_manager - 123 - INFO - federation_transaction_transmission_loop-678370- TX [matrix.livda.fr] {1556878459543} got 400 response

The corresponding issue seems to be Synapse is not limiting the number of EDUs in a transaction #3951.
It is a blocker since Reject large transactions on federation #4513 as it blocks the federation between two servers.
In my case, the bug was caused by device_message_edus that was containing 103 entries.

I don't know synapse internals and just wrote this quick fix to bring back federations. I might have written some wrong/dangerous things.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file
  • Pull request includes a sign off

superboum added 2 commits May 4, 2019 17:26
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
@superboum superboum marked this pull request as ready for review May 4, 2019 15:36
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! A couple of thoughts below.

@@ -217,17 +217,21 @@ def _transaction_transmission_loop(self):
pending_edus = []

pending_edus.extend(self._get_rr_edus(force_flush=False))
pending_edus.extend(device_message_edus[:99])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can safely discard the other device_message_edus ?

pending_edus.extend(self._pop_pending_edus(100 - len(pending_edus)))
# But we should keep one slot free for presence
to_remove = []
for key, val in self._pending_edus_keyed.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be clearer written as a while loop:

while len(pending_edus) < 100:
    _, val = self._pending_edus_keyed.popitem()
    pending_edus.append(val)

@superboum
Copy link
Contributor Author

superboum commented May 7, 2019

Thanks for your reply. I will work on it tonight.
However, feel free to directly edit the PR if you need to apply this fix quickly.

Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id
last_device_list = self._last_device_list_stream_id
now_stream_id, results = yield self._store.get_devices_by_remote(
Copy link
Contributor Author

@superboum superboum May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that get_devices_by_remote has a hard coded limit of 20 elements.
So we start with this request to be able to complete with the following one (this is why I swapped get_devices_by_remote and get_new_device_msgs_for_remote).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. I think it would be helpful to include a comment that says get_devices_by_remote will return no more than 20 entries.

Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
last_device_stream_id = self._last_device_stream_id
to_device_stream_id = self._store.get_to_device_stream_token()
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id, 98 - len(edus)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can set a limit to fill the edus list. I chose 98 instead of 100 as presence and rr_edus could both take one slot after that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, rather than hardcode 98 here, it would be clearer to make _get_new_device_messages take a limit parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: the default value for the limit parameter on get_new_device_msgs_for_remote is now redundant. I suggest removing the default value for clarity.

@superboum
Copy link
Contributor Author

I have 4 failing unit tests for now (tox -e py37) :

tests.test_visibility.FilterEventsForServerTestCase.test_large_room
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/federation/test_federation_sender.py", line 46, in test_send_receipts
    mock_send_transaction.assert_called_once()
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 918, in assert_called_once
    raise AssertionError(msg)
builtins.AssertionError: Expected 'send_transaction' to have been called once. Called 0 times.

tests.federation.test_federation_sender.FederationSenderTestCases.test_send_receipts
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/federation/test_federation_sender.py", line 81, in test_send_receipts_with_backoff
    mock_send_transaction.assert_called_once()
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 918, in assert_called_once
    raise AssertionError(msg)
builtins.AssertionError: Expected 'send_transaction' to have been called once. Called 0 times.

tests.federation.test_federation_sender.FederationSenderTestCases.test_send_receipts_with_backoff
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/handlers/test_typing.py", line 195, in test_started_typing_remote_send
    try_trailing_slash_on_400=True,
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 955, in assert_called_once_with
    raise AssertionError(msg)
builtins.AssertionError: Expected 'put_json' to be called once. Called 0 times.

tests.handlers.test_typing.TypingNotificationsTestCase.test_started_typing_remote_send
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/handlers/test_typing.py", line 273, in test_stopped_typing
    try_trailing_slash_on_400=True,
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 955, in assert_called_once_with
    raise AssertionError(msg)
builtins.AssertionError: Expected 'put_json' to be called once. Called 0 times.

tests.handlers.test_typing.TypingNotificationsTestCase.test_stopped_typing
-------------------------------------------------------------------------------
Ran 635 tests in 471.275s

FAILED (skips=1, errors=4, successes=630)

I don't know if its due to my modifications. I will check that tomorrow.

@richvdh richvdh self-requested a review May 8, 2019 06:38
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great. Thanks again for your work on it. Suggestion below which I think will fix the test failures, and a few other suggestions for minor cleanups.

pending_edus.extend(device_message_edus)
pending_edus.extend(self._pop_pending_edus(100 - len(pending_edus)))
while len(pending_edus) < 100:
_, val = self._pending_edus_keyed.popitem()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From _trial_temp/test.log when running the tests:

        Traceback (most recent call last):
          File "/home/rav/work/synapse/synapse/federation/sender/per_destination_queue.py", line 244, in _transaction_transmission_loop
            _, val = self._pending_edus_keyed.popitem()
        KeyError: 'popitem(): dictionary is empty'

Apologies, this was my fault. I suggest adding a check that _pending_edus_keyed is not empty to the while condition.

@@ -248,6 +238,12 @@ def _transaction_transmission_loop(self):
)
)

pending_edus.extend(device_message_edus)
pending_edus.extend(self._pop_pending_edus(100 - len(pending_edus)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this 100 number is starting to appear in a lot of places. Could I suggest creating a constant MAX_EDUS_PER_TRANSACTION in this file?

contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id
last_device_list = self._last_device_list_stream_id
now_stream_id, results = yield self._store.get_devices_by_remote(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. I think it would be helpful to include a comment that says get_devices_by_remote will return no more than 20 entries.

last_device_stream_id = self._last_device_stream_id
to_device_stream_id = self._store.get_to_device_stream_token()
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id, 98 - len(edus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, rather than hardcode 98 here, it would be clearer to make _get_new_device_messages take a limit parameter.

last_device_stream_id = self._last_device_stream_id
to_device_stream_id = self._store.get_to_device_stream_token()
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id, 98 - len(edus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: the default value for the limit parameter on get_new_device_msgs_for_remote is now redundant. I suggest removing the default value for clarity.

Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
@superboum
Copy link
Contributor Author

superboum commented May 8, 2019

Thanks a lot for your extensive and precise review. Now tests are passing locally.

  • I added a check for self._pending_edus_keyed directly in the while condition
  • I created a constant as suggested and replaced the hardcoded value by the constant.
  • I added the comment to mention that it will not return more than 20 entries (as I am not very familiar with the code, I am not 100% sure it is the case, so if you have any doubt, we should check before merging)
  • I added a limit parameter to _get_new_device_messages instead of hardcoding 98. I call this function with MAX_EDUS_PER_TRANSACTION - 2 and added a comment to explain the - 2 part.
  • I removed the default value for the limit

However I still have a concern, the limit passed to _get_new_device_messages will be enforced only if limit >= 20 : as the first call is hardcoded to 20, it may return 20 values whatever the limit value is. Even worse, it could result in a negative limit in the second call. Can we suppose that it will not be called with a value less than 20? Should we add a check?

@richvdh
Copy link
Member

richvdh commented May 8, 2019

However I still have a concern, the limit passed to _get_new_device_messages will be enforced only if limit >= 20 : as the first call is hardcoded to 20, it may return 20 values whatever the limit value is. Even worse, it could result in a negative limit in the second call. Can we suppose that it will not be called with a value less than 20? Should we add a check?

well, given we've just hardcoded it to be called with 98, I think it's a valid assumption for now.

I think an easy way to sanity-check this would be to add:

assert len(edus) <= limit, "get_devices_by_remote returned too many EDUs"

@richvdh
Copy link
Member

richvdh commented May 8, 2019

as I am not very familiar with the code, I am not 100% sure it is the case, so if you have any doubt, we should check before

[I checked the code. I am satisfied that it will never return more than 20 rows, however I did discover #5153. That is a separate problem, though...]

@superboum
Copy link
Contributor Author

To be sure, is the PR blocked on your side (still some review to do, internal process, etc.) or on my side (you expect some modifications, like the assert one even if the assumption is valid for now)?

@richvdh
Copy link
Member

richvdh commented May 8, 2019

If you could add the assertion that would be great.

…_devices_by_remote

Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@richvdh richvdh merged commit 11ea167 into matrix-org:develop May 9, 2019
anoadragon453 added a commit that referenced this pull request May 10, 2019
* develop: (45 commits)
  URL preview blacklisting fixes (#5155)
  Revert 085ae34
  Add a DUMMY stage to captcha-only registration flow
  Make Prometheus snippet less confusing on the metrics collection doc (#4288)
  Set syslog identifiers in systemd units (#5023)
  Run Black on the tests again (#5170)
  Add AllowEncodedSlashes to apache (#5068)
  remove instructions for jessie installation (#5164)
  Run `black` on per_destination_queue
  Limit the number of EDUs in transactions to 100 as expected by receiver (#5138)
  Fix bogus imports in tests (#5154)
  add options to require an access_token to GET /profile and /publicRooms on CS API (#5083)
  Do checks on aliases for incoming m.room.aliases events (#5128)
  Remove the requirement to authenticate for /admin/server_version. (#5122)
  Fix spelling in server notices admin API docs (#5142)
  Fix sample config
  0.99.3.2
  include disco in deb build target list
  changelog
  Debian: we now need libpq-dev.
  ...
anoadragon453 added a commit that referenced this pull request May 10, 2019
* develop:
  Revert 085ae34
  Add a DUMMY stage to captcha-only registration flow
  Make Prometheus snippet less confusing on the metrics collection doc (#4288)
  Set syslog identifiers in systemd units (#5023)
  Run Black on the tests again (#5170)
  Add AllowEncodedSlashes to apache (#5068)
  remove instructions for jessie installation (#5164)
  Run `black` on per_destination_queue
  Limit the number of EDUs in transactions to 100 as expected by receiver (#5138)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants