Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up update_local_storage in sync #1024

Closed
wants to merge 1 commit into from
Closed

Speed up update_local_storage in sync #1024

wants to merge 1 commit into from

Conversation

rmol
Copy link
Contributor

@rmol rmol commented Mar 27, 2020

Description

Stop importing source GPG keys during sync. We have a place on Source to store fingerprint and key, which is all we need for the import, and which is what we check when deciding whether to enable the reply box, so let's just update the database here and defer import to the keyring until someone actually wants to reply. It takes milliseconds that won't be noticed then, but do add up during sync.

Avoid dirtying synced objects on update unless they've actually changed.

Use maps instead of sets to hold local objects in the update_ functions, so it's faster to check if we already have a record of incoming objects.

Reduce the number of database queries during sync: cache sources or journalists instead of looking them up for each incoming object associated with them.

Note that this stops short of switching to SQLAlchemy's bulk operations, so we may be able to speed update_local_storage up more when we can take the time to do that.

Fixes #1009.

Test Plan

  • Start the dev server with NUM_SOURCES=1000 (reduce key size to 1024 in securedrop/crypto_util.py line 123 to reduce startup time).
  • Check out this branch.
  • Run the client with LOGLEVEL=debug ./run.sh --sdc-home ~/.securedrop_client
  • The source list should populate within about 30 seconds, and you should be able to reply to sources even though you can't yet read their messages or previous replies, so go ahead and test that.
  • Optionally, check the durations of update_local_storage operations with grep duration ~/.securedrop_client/logs/client.log. None should take longer than about seven seconds with 1000 sources, and most should be quite a bit faster than that.

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs or network (via the RPC service) traffic, Qubes testing in the staging environment is required. For fine tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:

  • I have tested these changes in the appropriate Qubes environment
  • I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
  • These changes should not need testing in Qubes

If these changes add or remove files other than client code, packaging logic (e.g., the AppArmor profile) may need to be updated. Please check as applicable:

  • I have submitted a separate PR to the packaging repo
  • No update to the packaging logic (e.g., AppArmor profile) is required for these changes
  • I don't know and would appreciate guidance

Stop importing source GPG keys during sync. We have a place on Source
to store fingerprint and key, which is all we need for the import, and
which is what we check when deciding whether to enable the reply box,
so let's just update the database here and defer import to the keyring
until someone actually wants to reply. It takes milliseconds that
won't be noticed then, but do add up during sync.

Avoid dirtying synced objects on update unless they've actually
changed.

Use maps instead of sets to hold local objects in the update_
functions, so it's faster to check if we already have a record of
incoming objects.

Reduce the number of database queries during sync: cache sources or
journalists instead of looking them up for each incoming object
associated with them.
@sssoleileraaa
Copy link
Contributor

Nice changes! Could you break this up into separate PRs for each significant change to make it easier to review and test to make sure there are no regressions? Here's what I was thinking (no particular order, but we should start with the biggest performance improvement first):

PR 1:

Stop importing source GPG keys during sync. We have a place on Source to store fingerprint and key, which is all we need for the import, and which is what we check when deciding whether to enable the reply box, so let's just update the database here and defer import to the keyring until someone actually wants to reply. It takes milliseconds that won't be noticed then, but do add up during sync.

PR 2:

Avoid dirtying synced objects on update unless they've actually changed.

PR 3:

Use maps instead of sets to hold local objects in the update_ functions, so it's faster to check if we already have a record of incoming objects.

PR 4:

Reduce the number of database queries during sync: cache sources or journalists instead of looking them up for each incoming object associated with them.

PR 5:

Note that this stops short of switching to SQLAlchemy's bulk operations, so we may be able to speed update_local_storage up more when we can take the time to do that.

@sssoleileraaa
Copy link
Contributor

sssoleileraaa commented Mar 30, 2020

The above is just an idea of how to split it. My guess is that PR 1 will be the most significant improvement but will also require the most code review and testing.

@rmol
Copy link
Contributor Author

rmol commented Mar 31, 2020

Closing in favor of #1034, #1035, #1036, and #1037, which should be reviewed and merged in that order. I'll rebase each in turn to reduce the diffs.

The progression of update_local_storage execution time in my testing with 1000 sources looked like this:

#1034 (add-chronometer, which should perform like master)

Function Duration (seconds)
update_sources duration 93.5126
update_files duration 0.0166
update_messages duration 8.5065
update_replies duration 19.6221

#1035 (defer-source-key-import)

Function Duration (seconds)
update_sources duration 21.4783
update_files duration 0.0125
update_messages duration 9.3966
update_replies duration 20.2071

#1036 (minimize-sync-db-work)

Function Duration (seconds)
update_sources duration 0.6821
update_files duration 0.0049
update_messages duration 4.8876
update_replies duration 10.5937

#1037 (minimize-sync-lookups)

Function Duration (seconds)
update_sources duration 0.5708
update_files duration 0.0046
update_messages duration 2.0827
update_replies duration 6.9603

@rmol rmol closed this Mar 31, 2020
@rmol rmol deleted the speedsync branch March 31, 2020 20:43
@rmol rmol mentioned this pull request Apr 1, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

investigate how to speed up update_sources
2 participants