Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure UI remains responsive at all times during sync with server. #733

Merged
merged 1 commit into from
Jan 29, 2020

Conversation

ntoll
Copy link
Contributor

@ntoll ntoll commented Jan 27, 2020

Description

Fixes #716

DO NOT MERGE this is a work in progress while further investigation takes place.

The initial removal of update_sources on message download ensures the UI remains responsive.

However, given a server with 200 sources, while the UI is now responsive, I no longer see the results I was able to see with far fewer sources (50) in previous tests. The server activity looks like this:

172.17.0.1 - - [27/Jan/2020 11:51:51] "POST /api/v1/token HTTP/1.1" 200 -
172.17.0.1 - - [27/Jan/2020 11:52:20] "GET /api/v1/sources HTTP/1.1" 200 -
172.17.0.1 - - [27/Jan/2020 11:52:43] "GET /api/v1/sources HTTP/1.1" 200 -
2020-01-27 11:52:45,411 INFO Clearing shredder
2020-01-27 11:52:45,411 INFO Files to delete: 0

Which suggests something is pinging the API every 20 seconds or so. Not sure what though. I also notice from top that the client is taking up around 50% of my CPU and spawning lots of gpg2 related processes. After a while the client (responsively!) reports "The SecureDrop server cannot be reached".

As I understand it, we download everything in one go. Is there an API call we can make to get just the sources with their latest message id and just download them instead. Upon clicking the source, if not already downloaded, perhaps only the messages for the specific source could be downloaded and processed..?

In any case, the end result is that the client responsively hangs and takes up CPU resources when there are a large number of sources. When there are only 50 the application appears to work fine except the latest-message preview in the source list is never updated (which can be coded around, see screenie below).

50_sources

My gut feeling is while the UI is now responsive, this highlights yet another performance problem -- i.e. the processing of large numbers of sources and their associated conversations.

Thoughts welcome.

Test Plan

Currently, keep poking with a stick and ask more questions.

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs or network (via the RPC service) traffic, Qubes testing in the staging environment is required. For fine tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:

  • I have tested these changes in the appropriate Qubes environment
  • I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
  • These changes should not need testing in Qubes

If these changes add or remove files other than client code, packaging logic (e.g., the AppArmor profile) may need to be updated. Please check as applicable:

  • I have submitted a separate PR to the packaging repo
  • No update to the packaging logic (e.g., AppArmor profile) is required for these changes
  • I don't know and would appreciate guidance

@redshiftzero
Copy link
Contributor

the client is taking up around 50% of my CPU and spawning lots of gpg2 related processes

this is almost certainly due to this logic: after sync, we attempt to re-import every key for every source (from the server). If we rework the logic on the client-side to only import keys for sources that do not yet have keys (to do in a followup) then that should significantly reduce CPU usage

@sssoleileraaa
Copy link
Contributor

sssoleileraaa commented Jan 27, 2020

The initial removal of update_sources on message download ensures the UI remains responsive.

So currently we call update_sources, which gets the sources from the local database and updates the GUI source list, in several locations:

  • on_message_download_success
  • on_file_download_success
  • on_delete_source_success
  • on_authenticate_success
  • on_sync_success
  • login_offline_mode

Since we want to display new messages and replies and new sources on-the-fly, we refresh the source list every time we download something new that needs to be displayed.

I think we should think about removing this call from more than just the on_message_download_success method. We only need to update the source list after a successful metadata sync, because that's when the local database will have all the new sources. We also don't need to update the source list if there isn't a new source.

However, given a server with 200 sources, while the UI is now responsive, I no longer see the results I was able to see with far fewer sources (50) in previous tests. The server activity looks like this:

It would makes sense why removing the GUI update of the source list for each new message would speed this up.

172.17.0.1 - - [27/Jan/2020 11:51:51] "POST /api/v1/token HTTP/1.1" 200 -
172.17.0.1 - - [27/Jan/2020 11:52:20] "GET /api/v1/sources HTTP/1.1" 200 -
172.17.0.1 - - [27/Jan/2020 11:52:43] "GET /api/v1/sources HTTP/1.1" 200 -
2020-01-27 11:52:45,411 INFO Clearing shredder
2020-01-27 11:52:45,411 INFO Files to delete: 0

Which suggests something is pinging the API every 20 seconds or so. Not sure what though. I also notice from top that the client is taking up around 50% of my CPU and spawning lots of gpg2 related processes. After a while the client (responsively!) reports "The SecureDrop server cannot be reached".

so currently we call sync_api in a few more places (the big removal of sync_api is not yet fisnished). I am not seeing 20-second interval metadata syncs unless I change this timer to be 20 seconds: https://github.com/freedomofpress/securedrop-client/blob/master/securedrop_client/logic.py#L227

One way to see more frequent API calls is if there is a failure of a queue job, which will retry. The default timeout is 20 seconds for messages and replies, so perhaps there was so retrying happening.

As I understand it, we download everything in one go. Is there an API call we can make to get just the sources with their latest message id and just download them instead. Upon clicking the source, if not already downloaded, perhaps only the messages for the specific source could be downloaded and processed..?

This is something we have talked about and could do on the server-side, which seems ideal, but until then we could improve the client side of things by only updating the GUI source list after we get all the sources, not whenever there is a new message or reply (because we will already have the source for that message in the source list (recall that we update the source list after a metadata sync).

In any case, the end result is that the client responsively hangs and takes up CPU resources when there are a large number of sources. When there are only 50 the application appears to work fine except the latest-message preview in the source list is never updated (which can be coded around, see screenie below).

I do believe changing the code so that we only call update_sources in on_sync_success is a good optimization. So what you're doing makes sense! I think we should just go further and remove it in other locations and possibly add logic to only update the source list if there are new sources (we could do this in a separate PR as well).

@sssoleileraaa
Copy link
Contributor

Since we want to display new messages and replies and new sources on-the-fly, we refresh the source list every time we download something new that needs to be displayed.

Note: that was the thinking behind this but actually it's unnecessary as you've discovered @ntoll

@rmol
Copy link
Contributor

rmol commented Jan 28, 2020

@ntoll I tried with 200 sources and the GPG imports only took a few seconds, and weren't killing my machine, but this branch skips the import for keys we already have. It's based on this branch (unfreeze-ui) so hopefully it'll be easy to try. If it helps, I'll add tests and finish it up.

@ntoll
Copy link
Contributor Author

ntoll commented Jan 28, 2020

@rmol @creviera OK... I've merged you skip-gpg branch and removed the update_sources from various places. I'm about to embark on testing against a 200-source version of the server and checking the various aspects of the behaviour I've inevitably changed. I also want to update a few other things to make sure that when new messages arrive the source preview is updated properly.

@ntoll
Copy link
Contributor Author

ntoll commented Jan 28, 2020

This has been an interesting "building-a-house-of-cards" like exercise to get to this state: work slowly, carefully, but one wrong move and it all falls down. All the tests were done against a server with 200 sources.

This is currently still a WiP but I feel I've made good progress both in terms of diagnosing and understanding the problem from the client side of things and then coming up with a fix.

The retries after 20 seconds that I noticed were because the MetadataJob had an API call timeout of 20 seconds which wasn't long enough for the file to download (on my machine). I'm not sure why this is the case on my machine, but the side effect is that the job would always fail (time-out) and then try again and I'd be stuck in that state of never completed syncs. This was what was causing the CPU load. I've increased the timeout value and the sync successfully completes.

Then I noticed that while the UI was responsive in that it wasn't frozen, it didn't look as if it were doing anything while the sources were being downloaded. So I have bodged an animation together which will definitely need the magical super powers of @ninavizz to provide a 16x16 looping animated gif of an indicator to show the client is doing something. I also ensured that the status message was updated to indicate things were "in flight". This, too, needs checking in terms of UI / messaging. However, and please remember this is a temporary UI bodge, this vastly improves the UX in terms of responsiveness.

sync

As you can see, the UI is always responsive.

In addition I also managed to crash the client several times and eventually found that was due to some faulty login in the updated queue.py file. Since I've been on a bug hunt, I've also cleaned this problem up (see the diff for details... @creviera I'd love some feedback there please, 'cos you're the expert on this process and I'm painfully aware I might have missed something). Basically the bug was when a job was re-added to the metadata_queue the queue was Full and things broke.

Finally, as messages, files and replies are downloaded from the server the source list "preview last message" also needs updating. This is covered in another ticket (#707) but it feels related to all this work. I won't fix it here but once this is merged I want to nail that down now that I've become quite familiar with the code I'll be changing through the work done in this PR. ;-)

I'd like feedback, poking with a stick and comments about the content of this PR (remember it's got @rmol's branch for gpg key import improvements merged in too).

My aim is, assuming folks don't find gotchas, to get this PR revised with comprehensive unit tests (the work you see so far is the result of slow/careful exploration so is a bit like a "spike" at the moment) and incorporate an appropriate spinner from @ninavizz.

Phew.

@ntoll
Copy link
Contributor Author

ntoll commented Jan 28, 2020

OK... I need to revisit this work in light of #732

Copy link
Contributor

@sssoleileraaa sssoleileraaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you just need to add tests and remove code unrelated to the removal of unnecessary GUI updates and the client freeze issue.

other than that, i like this approach

@@ -53,7 +53,7 @@ def call_api(self, api_client: API, session: Session) -> Any:
# TODO: Once https://github.com/freedomofpress/securedrop-client/issues/648, we will want to
# pass the default request timeout to api calls instead of setting it on the api object
# directly.
api_client.default_request_timeout = 20
api_client.default_request_timeout = 120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you'll notice we have an issue for acquiring realistic timeouts for api calls here: #648

if you read through john's findings you'll see that 20 seconds is cutting it close for get_sources when there's 200, but it looks to be more than enough time for the other API calls made during a sync: get_submissions and get_all_replies

eventually we want to be able to scale these timeouts the way we do with file downloads but for now, using a default of 20 seconds seems like plenty of time for most cases except when there are 200+ sources for get_sources

Unfortunately we don't yet support passing a different default timeout to just the get_sources api method as you can see from my comment above:

        # TODO: Once https://github.com/freedomofpress/securedrop-client/issues/648, we will want to
        # pass the default request timeout to api calls instead of setting it on the api object
        # directly.

i think until we can vary our timeouts per request and based on size of the request, we should use the largest measurement from john's profiling and mutliple it by 1.5 or 2, so I think 30-40 second timeouts should be enough, right?

Copy link
Member

@eloquence eloquence Jan 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if @ntoll tested with freedomofpress/securedrop#5100 in place on the server which should already significantly improve performance (the cache gets primed on app server start)

@@ -215,6 +215,9 @@ def __init__(self):

self.active = False

self.download_animation = load_movie("sync_animation.gif")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed during standup you can just remove all these widget.py changes and focus on optimizing the UI so that it doesn't freeze when there are many sources by removing all the unnecessary update_sources calls and increasing the default timeout a bit.

@@ -103,7 +103,12 @@ def re_add_job(self, job: ApiJob) -> None:
'''
job.remaining_attempts = DEFAULT_NUM_ATTEMPTS
priority = self.JOB_PRIORITIES[type(job)]
self.queue.put_nowait((priority, job))
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is out of scope and can be made into a separate pr, right? or was this needed for optimizing UI during a sync? also see how we are moving towards a continuous background sync that only ever runs one sync at a time: #739

@ntoll
Copy link
Contributor Author

ntoll commented Jan 29, 2020

OK... I've extracted just the essential aspect of this fix, rebased/reset things and wrapped it all up into a neat single commit. I'll add another PR to address the queue related bugfix.

NOTE: this branch does not include @rmol's gpg key based branch any more.

Copy link
Contributor

@sssoleileraaa sssoleileraaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@sssoleileraaa sssoleileraaa merged commit 11ad0b1 into freedomofpress:master Jan 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Application totally frozen with large number of sources
5 participants