
Client retries failed jobs after logging out/in without restart #1420

Closed
eloquence opened this issue Feb 9, 2022 · 22 comments · Fixed by #1486

@eloquence
Member

Description

Tested against a 2.1.0 server with client 0.5.1.

Steps to Reproduce

  1. Create a new source
  2. Log into the client and wait for it to sync
  3. Send a reply and wait for it to send (optional, but this makes the issue very visible)
  4. Disconnect from the network
  5. Send a reply
  6. Click "Sign out"
  7. Reconnect
  8. Sign in and wait for sync

Expected Behavior

  • On step 6, reply transitions to "Failed to send"
  • On step 8, reply stays in "Failed to send" (a reply failure is considered final for now)

Actual Behavior

  • On step 6, reply transitions to "Failed to send" as expected
  • On step 8, reply sends.

Logs

https://gist.github.com/eloquence/b502702021bf1bfd10099fb778ba3859 (0.5.1)

@eloquence eloquence added the bug Something isn't working label Feb 9, 2022
@eloquence
Member Author

(I've not tagged this QA because it does not appear to be a recent regression.)

@eloquence
Member Author

eloquence commented Feb 9, 2022

I should note that I've observed at least two other outcomes when testing this repeatedly:

  • Unexpected "Your session expired" errors (no user is being deleted here, so logging back in should work without an error)
  • A reply in a hybrid "sent/failed" state: it had the "Failed to send" indicator, but was colored blue (the signed-in user color), not red.

Which is to say that you may have to follow the STR repeatedly to see this exact result, and there may be multiple possible race-condition outcomes resulting from an unclean state at the time of (re-)login. This may also help explain the inconsistent results reported by Ro in #1415 (comment).

@rocodes
Contributor

rocodes commented Feb 9, 2022

Environment: servers @ 2.1.0
client @ 0.5.1

Confirming that the STR results in sent replies upon reconnection; here are more details.

  • Case 1: Replies sent after disconnecting from the network are retried upon reconnection (sample log). Note that the logout generates a logout-failure warning (because the logout happens while offline). Replies were sent despite not signing in again.

Update: I just tested this by sending replies every 5 seconds for 30-35 seconds while disconnected from the network, then logging out of the client. When reconnecting the network, I was not yet signed in, but I watched all the failed replies get sent.

[...]
2022-02-09 11:59:40,173 - securedrop_client.logic:652(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
2022-02-09 11:59:53,673 - securedrop_client.logic:652(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
2022-02-09 11:59:53,968 - securedrop_client.logic:126(call_api) ERROR: Cannot connect to the server.
2022-02-09 11:59:53,969 - root:1138(on_logout_failure) INFO: Client logout failure
2022-02-09 12:00:09,070 - securedrop_client.logic:1115(on_reply_success) INFO: f405e1da-1cb8-43c3-bc76-e5e834bd54aa sent successfully
  • Case 2: Different results depending on when the draft was created. I sent 2 replies immediately after disconnecting from the network, and sent a third 30 seconds later. Upon reconnecting and logging back in, the first two replies were re-sent as the user; the last remained in the 'failed' state.

I did not observe the mixed UI state (failed indicator + signed-in user colour) or the 'session expired' error, but this whole area needs more testing, so I'm going to pause on trying to repro for now; we can look at all the edge cases in more detail and more methodically.

@sssoleileraaa
Contributor

#1429 is also high priority, and both will need to be addressed in the next release.

@cfm cfm self-assigned this Feb 17, 2022
@rocodes
Contributor

rocodes commented Feb 22, 2022

A couple of observations/notes that may be of help:

  • You can reproduce this without completing step 8 (signing in again): if you just kick the network, then "log out" (the API logout call will fail, but the GUI logout will happen and the reply will be converted to a failed reply), then reconnect the network, you can observe from offline mode the 'failed' reply being sent even before re-login or sync 😬 (logs below)
2022-02-22 12:29:21,337 - securedrop_client.storage:204(chronometer) INFO: update_sources duration: 0.0165s
2022-02-22 12:29:21,352 - securedrop_client.storage:204(chronometer) INFO: update_files duration: 0.0142s
2022-02-22 12:29:21,371 - securedrop_client.storage:204(chronometer) INFO: update_messages duration: 0.0188s
2022-02-22 12:29:21,414 - securedrop_client.storage:204(chronometer) INFO: update_replies duration: 0.0428s
2022-02-22 12:29:49,318 - securedrop_client.logic:126(call_api) ERROR: Cannot connect to the server.
2022-02-22 12:29:49,319 - root:1155(on_logout_failure) ERROR: Client logout failure
2022-02-22 12:30:23,867 - securedrop_client.logic:655(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
2022-02-22 12:30:25,470 - securedrop_client.logic:655(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
[...]
2022-02-22 12:31:25,020 - securedrop_client.logic:655(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
2022-02-22 12:31:26,465 - securedrop_client.logic:655(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
2022-02-22 12:31:27,981 - securedrop_client.logic:655(on_sync_failure) WARNING: sync failure: Cannot connect to the server.
2022-02-22 12:31:31,400 - securedrop_client.logic:1132(on_reply_success) INFO: 87c63574-1aea-49a2-96cd-1c11f2cb26a5 sent successfully
  • During the offline time, the draft reply does indeed make it into the draftreplies table, with send_status_id 2 (FAILED), which is what we want to see. (A quick way to check this is sketched after this list.)
  • When the GUI is closed + reopened, this issue is not reproduced. (So STR: log in, confirm sending works normally, disconnect the network, send a draft reply, log out and observe it transition to "failed", then close + reopen the client, reconnect the network, and log in.)
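To confirm the draftreplies observation, you can inspect the client database directly. This is only a sketch: the database path below is an assumption (the client's default data directory in sd-app) and may differ in your setup.

# Sketch only: the database path is assumed, not confirmed in this thread.
import sqlite3

conn = sqlite3.connect("/home/user/.securedrop_client/svs.sqlite")
for uuid, send_status_id in conn.execute("SELECT uuid, send_status_id FROM draftreplies"):
    print(uuid, send_status_id)  # send_status_id 2 == FAILED, per the note above
conn.close()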

@cfm
Member

cfm commented Feb 23, 2022

@rocodes, thanks for this simpler reproduction, which checks out for me too. I'll dig into the job-queue logic tomorrow.

@cfm
Member

cfm commented Feb 23, 2022

Based on discussion last week with @creviera and @rocodes, and the above steps to reproduce, I can confirm that this behavior is a side effect, or rather a non-effect, of the following stop() calls in securedrop_client.logic.Controller.logout():

self.api_sync.stop()
self.api_job_queue.stop()

We'd asked if it's appropriate for logout to be handled synchronously, outside of the api_job_queue. I think it is, for exactly this reason: the user's wish to log out should preempt, interrupt, stop, and clear that queue, regardless of whether and how the API responds to the logout request. Controller.logout() here preempts and interrupts the queue; ApiJobQueue.stop() stops the threads that process it but does not clear its jobs, as the following (failing) test case demonstrates:

--- a/tests/test_queue.py
+++ b/tests/test_queue.py
@@ -8,6 +8,7 @@ from sdclientapi import RequestTimeoutError, ServerConnectionError
 
 from securedrop_client.api_jobs.base import ApiInaccessibleError, PauseQueueJob
 from securedrop_client.api_jobs.downloads import FileDownloadJob, MessageDownloadJob
+from securedrop_client.api_jobs.uploads import SendReplyJob
 from securedrop_client.queue import ApiJobQueue, RunnableQueue
 from tests import factory
 
@@ -442,6 +443,20 @@ def test_ApiJobQueue_stop_stops_queue_threads(mocker):
     assert not job_queue.main_thread.isRunning()
     assert not job_queue.download_file_thread.isRunning()
 
+def test_ApiJobQueue_stop_clears_jobs(mocker):
+    mock_api = mocker.MagicMock()
+    mock_client = mocker.MagicMock()
+    mock_session_maker = mocker.MagicMock()
+
+    job_queue = ApiJobQueue(mock_client, mock_session_maker)
+    job_queue.start(mock_api)
+
+    job = SendReplyJob("mock", "mock", "mock", "mock")
+    job_queue.enqueue(job)
+    assert job_queue.main_queue.queue.qsize() == 1
+
+    job_queue.stop()
+    assert job_queue.main_queue.queue.empty()

Next I'll investigate whether we ever expect ApiJobQueue.stop() to leave unprocessed jobs available to be resumed later. If so, then we should probably fix this bug directly in Controller.logout(); if not, then this is probably a behavior we want to enforce in ApiJobQueue.
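For concreteness, here is a minimal standalone sketch of what "clearing on stop" amounts to, assuming (as the test above suggests) that RunnableQueue wraps a standard queue.PriorityQueue whose backing list is exposed as its .queue attribute:

# Standalone sketch using a bare PriorityQueue in place of RunnableQueue's
# internal queue; the priority values are arbitrary.
from queue import PriorityQueue

q: PriorityQueue = PriorityQueue()
q.put((13, "SendReplyJob"))
q.put((17, "MessageDownloadJob"))

# Drain under the queue's own mutex so no consumer observes a half-cleared state.
with q.mutex:
    q.queue.clear()
    q.unfinished_tasks = 0

assert q.empty()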

cfm added a commit that referenced this issue Feb 23, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
@sssoleileraaa
Contributor

sssoleileraaa commented Feb 24, 2022

We'd asked if it's appropriate for logout to be handled synchronously, outside of the api_job_queue. I think it is, for exactly this reason: the user's wish to log out should preempt, interrupt, stop, and clear that queue, regardless of whether and how the API responds to the logout request.

Correction: currently logout is handled asynchronously outside of ApiJobQueue (take a look at APICallRunner). If you want to log out synchronously, you'll have to stop using that.

We might want to consider using a separate ApiJobQueue for authentication. Just like sync, it could enqueue a PauseQueueJob to the main queue to stop other jobs.
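A rough sketch of that suggestion, purely for illustration: PauseQueueJob and RunnableQueue.add_job() appear elsewhere in this thread, but the AuthQueue wrapper itself is hypothetical, not the client's actual design.

# Illustration only: the AuthQueue wrapper is hypothetical.
from securedrop_client.api_jobs.base import PauseQueueJob


class AuthQueue:
    def __init__(self, main_queue) -> None:
        self.main_queue = main_queue  # the existing RunnableQueue for API jobs

    def logout(self, api_client) -> None:
        # Pause other jobs before tearing down the session, mirroring how sync
        # pauses the queue on failure.
        self.main_queue.add_job(PauseQueueJob())
        api_client.logout()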

Next I'll investigate whether we ever expect ApiJobQueue.stop() to leave unprocessed jobs available to be resumed later.

In the future, we might consider adding a feature that asks users whether they want to finish processing jobs before logout, but for now, assume we will not want to finish jobs in the queue or resume them later.

@sssoleileraaa
Contributor

I sent ^ before a meeting, so now I'll complete my thought here. I agree with your intuition about making the logout operation a synchronous action. The only reason I can see to add another ApiJobQueue would be for the baked-in network-error handling and retry logic. But a queue that will only ever have one job at a time doesn't really make sense, which is also why we didn't use a queue for the MetadataSyncJob.

cfm added a commit that referenced this issue Feb 25, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
@cfm
Member

cfm commented Feb 25, 2022

Thanks, @creviera! Out of band yesterday we discussed a three-pronged approach to this bug:

  1. clearing the ApiJobQueue when its processing is stopped (rather than merely paused);
  2. perhaps being able to interrupt the job currently being processed, rather than letting it finish or fail before stopping further processing; and
  3. perhaps handling authentication jobs in a dedicated ApiJobQueue.

(1) is drafted in 1420-stale-jobs as of 92047b6, for which I'll open a pull request on Monday. That should be sufficient to fix the bug here, leaving (2) and (3) as possible future improvements.

cfm added a commit that referenced this issue Feb 28, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
@cfm
Member

cfm commented Mar 1, 2022

Per discussion with @creviera, next steps here after #1434:

  1. Invalidate API token immediately at logout (eventually via method on sdclientapi.API)
  2. Make login/out actual ApiJobs that can be retried, possibly in their own queue (a rough sketch follows)
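A rough sketch of what step 2 could look like, assuming jobs follow the existing pattern of subclassing ApiJob and implementing call_api(); LogoutJob itself is hypothetical:

# Hypothetical sketch: ApiJob and api_client.logout() are real; LogoutJob and
# its exact signature are assumptions for illustration.
from sdclientapi import API

from securedrop_client.api_jobs.base import ApiJob


class LogoutJob(ApiJob):
    def call_api(self, api_client: API, session) -> None:
        # Invalidate the server-side token; local state is cleared separately by
        # Controller.invalidate_token(), whether or not this request succeeds.
        api_client.logout()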

@sssoleileraaa
Contributor

  2. Make login/out actual ApiJobs that can be retried, possibly in their own queue

Another thing to consider is firing off a logout ApiJob when the QMainWindow closes. The more I think about it, the more the queue seems like extra overhead: we wouldn't need a queue unless we want to be able to save the job for some reason during a network outage. Since we log out and invalidate the token locally, whether or not there is a successful response from the server, I don't see why we would need to keep the job around. Unless we want to develop a way to keep track of which sessions were never closed on the server so that we could attempt to close previous sessions in new sessions, but that seems like over-engineering, and there's probably more we could do server-side to address unused tokens. Something we could get @lsd-cat's feedback on when we get to this step.

@cfm
Member

cfm commented Mar 3, 2022

New findings on the higher-level approach:

Invalidate API token immediately at logout (eventually via method on sdclientapi.API)

  1. Controller.logout() correctly clears its sdclientapi.API via invalidate_token() whether or not the logout request has succeeded.

However:

  1. RunnableQueue.process() uses the sdclientapi.API with which it was initialized or last start()ed:

    self.current_job._do_call_api(self.api_client, session)

  2. Since #750 ("When token invalid user must log back in"), RunnableQueue.stop() is tested (a) for calling its QThreads' quit() method, not for whether they've actually stopped; and (b) without first starting them, as we can see by patching the test:

    --- a/tests/test_queue.py
    +++ b/tests/test_queue.py
    @@ -435,20 +435,16 @@ def test_ApiJobQueue_start_if_queues_running(mocker):
     
     
     def test_ApiJobQueue_stop_stops_queue_threads(mocker):
    -    job_queue = ApiJobQueue(mocker.MagicMock(), mocker.MagicMock())
    -
    -    job_queue.stop()
    -
    -    assert not job_queue.main_thread.isRunning()
    -    assert not job_queue.download_file_thread.isRunning()
    +    mock_api = mocker.MagicMock()
    +    mock_client = mocker.MagicMock()
    +    mock_session_maker = mocker.MagicMock()
     
    +    job_queue = ApiJobQueue(mock_client, mock_session_maker)
     
    -def test_ApiJobQueue_stop_results_in_queue_threads_not_running(mocker):
    -    job_queue = ApiJobQueue(mocker.MagicMock(), mocker.MagicMock())
    -    job_queue.main_thread = mocker.MagicMock()
    -    job_queue.download_file_thread = mocker.MagicMock()
    +    job_queue.start(mock_api)
    +    assert job_queue.main_thread.isRunning()
    +    assert job_queue.download_file_thread.isRunning()
     
         job_queue.stop()
    -
    -    job_queue.main_thread.quit.assert_called_once_with()
    -    job_queue.download_file_thread.quit.assert_called_once_with()
    +    assert not job_queue.main_thread.isRunning()
    +    assert not job_queue.download_file_thread.isRunning()
    (.venv) user@sd-app:~/securedrop-client$ TESTS=tests/test_queue.py::test_ApiJobQueue_stop_stops_queue_threads make test
    [...]
    [gw1] [100%] FAILED tests/test_queue.py::test_ApiJobQueue_stop_stops_queue_threads 
    
    =================================== FAILURES ===================================
    __________________ test_ApiJobQueue_stop_stops_queue_threads ___________________
    [gw1] linux -- Python 3.7.3 /home/user/securedrop-client/.venv/bin/python
    
    mocker = <pytest_mock.plugin.MockerFixture object at 0x71c2c043e438>
    
        def test_ApiJobQueue_stop_stops_queue_threads(mocker):
            mock_api = mocker.MagicMock()
            mock_client = mocker.MagicMock()
            mock_session_maker = mocker.MagicMock()
        
            job_queue = ApiJobQueue(mock_client, mock_session_maker)
        
            job_queue.start(mock_api)
            assert job_queue.main_thread.isRunning()
            assert job_queue.download_file_thread.isRunning()
        
            job_queue.stop()
    >       assert not job_queue.main_thread.isRunning()
    E       assert not True
    E        +  where True = <built-in method isRunning of QThread object at 0x71c2c04259d8>()
    E        +    where <built-in method isRunning of QThread object at 0x71c2c04259d8> = <PyQt5.QtCore.QThread object at 0x71c2c04259d8>.isRunning
    E        +      where <PyQt5.QtCore.QThread object at 0x71c2c04259d8> = <securedrop_client.queue.ApiJobQueue object at 0x71c2c09c9e58>.main_thread
    
    tests/test_queue.py:449: AssertionError
    [...]

I think (2) points to a self-sufficient fix for the original bug here, if we can clear each RunnableQueue.api_client on logout (see also: #761) and/or on QThread.finished. In attempting to make that change, I've run again into the failure mode described in #1434 (comment), which together with (3) makes me think the QThreads are never finishing on RunnableQueue.stop(). I'll dig back into both tomorrow.
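A sketch of the first option, clearing the stale api_client when the thread finishes. QThread.finished is standard Qt; the wiring below is an assumption, not the client's current code.

# Sketch only: main_thread and main_queue appear in the tests above; the
# cleanup slot is hypothetical.
def wire_session_cleanup(job_queue) -> None:
    def on_finished() -> None:
        job_queue.main_queue.api_client = None  # force re-auth before reuse

    job_queue.main_thread.finished.connect(on_finished)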

@cfm
Member

cfm commented Mar 4, 2022

Incremental progress today:

RunnableQueue.process() uses the sdclientapi.API with which it was initialized or last start()ed

Clarification: RunnableQueue.process() does clear its api_client attribute on an ApiInaccessibleError, such as when a logout request fails. However, a SendReplyJob fails with a SendReplyJobTimeoutError, whereupon it's re-enqueued. As soon as the RunnableQueue is restarted with a new api_client, the SendReplyJob will be processed with it, that is, using the new session, whether it belongs to the same or a different user.

Example: After the PauseQueueJob, when the RunnableQueue resumes, the next job will be the reenqueued SendReplyJob.
Mar  3 16:47:14 localhost 2022-03-03 16:47:14,506 - securedrop_client.queue:176(process) DEBUG: SendReplyJobTimeoutError: Failed to send reply for source 8a288d23-556a-4821-be67-f0ec84e83373 due to Exception: Cannot connect to the server.
Mar  3 16:47:14 localhost 2022-03-03 16:47:14,506 - securedrop_client.queue:110(add_job) DEBUG: Added <securedrop_client.api_jobs.base.PauseQueueJob object at 0x653030503e58> to queue
Mar  3 16:47:14 localhost 2022-03-03 16:47:14,506 - securedrop_client.queue:127(_re_add_job) DEBUG: Added SendReplyJob('b5490155-66aa-496f-b42c-6ac6a86a4a1a', 0) to queue
Mar  3 16:47:14 localhost 2022-03-03 16:47:14,509 - securedrop_client.logic:1129(on_reply_failure) DEBUG: b5490155-66aa-496f-b42c-6ac6a86a4a1a failed to send
Mar  3 16:47:14 localhost 2022-03-03 16:47:14,511 - securedrop_client.queue:254(on_main_queue_paused) DEBUG: Paused main queue

Given this behavior, if we're not going to clear the queues on RunnableQueue.stop(), we'll need to bind jobs somehow to their api_client sessions, for example by failing if the current api_client has a different (token_journalist_uuid, token_expiration) tuple than it was originally created and enqueued with. That smells like a layering violation.
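For illustration, that session-binding idea might look like the sketch below. Nothing here is the real ApiJob implementation; the attribute names come only from the tuple mentioned above.

# Hypothetical illustration, not proposed for merge (see the layering concern
# above).
class StaleSessionError(Exception):
    pass


class SessionBoundJob:
    def __init__(self, api_client) -> None:
        # Record the session identity at enqueue time.
        self.session_id = (api_client.token_journalist_uuid, api_client.token_expiration)

    def assert_same_session(self, api_client) -> None:
        current = (api_client.token_journalist_uuid, api_client.token_expiration)
        if current != self.session_id:
            raise StaleSessionError("job was enqueued under a different session")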

For comparison, @rocodes saw a cousin of this behavior in #1420 (comment), because ApiSyncBackgroundTask, like RunnableQueue, keeps its own api_client. But that's fine: it can just keep trying, and failing, with the old api_client until it's able to reconnect with a new one.

Since #750, RunnableQueue.stop() is tested (a) for calling its QThreads' quit() method, not for whether they've actually stopped; and (b) without first starting them, as we can see by patching the test

This is fixed in dcee3ee. Despite a bit of further tinkering in this branch, the failure mode described in #1434 (comment) remains. I'll follow up there.

@sssoleileraaa
Contributor

Clarification: RunnableQueue.process() does clear its api_client attribute on an ApiInaccessibleError, such as when a logout request fails. However, a SendReplyJob fails with a SendReplyJobTimeoutError, whereupon it's re-enqueued. As soon as the RunnableQueue is restarted with a new api_client, the SendReplyJob will be processed with it, that is, using the new session, whether it belongs to the same or a different user.

Nice investigation work! This checks out, since SendReplyJobTimeoutError is a RequestTimeoutError (see also the queue docstring:

If a RequestTimeoutError or ServerConnectionError is encountered while processing a job, the
job will be added back to the queue, the processing loop will stop, and the paused signal will
be emitted. New jobs can still be added, but the processing function will need to be called
again in order to resume. The processing loop is resumed when the resume signal is emitted.
).

Given this behavior, if we're not going to clear the queues on RunnableQueue.stop(), we'll need to bind jobs somehow to their api_client sessions, for example by failing if the current api_client has a different (token_journalist_uuid, token_expiration) tuple than it was originally created and enqueued with. That smells like a layering violation.

I do like your approach of starting out at a higher level and just clearing the queue on logout. This would get rid of any SendReplyJobs that get re-enqueued during a network connection issue in which the user logs out.

Despite a bit of further tinkering in this branch, the failure mode described in #1434 (comment) remains. I'll follow up there.

👀

cfm added a commit that referenced this issue Mar 8, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
@cfm
Member

cfm commented Mar 10, 2022

To summarize recent work and discussions: #1434 is ready for review to fix this behavior specifically. We're separately reevaluating the SecureDrop API's authentication flow in light of the "failed logout" symptom, but we won't address that here.

cfm added a commit that referenced this issue Mar 15, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
@cfm
Member

cfm commented Mar 17, 2022

In #1434 (comment), @creviera and I have concluded that we're so far unable to fix this bug directly in ApiJobQueue.stop() either:

  1. blockingly, in f735088, by composing RunnableQueue.queue.queue.clear() and QThread.wait(); or
  2. non-blockingly, in e00e243, by enqueuing a highest-priority ClearQueueJob.

That we can't come up with a self-contained patch for this behavior that works consistently across network-level failure modes suggests to us that we've encountered a corner case along the edges of the Client's current handling of network, session, sync, and queue states. We're therefore going to take this bug as our cue (queue ;-) to zoom out, define carefully how we expect the application to behave across these state machines, and then return to code-level work from that understanding.

@sssoleileraaa
Contributor

@cfm this doesn't fix the underlying issue, but there is a bug here: invalidate_token() should be called outside of the conditional, in order to ensure that we update the authenticated_user every time we log out (this is why we saw a pending reply sent as the new user that logged in after the outage):

if self.api is not None:
    self.call_api(self.api.logout, self.on_logout_success, self.on_logout_failure)
    self.invalidate_token()

should be:

if self.api is not None:
    self.call_api(self.api.logout, self.on_logout_success, self.on_logout_failure)

self.invalidate_token()

@sssoleileraaa
Contributor

sssoleileraaa commented Mar 29, 2022

FWIW, a simpler version of this STR:

Quick STR

  1. Log into the client
  2. Stop Tor in the sd-whonix tor-control-panel
  3. Send a reply
  4. Log out
  5. Restart Tor in the sd-whonix tor-control-panel
  6. Wait for a few seconds (no need to log in again)

Expected

For the draft reply to remain in a failed status

Actual

The failed reply is resent

Notes

  • If you send multiple replies ("1", "2", "3", and so on), it seems like only the first pending reply is resent

  • Going back to our discussion earlier today, @cfm asked if we really want to keep replies in a pending status for as long as we do (right now it's until you log out, quit, or the connection issue is fixed and the reply is resent). If we were to remove the following code, pending replies would fail after ~10 seconds and would not be re-enqueued:

    except (RequestTimeoutError, ServerConnectionError) as e:
        message = "Failed to send reply for source {id} due to Exception: {error}".format(
            id=self.source_uuid, error=e
        )
        raise SendReplyJobTimeoutError(message, self.reply_uuid)

    I just want to point out that our intention has been to eventually implement a user-initiated retry action next to failed replies, so removing the code above could be done in tandem with a retry button, but that would require a lot more work. Just something to consider while thinking about this problem.
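    If that change were made, the except clause above could instead surface an ordinary failure. A sketch, assuming SendReplyJobError is the existing non-retrying error type in securedrop_client.api_jobs.uploads (worth confirming before relying on this):

    # Sketch: SendReplyJobError (unlike SendReplyJobTimeoutError) would not
    # trigger the queue's retry path, so the reply goes straight to FAILED.
    except (RequestTimeoutError, ServerConnectionError) as e:
        message = "Failed to send reply for source {id} due to Exception: {error}".format(
            id=self.source_uuid, error=e
        )
        raise SendReplyJobError(message, self.reply_uuid)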

@sssoleileraaa
Contributor

I can't seem to find related Figma prototypes that show the retry (not refresh! whoops) button, but I recall Nina doing some work around this. I'll continue the prototype hunt tomorrow.

@cfm
Member

cfm commented May 11, 2022

@creviera hypothesized today that the ApiJobQueue behavior may have been the wrong tree to bark up here, as outlined in #1457 (comment). I can verify that hypothesis in casual testing of #1486 but will do so more rigorously before marking it ready for review.

@cfm
Member

cfm commented May 19, 2022

See #1486 (comment) and its test plan for clarification versus #1457.

cfm added a commit that referenced this issue May 27, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
cfm added a commit that referenced this issue Nov 8, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
cfm added a commit that referenced this issue Dec 9, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.
cfm added a commit that referenced this issue Dec 22, 2022
This test case currently fails by design, as a reproduction of the
low-level behavior responsible for bug #1420.