
OAuth2 token expiring during a very long discovery makes sync impossible #6814

Closed · opened Oct 15, 2018 by ogoffart · 11 comments
Labels: Discussion, ReadyToTest (QA, please validate the fix/enhancement)

@ogoffart (Contributor)

As reported in #6677 and https://github.com/owncloud/enterprise/issues/2880:

The discovery of a huge tree can take a long time, and if it takes more than one hour, the OAuth2 token expires and the sync aborts.
The next sync then starts discovery from scratch, which again takes more than one hour, so the client never syncs.

Workaround: use selective sync to exclude part of the tree, wait until the remainder syncs successfully, then enable more directories little by little.

@ogoffart (Contributor, Author)

Possible solutions:

  1. Wait for 2.6 and the new_discovery_algorithm branch to be merged. It has a much faster discovery because it works in parallel; hopefully that brings it under 1h. In the long term (although this is not yet implemented), we could start propagation in parallel with discovery, so that parts already propagated would not need to be re-discovered on the next sync.

  2. Have a discovery cache (Caching remote discovery results #2976 (comment)). But this would also be a 2.6 feature, and it would be pointless if we propagate in parallel with discovery.

  3. Maybe the most obvious: instead of aborting the sync when the token expires, just renew the token and continue. However, this is not as simple as it sounds.
    I guess this could be doable if we:

    • In HttpCredentials::slotAuthentication, when using OAuth2, request the new token. We need to introduce a new state while we are requesting a token.
    • Intercept that particular error in AbstractNetworkJob::slotFinished, and redo the request by enqueuing it somewhere. (Where?)
    • Queue any other requests that are made while we are requesting the token.
    • Once we have the new token, redo all the queued requests. If the renewal fails, cancel them all.

All of that is quite some work, and maybe a bit difficult to fix in 2.5.1.
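The queue-and-replay scheme sketched in point 3 can be illustrated with a small state machine. This is a hypothetical stand-in in plain C++ (no Qt): `TokenGate`, its states, and the callback-based jobs are invented names for illustration, not the client's actual classes.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Illustrative sketch of the credentials state machine described above.
class TokenGate {
public:
    enum class State { Valid, Renewing };

    explicit TokenGate(std::string token) : m_token(std::move(token)) {}

    State state() const { return m_state; }
    const std::string &token() const { return m_token; }

    // A job reports that its request failed with "token expired".
    // This switches us into the Renewing state; the job is queued so
    // it can be re-run with the fresh token later.
    void onTokenExpired(std::function<void(const std::string &)> retryJob) {
        m_state = State::Renewing;
        m_queue.push(std::move(retryJob));
    }

    // Jobs submitted while a renewal is in flight are held back too.
    void submit(std::function<void(const std::string &)> job) {
        if (m_state == State::Renewing)
            m_queue.push(std::move(job));
        else
            job(m_token);
    }

    // Renewal succeeded: replay every queued job with the new token.
    void onTokenRenewed(std::string newToken) {
        m_token = std::move(newToken);
        m_state = State::Valid;
        while (!m_queue.empty()) {
            auto job = std::move(m_queue.front());
            m_queue.pop();
            job(m_token);
        }
    }

    // Renewal failed: drop (cancel) everything that was waiting.
    void onRenewalFailed() {
        m_state = State::Valid;
        while (!m_queue.empty())
            m_queue.pop();
    }

private:
    State m_state = State::Valid;
    std::string m_token;
    std::queue<std::function<void(const std::string &)>> m_queue;
};
```

In the real client the queued "jobs" would be re-created network requests, and the renewal would be driven by the credentials object rather than by the jobs themselves.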

@ckamm (Contributor) commented Oct 15, 2018

I think 3 is the only approach that's not just a workaround: there could be odd situations where discovery takes a while longer, or other network activity that gets broken by expiring tokens. Maybe some server admin makes the tokens expire much more frequently.

But as you say, it's also error prone. There could even be multiple requests in flight that all come back with a token-expired error. Restarting requests isn't too hard (it's done for redirects already), but all the queuing and holding would be quite complex.

@guruz (Contributor) commented Oct 15, 2018

@ogoffart @ckamm Why not blindly renew the token every 10 minutes, and use the current latest one for each new request? ("viel renew hilft viel" — roughly, "lots of renewing helps a lot")
(Do this independently of the authentication state machine, i.e. in parallel with everything else.)
(I don't know how this works server-side; will the old and new token both be valid at the same time, or will this somehow clash? CC @DeepDiver1975)
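The "blindly renew" idea boils down to a timer whose interval stays safely below the token lifetime. Here is a minimal sketch of that timing decision, assuming the server reports the lifetime as `expires_in` seconds; the half-lifetime rule and the 30-second floor are arbitrary illustrative choices, not anything from the actual client.

```cpp
#include <algorithm>
#include <cassert>

// Given the token lifetime reported by the server (expires_in, in
// seconds), decide how long to wait before proactively renewing.
// Renewing at half the lifetime keeps the current token valid while
// the renewal request is in flight, assuming old and new tokens can
// overlap. The floor avoids hammering the server on tiny lifetimes.
int renewDelaySeconds(int expiresIn)
{
    const int minimumDelay = 30;
    return std::max(minimumDelay, expiresIn / 2);
}
```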

guruz added this to the 2.5.1 milestone Oct 15, 2018
@ckamm (Contributor) commented Oct 16, 2018

@guruz That does indeed sound easiest, if both tokens were valid at the same time for a bit. It would break if people set the expiry to less than the client's renewal interval, but that sounds acceptable.

@DeepDiver1975 (Member)

Refresh will kill the old token.
Afaik you get back the expiration with the token. Can this help to know when to renew the token?

@guruz (Contributor) commented Oct 16, 2018

@DeepDiver1975 Meh.
It can help, but we'd still need to pause (and queue and later restart) any requests that we're about to send (and pause the sync).
How easy would it be to change the OAuth2 app to keep some tokens alive? (If @ogoffart and @ckamm agree that the client-side approach in #6814 (comment) is OK.)

@DeepDiver1975 (Member)

Are you getting back an explicit error which tells you that the token has expired? You should be...

Following the flow in the spec, refreshing happens after expiration: https://tools.ietf.org/html/rfc6749#section-1.5

Killing the old access token after refresh is a requirement per the spec, if I'm not mistaken.
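For reference, the refresh described in RFC 6749 (section 6) is a plain form-encoded POST to the token endpoint. A sketch of building that request body; real client code would also percent-encode the token value and authenticate with the client credentials the server requires.

```cpp
#include <cassert>
#include <string>

// Build the application/x-www-form-urlencoded body for an
// RFC 6749 section 6 refresh request. The caller POSTs this to the
// server's token endpoint and receives a fresh access token (and,
// typically, a new refresh token and expires_in) in the response.
std::string refreshRequestBody(const std::string &refreshToken)
{
    return "grant_type=refresh_token&refresh_token=" + refreshToken;
}
```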

@ogoffart (Contributor, Author) commented Oct 16, 2018

The server implementation is fine. The problem is the state machine we have on the client side: it needs to support several calls at the same time, and be able to resume arbitrary calls.

Currently, the sync engine aborts and lets the connection validator renew the token; then the sync restarts, from scratch.

As I said in the first comment, we would need to queue the pending calls and redo the calls that failed.
This is definitely doable, but it is not a trivial change.

I'll try implementing it, and then let's see whether we merge it into 2.5 or not.

@guruz (Contributor) commented Oct 16, 2018

we would need to queue the pending calls

There are many pending calls, not only one. Also, outside the sync engine you might have things (e.g. selective-sync folder requests, activities, etc.) that you need to fail gracefully or retry. Maybe a whole new state for each network job? PendingAuthTokenChange?

@ogoffart (Contributor, Author)

My idea was to add a state in the HTTP credentials.
It is possible, in AbstractNetworkJob::slotFinished, to detect the jobs that failed for this reason, and to restart or enqueue them, depending on the state.
The problem is that I don't know how to avoid starting new jobs, as they kind of create the QNetworkReply directly. Maybe by just aborting them so we reach AbstractNetworkJob::slotFinished?

I actually hope that the jobs can be restarted. The redirection code seems to restart them, so I assume it is possible.

@ckamm (Contributor) commented Oct 17, 2018

Most request creation already funnels through Account::sendRawRequest and AbstractNetworkJob::sendRequest, but it does indeed expect that a QNetworkReply will be available immediately. I imagine most callers don't really care about the reply being available immediately?

Some do use it, though, and lots of places call AbstractNetworkJob::reply (though usually only once finished). I imagine making it possibly return 0 would need careful investigation of all of these.

Restarting is indeed already done for redirects: it works if the body isn't sequential (I don't think we ever use that) and if there's no implicit body. Implicit bodies happen for POST requests with URL query parameters: Qt will automatically move the URL args into the request body, and I didn't find a way to access that body or the URL args afterwards. We could work around that by doing it ourselves.
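The workaround mentioned above (handling the URL-args-to-body move ourselves) could look roughly like this: split the query off the URL before sending, and keep it as an explicit body, so the whole request can be rebuilt and resent after a token renewal. This is a plain-string sketch under that assumption, not the actual client code.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split a URL into (base URL without query, query string). The query
// string can then be sent as an explicit POST body that we keep a copy
// of, instead of letting the HTTP layer move the URL args into an
// implicit body that cannot be recovered for a restart.
std::pair<std::string, std::string> splitQueryIntoBody(const std::string &url)
{
    const auto pos = url.find('?');
    if (pos == std::string::npos)
        return {url, std::string()};
    return {url.substr(0, pos), url.substr(pos + 1)};
}
```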

ogoffart added a commit that referenced this issue Oct 17, 2018
An OAuth2 access token is typically only valid for 1 hour.
Before this patch, when the token timed out during the sync, the
sync was aborted, and the ConnectionValidator then requested a new
token so the sync could be started over.
If the discovery takes longer than the OAuth2 token's validity, this
means the sync can never proceed, as it would always be restarted from
scratch.

With this patch, we try to transparently renew the OAuth2 token and
restart the jobs that failed because the access token was invalid.

Note that some changes were required in the GETFile job: because it
handled the error itself, it was failing the jobs before the token
renewal could take over.

Issue #6814
ogoffart added a commit that referenced this issue Oct 17, 2018 (same commit message as above)
ogoffart added a commit that referenced this issue Oct 18, 2018 (same commit message as above)
ogoffart added a commit that referenced this issue Oct 18, 2018 (same commit message as above)
ogoffart added a commit that referenced this issue Oct 19, 2018 (same commit message as above)
ogoffart added the ReadyToTest (QA, please validate the fix/enhancement) label Oct 19, 2018

4 participants