Add retries to restore dataframe #408
Conversation
Co-authored-by: Nefta Kanilmaz <nefta.kanilmaz@gmail.com>
Can you add another test for the exponential backoff? It could be implemented by patching the sleep function and asserting the arguments it was called with. Also, in the other tests you could patch the sleep function as a no-op so that there is no unwanted wait time during tests.
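A minimal sketch of what such a test could look like, assuming the retry loop calls `time.sleep` with the computed delay. The inline `restore_with_retries` stand-in, the 1 s base delay, and the three-retry budget are assumptions for illustration, not the actual PR code:

```python
import time
from unittest import mock


def restore_with_retries(restore, max_retries=3, base_delay=1.0):
    # Stand-in for the retry loop under test: retry on IOError with an
    # exponentially growing sleep between attempts.
    for attempt in range(max_retries + 1):
        try:
            return restore()
        except IOError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)


def test_backoff_is_exponential():
    always_failing = mock.Mock(side_effect=IOError("flaky store"))
    # Patch time.sleep so the test runs instantly and so we can inspect
    # the waiting times the backoff requested.
    with mock.patch("time.sleep") as fake_sleep:
        try:
            restore_with_retries(always_failing)
        except IOError:
            pass  # expected once retries are exhausted
    waits = [call.args[0] for call in fake_sleep.call_args_list]
    assert waits == [1.0, 2.0, 4.0]
```

Patching sleep as a no-op in the other tests works the same way, just without asserting on the recorded calls.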
imho, I do not think we need to test the backoff. At some point we'll need to draw a line between pragmatism and thoroughness, and I have the feeling that testing this blows the change out of proportion. After all, this is not core functionality of kartothek but merely a patch. Whether or not this backs off should not even matter.
doc builds fail (yay, so glad I put this into GH actions; not yay for you :) )
Perhaps a matter of opinion, but I think if the backoff is implemented it should also be tested. That's the price you pay for not using an already-tested library that implements this functionality for you. I also think that a bug in the retry logic could have quite severe consequences...
I left that out on purpose, because I am not sure what we gain by testing that the waiting times are exponential -- to me, this sounds like an implementation detail and not the actual functionality.
Well, it is a feature of Kartothek now because it has been implemented this way. I also don't follow your reasoning: should details of the implementation not be tested, even if they are implemented from scratch? I do agree that such detailed retry implementations should not be part of Kartothek, hence my suggestion of using an established library for that. Imo what we gain is protection against a bug in this "implementation detail". Let's say a bug is somehow introduced that causes the backoff to be on the scale of hours instead of milliseconds; I think a test to prevent such regressions is beneficial.
No, I think they shouldn't, for several reasons: I want to test the functionality (gaining more stability on certain kinds of errors by retrying) and not how this functionality is implemented (a hand-made exponential retry). I currently see two ways forward in this discussion:
I am fine either way, with a slight preference for the pragmatic approach.
Force-pushed from c3deae6 to 6ed3a52 (compare)
We could also just remove the backoff and use a fixed wait time to reduce the complexity of the untested code.
Whether one decides to test implementation details or not is a tough line to walk. Usually you do not want to be too close to the implementation, and rather test on a high level to ensure the behaviour of your library is what you would expect. In my opinion, the backoff here is an example of a detail we should skip.
If you sleep for a fixed time, this is still a backoff and we'd still have the same problem.
That would fail the CI due to a timeout ;)
Well I think all arguments have been put forth, I'll leave it up to you. In any case this should not block us from merging this important fix.
nonlocal retry_count
retry_count += 1

if not retry_count > 1:
The typical way to write this would be retry_count <= 1 instead of not retry_count > 1.
Looks like this is caused by Azure/azure-sdk-for-python#16723
Description:
Same as in #401
I don't have write access to lr4d's fork, so I can't push on his branch.
Baseline: Add retries for restoring dataframes. We have seen IOErrors on long-running ktk + dask tasks with no clear idea what the root cause is. Therefore, retry the deserialization to gain more stability until the root cause is fixed.
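For context, a hedged sketch of the kind of retry loop this change introduces. The function name, attempt budget, and base delay below are assumptions for illustration; the authoritative version is the PR diff:

```python
import time


def restore_dataframes_with_retries(restore, *args, max_retries=3, base_delay=1.0, **kwargs):
    # Retry only on IOError, since that is the failure observed on
    # long-running ktk + dask tasks; back off exponentially in between.
    for attempt in range(max_retries + 1):
        try:
            return restore(*args, **kwargs)
        except IOError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the original error
            time.sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
```

In practice this wrapper would be applied around the serializer's restore call, so only the known-flaky read path is retried while all other exceptions propagate unchanged.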