
Add retries to restore dataframe #408

Merged

Conversation

NeroCorleone (Contributor)

Description:

Same as in #401
I don't have write access to lr4d's fork, so I can't push to his branch.

Baseline: Add retries for restoring dataframes. We have seen IOErrors on long-running ktk + dask tasks without a clear idea of the root cause. Therefore, retry the serialization to gain more stability until the root cause is fixed.
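As an aside, here is a minimal sketch of the retry idea under discussion, assuming hypothetical names and constants (`restore_with_retries`, `MAX_RETRIES`, `BACKOFF_BASE_SECONDS`); the actual helper added in the PR may differ:

```python
import time

# Hypothetical constants; the values used in the PR may differ.
MAX_RETRIES = 6
BACKOFF_BASE_SECONDS = 0.01


def restore_with_retries(restore_func, *args, **kwargs):
    """Call ``restore_func`` and retry on IOError with exponential backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return restore_func(*args, **kwargs)
        except IOError:
            if attempt == MAX_RETRIES - 1:
                raise  # out of retries, surface the original error
            time.sleep(BACKOFF_BASE_SECONDS * 2 ** attempt)
```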

@stephan-hesselmann-by (Collaborator) left a comment

Can you add another test for the exponential backoff? It could be implemented by patching the sleep function and asserting the arguments it was called with. Also, in the other tests you could patch the sleep function as a no-op so that there is no unwanted wait time during tests.
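A sketch of how such a test could look, assuming the hypothetical `restore_with_retries` helper from the sketch above (which sleeps via `time.sleep`); names and module paths are illustrative, not the PR's actual test code:

```python
from unittest import mock

import pytest


def test_backoff_is_exponential():
    def always_fail(*args, **kwargs):
        raise IOError("simulated storage hiccup")

    # Patch sleep so the test does not actually wait, then inspect the
    # arguments it was called with.
    with mock.patch("time.sleep") as mocked_sleep:
        with pytest.raises(IOError):
            restore_with_retries(always_fail)

    waits = [call.args[0] for call in mocked_sleep.call_args_list]
    assert waits, "expected at least one backoff sleep"
    # Each wait should double the previous one.
    assert all(later == 2 * earlier for earlier, later in zip(waits, waits[1:]))
```

Patching `time.sleep` also makes the test run instantly, which addresses the second point about unwanted wait times.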

@fjetter (Collaborator) commented Feb 10, 2021

IMHO, I do not think we'll need to test the backoff. At some point we'll need to draw a line between pragmatism and thoroughness, and I have the feeling that testing this blows the change out of proportion. After all, this is not the core functionality of kartothek but merely a patch; whether or not this is backed off should not even matter.

@fjetter (Collaborator) commented Feb 10, 2021

Doc builds fail (yay, so glad I put this into GH Actions; not yay for you :) )

CHANGES.rst review thread (outdated, resolved)
@stephan-hesselmann-by (Collaborator)

> IMHO, I do not think we'll need to test the backoff. At some point we'll need to draw a line between pragmatism and thoroughness, and I have the feeling that testing this blows the change out of proportion. After all, this is not the core functionality of kartothek but merely a patch; whether or not this is backed off should not even matter.

Perhaps a matter of opinion, but I think if the backoff is implemented it should also be tested. I think that's the price you pay for not using an already-tested library that implements this functionality for you. I also think that a bug in the retry logic could have quite severe consequences...

@NeroCorleone (Contributor, Author)

> Can you add another test for the exponential backoff?

I left that out on purpose, because I am not sure what we gain by testing that the waiting times are exponential -- to me, this sounds like an implementation detail and not the actual functionality.

@stephan-hesselmann-by (Collaborator) commented Feb 10, 2021

> Can you add another test for the exponential backoff?
>
> I left that out on purpose, because I am not sure what we gain by testing that the waiting times are exponential -- to me, this sounds like an implementation detail and not the actual functionality.

Well, it is a feature of Kartothek now because it has been implemented this way. I also don't get your reasoning: Should details of the implementation not be tested? Even if they are implemented from scratch? I do agree that such detailed retry implementations should not be part of Kartothek, hence my suggestion to use an established library for that.

IMO, what we gain is protection against a bug in this "implementation detail". Let's say somehow a bug is introduced that causes the backoff to be on the scale of hours instead of ms. I think a test to prevent such regressions is beneficial.
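For reference, the established-library route could look roughly like this, using tenacity as one example; the decorated function, its parameters, and the retry settings are placeholders, not kartothek's actual API:

```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)


# Placeholder function: in kartothek this would wrap the actual
# dataframe-restoring call; the parameters here are illustrative only.
@retry(
    retry=retry_if_exception_type(IOError),
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=0.01),
    reraise=True,
)
def restore_dataframe_from_store(store, key):
    return store.get(key)
```

This keeps the exponential backoff behaviour while delegating the already-tested retry machinery to the library.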

@NeroCorleone (Contributor, Author)

> Should details of the implementation not be tested?

No, I think they shouldn't, for several reasons: I want to test the functionality (gaining more stability on certain kinds of errors by retrying) and not how this functionality is implemented (a hand-made exponential retry).
However, I see your point about potential bugs from accidentally increasing the back-off time, and I also like the idea of using an existing library, so we don't have to implement the retry manually.

I am currently seeing two ways forward in this discussion:

  1. Replace the handmade retry with something that is proven to work (would require a little more work).
  2. Mitigate the risk of an accidental increase of the back-off time by adding an explicit comment (would be a weak safeguard).

I am fine either way, with a slight preference for the pragmatic approach.

@NeroCorleone force-pushed the add_retries_to_restore_dataframe branch from c3deae6 to 6ed3a52 on February 10, 2021 15:59
@stephan-hesselmann-by (Collaborator)

We could also just remove the backoff and use a fixed time to reduce the complexity of the untested code part.

@fjetter (Collaborator) commented Feb 10, 2021

Whether one decides to test implementation details or not is a tough line to walk. Usually you do not want to be too close to the implementation and would rather test at a high level to ensure the behaviour of your library is what you would expect. In my opinion, the backoff here is an example of a detail we should skip.

@fjetter (Collaborator) commented Feb 10, 2021

> We could also just remove the backoff and use a fixed time to reduce the complexity of the untested code part.

If you sleep for a fixed time, this is still a backoff and we'd still have the same problem.

@fjetter (Collaborator) commented Feb 10, 2021

> Let's say somehow a bug is introduced that causes the backoff to be on the scale of hours instead of ms.

That would fail the CI due to a timeout ;)

@stephan-hesselmann-by (Collaborator) left a comment

Well, I think all arguments have been put forth; I'll leave it up to you. In any case, this should not block us from merging this important fix.

nonlocal retry_count
retry_count += 1

if not retry_count > 1:
Collaborator review comment:

The typical way to write this would be retry_count <= 1 instead of not retry_count > 1.
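For illustration, a failing-once test double written with the suggested condition; the helper name and error message are placeholders rather than the PR's exact test code:

```python
def make_fail_once(func):
    """Wrap ``func`` so the first call raises IOError and later calls succeed."""
    retry_count = 0

    def wrapper(*args, **kwargs):
        nonlocal retry_count
        retry_count += 1
        if retry_count <= 1:  # fail only on the first call
            raise IOError("simulated failure on first attempt")
        return func(*args, **kwargs)

    return wrapper
```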

@fjetter merged commit 5f80d74 into JDASoftwareGroup:master on Feb 11, 2021
@fjetter (Collaborator) commented Feb 17, 2021

Looks like this is caused by Azure/azure-sdk-for-python#16723

@NeroCorleone deleted the add_retries_to_restore_dataframe branch on April 7, 2021 07:42