
Add retries to restore dataframe #408

Merged

Conversation

NeroCorleone (Contributor)

Description:

Same as in #401
I don't have write access to lr4d's fork, so I can't push to his branch.

Baseline: Add retries for restoring dataframes. We have seen IOErrors on long-running ktk + dask tasks without a clear idea of the root cause. Therefore, retry the serialization to gain more stability until the root cause is fixed.
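As an aside, here is a minimal sketch of the retry idea under discussion, assuming hypothetical names and constants (`restore_with_retries`, `MAX_RETRIES`, `BACKOFF_BASE_SECONDS`); the actual helper added in the PR may differ:

```python
import time

# Hypothetical constants; the values used in the PR may differ.
MAX_RETRIES = 6
BACKOFF_BASE_SECONDS = 0.01


def restore_with_retries(restore_func, *args, **kwargs):
    """Call ``restore_func`` and retry on IOError with exponential backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return restore_func(*args, **kwargs)
        except IOError:
            if attempt == MAX_RETRIES - 1:
                raise  # out of retries, surface the original error
            time.sleep(BACKOFF_BASE_SECONDS * 2 ** attempt)
```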

@stephan-hesselmann-by (Collaborator) left a comment

Can you add another test for the exponential backoff? It could be implemented by patching the sleep function and asserting the arguments it was called with. Also, in the other tests you could patch the sleep function as a no-op so that there is no unwanted wait time during tests.
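A sketch of how such a test could look, assuming the hypothetical `restore_with_retries` helper from the sketch above (which sleeps via `time.sleep`); names and module paths are illustrative, not the PR's actual test code:

```python
from unittest import mock

import pytest


def test_backoff_is_exponential():
    def always_fail(*args, **kwargs):
        raise IOError("simulated storage hiccup")

    # Patch sleep so the test does not actually wait, then inspect the
    # arguments it was called with.
    with mock.patch("time.sleep") as mocked_sleep:
        with pytest.raises(IOError):
            restore_with_retries(always_fail)

    waits = [call.args[0] for call in mocked_sleep.call_args_list]
    assert waits, "expected at least one backoff sleep"
    # Each wait should double the previous one.
    assert all(later == 2 * earlier for earlier, later in zip(waits, waits[1:]))
```

Patching `time.sleep` also makes the test run instantly, which addresses the second point about unwanted wait times.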

@fjetter (Collaborator) commented Feb 10, 2021

IMHO, I do not think we'll need to test the backoff. At some point we'll need to draw a line between pragmatism and thoroughness, and I have the feeling that testing this blows the change out of proportion. After all, this is not the core functionality of kartothek but merely a patch; whether or not this is backed off should not even matter.

@fjetter (Collaborator) commented Feb 10, 2021

Doc builds fail (yay, so glad I put this into GH Actions; not yay for you :) )

CHANGES.rst review thread (outdated, resolved)
@stephan-hesselmann-by (Collaborator)

> IMHO, I do not think we'll need to test the backoff. At some point we'll need to draw a line between pragmatism and thoroughness, and I have the feeling that testing this blows the change out of proportion. After all, this is not the core functionality of kartothek but merely a patch; whether or not this is backed off should not even matter.

Perhaps a matter of opinion, but I think if the backoff is implemented it should also be tested. I think that's the price you pay for not using an already-tested library that implements this functionality for you. I also think that a bug in the retry logic could have quite severe consequences...

@NeroCorleone (Contributor, Author)

> Can you add another test for the exponential backoff?

I left that out on purpose, because I am not sure what we gain by testing that the waiting times are exponential -- to me, this sounds like an implementation detail and not the actual functionality.

@stephan-hesselmann-by (Collaborator) commented Feb 10, 2021

> Can you add another test for the exponential backoff?
>
> I left that out on purpose, because I am not sure what we gain by testing that the waiting times are exponential -- to me, this sounds like an implementation detail and not the actual functionality.

Well, it is a feature of Kartothek now because it has been implemented this way. I also don't get your reasoning: Should details of the implementation not be tested? Even if they are implemented from scratch? I do agree that such detailed retry implementations should not be part of Kartothek, hence my suggestion to use an established library for that.

IMO, what we gain is protection against a bug in this "implementation detail". Let's say somehow a bug is introduced that causes the backoff to be on the scale of hours instead of ms. I think a test to prevent such regressions is beneficial.
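For reference, the established-library route could look roughly like this, using tenacity as one example; the decorated function, its parameters, and the retry settings are placeholders, not kartothek's actual API:

```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)


# Placeholder function: in kartothek this would wrap the actual
# dataframe-restoring call; the parameters here are illustrative only.
@retry(
    retry=retry_if_exception_type(IOError),
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=0.01),
    reraise=True,
)
def restore_dataframe_from_store(store, key):
    return store.get(key)
```

This keeps the exponential backoff behaviour while delegating the already-tested retry machinery to the library.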

@NeroCorleone (Contributor, Author)

> Should details of the implementation not be tested?

No, I think they shouldn't, for several reasons: I want to test the functionality (gaining more stability on certain kinds of errors by retrying) and not how this functionality is implemented (a hand-made exponential retry).
However, I see your point about potential bugs from accidentally increasing the back-off time, and I also like the idea of using an existing library, so we don't have to implement the retry manually.

I am currently seeing two ways forward in this discussion:

  1. Replace the handmade retry with something that is proven to work (would require a little more work).
  2. Mitigate the risk of an accidental increase of the back-off time by adding an explicit comment (would be a weak safeguard).

I am fine either way, with a slight preference for the pragmatic approach.

@NeroCorleone force-pushed the add_retries_to_restore_dataframe branch from c3deae6 to 6ed3a52 on February 10, 2021 15:59
@stephan-hesselmann-by (Collaborator)

We could also just remove the backoff and use a fixed time to reduce the complexity of the untested code part.

@fjetter (Collaborator) commented Feb 10, 2021

Whether one decides to test implementation details or not is a tough line to walk. Usually you do not want to be too close to the implementation and would rather test at a high level to ensure the behaviour of your library is what you would expect. In my opinion, the backoff here is an example of a detail we should skip.

@fjetter (Collaborator) commented Feb 10, 2021

> We could also just remove the backoff and use a fixed time to reduce the complexity of the untested code part.

If you sleep for a fixed time, this is still a backoff and we'd still have the same problem.

@fjetter (Collaborator) commented Feb 10, 2021

> Let's say somehow a bug is introduced that causes the backoff to be on the scale of hours instead of ms.

That would fail the CI due to a timeout ;)

@stephan-hesselmann-by (Collaborator) left a comment

Well, I think all arguments have been put forth; I'll leave it up to you. In any case, this should not block us from merging this important fix.

nonlocal retry_count
retry_count += 1

if not retry_count > 1:
Collaborator review comment:

The typical way to write this would be retry_count <= 1 instead of not retry_count > 1.
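For illustration, a failing-once test double written with the suggested condition; the helper name and error message are placeholders rather than the PR's exact test code:

```python
def make_fail_once(func):
    """Wrap ``func`` so the first call raises IOError and later calls succeed."""
    retry_count = 0

    def wrapper(*args, **kwargs):
        nonlocal retry_count
        retry_count += 1
        if retry_count <= 1:  # fail only on the first call
            raise IOError("simulated failure on first attempt")
        return func(*args, **kwargs)

    return wrapper
```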

@fjetter merged commit 5f80d74 into JDASoftwareGroup:master on Feb 11, 2021
@fjetter (Collaborator) commented Feb 17, 2021

Looks like this is caused by Azure/azure-sdk-for-python#16723

@NeroCorleone deleted the add_retries_to_restore_dataframe branch on April 7, 2021 07:42