add retries for ParquetSerializer.restore_dataframe #401
Conversation
Co-authored-by: Nefta Kanilmaz <nefta.kanilmaz@gmail.com>
kartothek/serialization/_parquet.py
Outdated
```diff
@@ -195,6 +163,94 @@ def store(self, store, key_prefix, df):
         return key


 def _restore_dataframe(
```
just a note regarding git diffs and git blame:

git has a much easier time figuring out the history if you do not change too much. Instead, you're moving this entire code section, i.e. we're effectively losing the easier git blame.

This is mostly a style note, but I would prefer to define this as a staticmethod as well to keep diffs small.
Made `_restore_dataframe` into a static method and `restore_dataframe` into a class method.
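A sketch of the resulting layout (bodies elided; only the method kinds come from this thread, the base class name from kartothek's serialization module):

```python
class ParquetSerializer(DataFrameSerializer):
    @classmethod
    def restore_dataframe(cls, store, key, **kwargs):
        # Public entry point; wraps the static read method (with the
        # retry loop discussed below).
        return cls._restore_dataframe(store, key, **kwargs)

    @staticmethod
    def _restore_dataframe(store, key, **kwargs):
        # The actual Parquet read, kept close to the old code so the
        # diff (and git blame) stays small.
        ...
```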
```python
                if nb_retry < (MAX_NB_RETRIES - 1):
                    time.sleep(BACKOFF_TIME * 2 ** nb_retry)

        raise ParquetReadError(
```
What's your reasoning behind introducing a dedicated class of exceptions? The information it contains is identical to the log message and the type of exception is not reused anywhere
The exception type describes the problem, and it being a subtype of `IOError` is appropriate imo. I'd rather be too specific here than too broad.
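For reference, such a dedicated exception can stay minimal; a sketch (the docstring wording is mine):

```python
class ParquetReadError(IOError):
    """Raised when reading a Parquet file fails even after retries."""
```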
```python
        filter_query: Optional[str] = None,
        columns: Optional[Iterable[str]] = None,
        predicate_pushdown_to_io: bool = True,
        categories: Optional[Iterable[str]] = None,
```
I think you need a stronger guarantee here than `Iterable`; after all, the first try may consume those iterators, leaving wrong arguments for the second try.
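A quick illustration of that hazard (hypothetical values):

```python
columns = (c for c in ["a", "b"])  # a generator is a valid Iterable[str]

print(list(columns))  # first attempt consumes it: ['a', 'b']
print(list(columns))  # a retry sees an exhausted iterator: []
```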
Probably `Sequence` is then a better candidate. It is very likely that this is not 100% properly done in many places.

Imho, it's not entirely clear, though. Afaik, the difference between `Sequence` and `Iterable` is that a sequence must define `__len__` and `__getitem__`, but that would exclude sets. From your comment I would judge we need something like "iterable but not generator", but I do not know which type that should be.
If you really need to support sets and lists, I think the right type is `Collection[str]`. OTOH, why not simply use something concrete here that reflects how it's used? For instance, `categories` needs to be a list already, as it's passed to an external function expecting a list. `columns` can't be a set today because some functions call `columns[:]` to copy it, which does not work for sets ...
If it is that explicit, I would argue it is even mandatory. Although for `[:]` we do not need a `List` but merely a `Sequence` (e.g. it works on tuples as well).

Using a sequence/iterable instead of an explicit list is a recommendation of mypy, since e.g. `Sequence` is covariant in its type variable while `List` isn't. It also makes it very explicit that the object is not mutated and allows for more compatible typing. Some documentation about this can be found at https://mypy.readthedocs.io/en/latest/generics.html#variance-of-generic-types

That all said, I'm fine either way. If mypy doesn't complain, it's good enough for me.
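To make the variance argument concrete, a minimal sketch of what mypy accepts and rejects (names are made up):

```python
from typing import List, Sequence

class Animal: ...
class Dog(Animal): ...

def count(animals: Sequence[Animal]) -> int:
    return len(animals)

def adopt(animals: List[Animal]) -> None:
    animals.append(Animal())

dogs: List[Dog] = [Dog()]
count(dogs)  # OK: Sequence is covariant, so List[Dog] is a Sequence[Animal]
adopt(dogs)  # mypy error: List is invariant, List[Dog] is not List[Animal]
```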
For my own curiosity:

```python
In [3]: from collections.abc import Iterable, Sequence

In [4]: isinstance([1, 2], Sequence)
Out[4]: True

In [5]: isinstance((1, 2), Sequence)
Out[5]: True

In [6]: isinstance((1, 2), Iterable)
Out[6]: True

In [7]: isinstance({1, 2}, Iterable)
Out[7]: True

In [8]: isinstance({1, 2}, Sequence)
Out[8]: False

In [9]: isinstance((x for x in [1, 2]), Iterable)
Out[9]: True

In [10]: isinstance((x for x in [1, 2]), Sequence)
Out[10]: False
```

generators are not sequences (makes sense: no `__len__`, no `__getitem__`, ...)
In my personal opinion, this should be `set[str]`, the logical data type, but that would break the tests and is not really the "paradigm" we're using here anyways. I'd also rather have the type be explicit than rely on duck typing.

I just copied the original function signature; we can probably have this debate somewhere else.
As I said, `set` is wrong. Why not just change it to `list`?
```python
        categories: Optional[Iterable[str]] = None,
        predicates: Optional[PredicatesType] = None,
        date_as_object: bool = False,
    ) -> pd.DataFrame:
```
There should be some explanation here as to why there is a retry, especially why there is a retry on AssertionErrors, which is typically considered not a best practice and dangerous: AssertionErrors usually point to logic errors, and logic errors can leave the program in an invalid state which is still present for the retry. Therefore, the usual argument is to simply abort on assertion errors. So it's worth explaining why we deviate from this here.
I will just change the underlying error to IOError since that's more appropriate than AssertionError
Added a comment and changed the BlockBuffer error to an error type inheriting from `IOError`.

I'm fine with the changes here. Still, implementing a retry on this abstraction level really doesn't sit well with me. Can we at least create one GitHub issue with the Parquet AssertionError we've seen? Afaik, we do not have any public breadcrumbs at all, and I believe it helps justify this retry if the issue is then linked in the code.
```python
        # XXX: We have been seeing weird `IOError` while reading Parquet files from Azure Blob Store,
        # thus, we implement retries at this point. This code should not live forever, it should be
        # removed once the underlying cause has been resolved.
        for nb_retry in range(MAX_NB_RETRIES):
```
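Pieced together from the fragments quoted in this thread, the whole retry wrapper would look roughly like this sketch (the constants' values, the log message, and the exception chaining are my assumptions; `_restore_dataframe` stands for the actual read implementation):

```python
import logging
import time

MAX_NB_RETRIES = 6  # assumed; yields the five backoffs 0.01-0.16 s from the PR description
BACKOFF_TIME = 0.01  # seconds, assumed

logger = logging.getLogger(__name__)


class ParquetReadError(IOError):
    """Raised when reading a Parquet file fails even after retries."""


def restore_dataframe(store, key, **kwargs):
    for nb_retry in range(MAX_NB_RETRIES):
        try:
            return _restore_dataframe(store, key, **kwargs)
        except OSError as err:  # IOError is an alias of OSError in Python 3
            saved_err = err
            logger.warning("Failed to read %s, retrying", key, exc_info=True)
            # Exponential backoff after every failed attempt except the last.
            if nb_retry < (MAX_NB_RETRIES - 1):
                time.sleep(BACKOFF_TIME * 2 ** nb_retry)
    raise ParquetReadError(
        f"Failed to restore dataframe after {MAX_NB_RETRIES} attempts"
    ) from saved_err
```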
Is there a reason you implemented the retry logic yourself? I would suggest using an established library like tenacity. It also has nice features such as being able to override the wait function in tests.

Also, I think you should add tests for the retry logic.
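For comparison, a tenacity-based version might look like this sketch (parameter values mirror the hand-rolled loop; the exact wait sequence depends on `wait_exponential`'s semantics):

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(OSError),  # only retry I/O failures
    stop=stop_after_attempt(6),              # cf. MAX_NB_RETRIES
    wait=wait_exponential(multiplier=0.01),  # cf. BACKOFF_TIME
    reraise=True,                            # raise the last error, not a RetryError
)
def restore_dataframe(store, key, **kwargs):
    return _restore_dataframe(store, key, **kwargs)
```

In tests, the waiting could then be disabled without touching production values, e.g. via `restore_dataframe.retry.wait = tenacity.wait_none()`.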
I am very strongly against exposing this as some kind of configuration. Current investigations hint towards a bug in our azure-storage interface and/or the iobuffer interface.

Considering our simple requirement, I am in favour of this approach rather than introducing tenacity as a core dependency.
Fair enough. But especially with the self-baked retries, I think tests are important because the retry now contains untested code logic.
Fair point about the testing. I would suggest testing this by patching the `_restore_dataframe` method to raise once but succeed on the second try. While such a test would be very close to the implementation, I believe that is justified in this case.
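A sketch of that test with unittest.mock (the import path follows the file under review; the call signature and dummy arguments are assumptions):

```python
from unittest import mock

import pandas as pd

from kartothek.serialization._parquet import ParquetSerializer


def test_restore_dataframe_retries_once_on_oserror():
    expected = pd.DataFrame({"a": [1]})
    # Fail the first call, succeed on the second.
    flaky = mock.Mock(side_effect=[OSError("transient blob store error"), expected])
    with mock.patch.object(ParquetSerializer, "_restore_dataframe", flaky), \
            mock.patch("time.sleep"):  # don't actually back off in tests
        result = ParquetSerializer.restore_dataframe(store=None, key="part.parquet")
    assert flaky.call_count == 2
    pd.testing.assert_frame_equal(result, expected)
```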
We expect the following behavior:

- Retry on `OSError` up to `MAX_NB_RETRIES` times, or fewer on success. Log on each retry.
- Don't retry on other exceptions.
- Raise `ParquetReadError` after `MAX_NB_RETRIES`.
- Retry waits use exponential backoff.

I think all of these could be individual tests.
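For instance, the backoff sequence and the final error could be checked like this (a sketch; it assumes the constants and `ParquetReadError` are importable from the module):

```python
from unittest import mock

import pytest

from kartothek.serialization._parquet import (
    BACKOFF_TIME,
    MAX_NB_RETRIES,
    ParquetReadError,
    ParquetSerializer,
)


def test_backoff_and_final_error():
    failing = mock.Mock(side_effect=OSError("boom"))
    with mock.patch.object(ParquetSerializer, "_restore_dataframe", failing), \
            mock.patch("time.sleep") as fake_sleep:
        with pytest.raises(ParquetReadError):
            ParquetSerializer.restore_dataframe(store=None, key="part.parquet")
    assert failing.call_count == MAX_NB_RETRIES
    # One exponential backoff after every failed attempt except the last.
    waits = [c.args[0] for c in fake_sleep.call_args_list]
    assert waits == [BACKOFF_TIME * 2 ** n for n in range(MAX_NB_RETRIES - 1)]
```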
My 5 cents:
- I am all for keeping it simple and stupid and will keep the current retry approach -- especially because we should get rid of it once we can confirm the root cause.
- Tests +1, the how is unimportant, the behaviour that Stephan mentioned should probably be captured.
Retries in this setup go from 10 ms to 160 ms (specifically: 0.01, 0.02, 0.04, 0.08, 0.16 seconds).

This supersedes the first commit from #397.
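Assuming `BACKOFF_TIME = 0.01`, those values fall directly out of the backoff formula from the diff:

```python
>>> BACKOFF_TIME = 0.01  # seconds, assumed
>>> [BACKOFF_TIME * 2 ** nb_retry for nb_retry in range(5)]
[0.01, 0.02, 0.04, 0.08, 0.16]
```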