add retries for ParquetSerializer.restore_dataframe #401
Conversation
Co-authored-by: Nefta Kanilmaz <nefta.kanilmaz@gmail.com>
kartothek/serialization/_parquet.py
Outdated
```diff
@@ -195,6 +163,94 @@ def store(self, store, key_prefix, df):
         return key


 def _restore_dataframe(
```
just a note regarding git diffs and git blame:

git has a much easier time figuring out the history if you do not change too much. Instead, you're moving this entire code section, i.e. we're effectively losing the easier git blame.

This is mostly a style note, but I would prefer to define this as a staticmethod as well to keep diffs small.
Made `_restore_dataframe` into a static method and `restore_dataframe` into a class method.
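A sketch of the resulting layout (bodies elided; only the method kinds come from this thread, the base class name from kartothek's serialization module):

```python
class ParquetSerializer(DataFrameSerializer):
    @classmethod
    def restore_dataframe(cls, store, key, **kwargs):
        # Public entry point; wraps the static read method (with the
        # retry loop discussed below).
        return cls._restore_dataframe(store, key, **kwargs)

    @staticmethod
    def _restore_dataframe(store, key, **kwargs):
        # The actual Parquet read, kept close to the old code so the
        # diff (and git blame) stays small.
        ...
```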
```python
                if nb_retry < (MAX_NB_RETRIES - 1):
                    time.sleep(BACKOFF_TIME * 2 ** nb_retry)

        raise ParquetReadError(
```
What's your reasoning behind introducing a dedicated class of exceptions? The information it contains is identical to the log message and the type of exception is not reused anywhere
The exception type describes the problem, and it being a subtype of `IOError` is appropriate imo. I'd rather be too specific here than too broad.
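For reference, such a dedicated exception can stay minimal; a sketch (the docstring wording is mine):

```python
class ParquetReadError(IOError):
    """Raised when reading a Parquet file fails even after retries."""
```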
```python
        filter_query: Optional[str] = None,
        columns: Optional[Iterable[str]] = None,
        predicate_pushdown_to_io: bool = True,
        categories: Optional[Iterable[str]] = None,
```
I think you need a stronger guarantee here than `Iterable`; after all, the first try may consume those iterators, leaving wrong arguments for the second try.
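A quick illustration of that hazard (hypothetical values):

```python
columns = (c for c in ["a", "b"])  # a generator is a valid Iterable[str]

print(list(columns))  # first attempt consumes it: ['a', 'b']
print(list(columns))  # a retry sees an exhausted iterator: []
```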
Probably `Sequence` is then a better candidate. It is very likely that this is not 100% properly done in many places.

Imho, it's not entirely clear, though. Afaik, the difference between `Sequence` and `Iterable` is that a sequence must define `__len__` and `__getitem__`, but that would exclude sets. From your comment I would judge we need something like "iterable but not generator", but I do not know which type that should be.
If you really need to support sets and lists, I think the right type is `Collection[str]`. OTOH, why not simply use something concrete here that reflects how it's used? For instance, `categories` needs to be a list already, as it's passed to an external function expecting a list. `columns` can't be a set today because some functions call `columns[:]` to copy it, which does not work for sets ...
If it is that explicit, I would argue it is even mandatory. Although for `[:]` we do not need a `List` but merely a `Sequence` (e.g. it works on tuples as well).

Using a sequence/iterable instead of an explicit list is a recommendation of mypy, since e.g. `Sequence` is covariant in its type variable while `List` isn't. It also makes it very explicit that the object is not mutated and allows for more compatible typing. Some documentation about this can be found at https://mypy.readthedocs.io/en/latest/generics.html#variance-of-generic-types

That all said, I'm fine either way. If mypy doesn't complain, it's good enough for me.
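To make the variance argument concrete, a minimal sketch of what mypy accepts and rejects (names are made up):

```python
from typing import List, Sequence

class Animal: ...
class Dog(Animal): ...

def count(animals: Sequence[Animal]) -> int:
    return len(animals)

def adopt(animals: List[Animal]) -> None:
    animals.append(Animal())

dogs: List[Dog] = [Dog()]
count(dogs)  # OK: Sequence is covariant, so List[Dog] is a Sequence[Animal]
adopt(dogs)  # mypy error: List is invariant, List[Dog] is not List[Animal]
```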
For my own curiosity:

```python
In [3]: from collections.abc import Iterable, Sequence

In [4]: isinstance([1, 2], Sequence)
Out[4]: True

In [5]: isinstance((1, 2), Sequence)
Out[5]: True

In [6]: isinstance((1, 2), Iterable)
Out[6]: True

In [7]: isinstance({1, 2}, Iterable)
Out[7]: True

In [8]: isinstance({1, 2}, Sequence)
Out[8]: False

In [9]: isinstance((x for x in [1, 2]), Iterable)
Out[9]: True

In [10]: isinstance((x for x in [1, 2]), Sequence)
Out[10]: False
```

generators are not sequences (makes sense: no `__len__`, no `__getitem__`, ...)
In my personal opinion, this should be `set[str]`, the logical data type, but that would break the tests and is not really the "paradigm" we're using here anyways. I'd also rather have the type be explicit than rely on duck typing.

I just copied the original function signature; we can probably have this debate somewhere else.
As I said, `set` is wrong. Why not just change it to `list`?
```python
        categories: Optional[Iterable[str]] = None,
        predicates: Optional[PredicatesType] = None,
        date_as_object: bool = False,
    ) -> pd.DataFrame:
```
There should be some explanation here as to why there is a retry, especially why there is a retry on AssertionErrors, which is typically considered not a best practice and dangerous: AssertionErrors usually point to logic errors, and logic errors can leave the program in an invalid state which is still present for the retry. Therefore, the usual argument is to simply abort on assertion errors. So it's worth explaining why we deviate from this here.
I will just change the underlying error to IOError since that's more appropriate than AssertionError
Added a comment and changed the BlockBuffer error to an error type inheriting from `IOError`.

I'm fine with the changes here. Still, implementing a retry on this abstraction level really doesn't sit well with me. Can we at least create one GitHub issue with the Parquet AssertionError we've seen? Afaik, we do not have any public breadcrumbs at all, and I believe it helps justify this retry if the issue is then linked in the code.
```python
        # XXX: We have been seeing weird `IOError` while reading Parquet files from Azure Blob Store,
        # thus, we implement retries at this point. This code should not live forever, it should be
        # removed once the underlying cause has been resolved.
        for nb_retry in range(MAX_NB_RETRIES):
```
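Pieced together from the fragments quoted in this thread, the whole retry wrapper would look roughly like this sketch (the constants' values, the log message, and the exception chaining are my assumptions; `_restore_dataframe` stands for the actual read implementation):

```python
import logging
import time

MAX_NB_RETRIES = 6  # assumed; yields the five backoffs 0.01-0.16 s from the PR description
BACKOFF_TIME = 0.01  # seconds, assumed

logger = logging.getLogger(__name__)


class ParquetReadError(IOError):
    """Raised when reading a Parquet file fails even after retries."""


def restore_dataframe(store, key, **kwargs):
    for nb_retry in range(MAX_NB_RETRIES):
        try:
            return _restore_dataframe(store, key, **kwargs)
        except OSError as err:  # IOError is an alias of OSError in Python 3
            saved_err = err
            logger.warning("Failed to read %s, retrying", key, exc_info=True)
            # Exponential backoff after every failed attempt except the last.
            if nb_retry < (MAX_NB_RETRIES - 1):
                time.sleep(BACKOFF_TIME * 2 ** nb_retry)
    raise ParquetReadError(
        f"Failed to restore dataframe after {MAX_NB_RETRIES} attempts"
    ) from saved_err
```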
Is there a reason you implemented the retry logic yourself? I would suggest using an established library like tenacity. It also has nice features such as being able to override the wait function in tests.

Also, I think you should add tests for the retry logic.
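For comparison, a tenacity-based version might look like this sketch (parameter values mirror the hand-rolled loop; the exact wait sequence depends on `wait_exponential`'s semantics):

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(OSError),  # only retry I/O failures
    stop=stop_after_attempt(6),              # cf. MAX_NB_RETRIES
    wait=wait_exponential(multiplier=0.01),  # cf. BACKOFF_TIME
    reraise=True,                            # raise the last error, not a RetryError
)
def restore_dataframe(store, key, **kwargs):
    return _restore_dataframe(store, key, **kwargs)
```

In tests, the waiting could then be disabled without touching production values, e.g. via `restore_dataframe.retry.wait = tenacity.wait_none()`.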
I am very strongly against exposing this as some kind of configuration. Current investigations hint towards a bug in our azure-storage interface and/or the iobuffer interface.

Considering our simple requirement, I am in favour of this approach rather than introducing tenacity as a core dependency.
Fair enough. But especially with the self-baked retries, I think tests are important because the retry now contains untested code logic.
Fair point about the testing. I would suggest testing this by patching the `_restore_dataframe` method to raise once but succeed on the second try. While such a test would be very close to the implementation, I believe that is justified in this case.
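A sketch of that test with unittest.mock (the import path follows the file under review; the call signature and dummy arguments are assumptions):

```python
from unittest import mock

import pandas as pd

from kartothek.serialization._parquet import ParquetSerializer


def test_restore_dataframe_retries_once_on_oserror():
    expected = pd.DataFrame({"a": [1]})
    # Fail the first call, succeed on the second.
    flaky = mock.Mock(side_effect=[OSError("transient blob store error"), expected])
    with mock.patch.object(ParquetSerializer, "_restore_dataframe", flaky), \
            mock.patch("time.sleep"):  # don't actually back off in tests
        result = ParquetSerializer.restore_dataframe(store=None, key="part.parquet")
    assert flaky.call_count == 2
    pd.testing.assert_frame_equal(result, expected)
```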
We expect the following behavior:

- Retry on `OSError` up to `MAX_NB_RETRIES` times, or fewer on success. Log on each retry.
- Don't retry on other exceptions.
- Raise `ParquetReadError` after `MAX_NB_RETRIES`.
- Retry waits use exponential backoff.

I think all of these could be individual tests.
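For instance, the backoff sequence and the final error could be checked like this (a sketch; it assumes the constants and `ParquetReadError` are importable from the module):

```python
from unittest import mock

import pytest

from kartothek.serialization._parquet import (
    BACKOFF_TIME,
    MAX_NB_RETRIES,
    ParquetReadError,
    ParquetSerializer,
)


def test_backoff_and_final_error():
    failing = mock.Mock(side_effect=OSError("boom"))
    with mock.patch.object(ParquetSerializer, "_restore_dataframe", failing), \
            mock.patch("time.sleep") as fake_sleep:
        with pytest.raises(ParquetReadError):
            ParquetSerializer.restore_dataframe(store=None, key="part.parquet")
    assert failing.call_count == MAX_NB_RETRIES
    # One exponential backoff after every failed attempt except the last.
    waits = [c.args[0] for c in fake_sleep.call_args_list]
    assert waits == [BACKOFF_TIME * 2 ** n for n in range(MAX_NB_RETRIES - 1)]
```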
My 5 cents:
- I am all for keeping it simple and stupid and will keep the current retry approach -- especially because we should get rid of it once we can confirm the root cause.
- Tests +1, the how is unimportant, the behaviour that Stephan mentioned should probably be captured.
Retries in this setup go from 10 ms to 160 ms (specifically: 0.01, 0.02, 0.04, 0.08, 0.16 seconds).

This supersedes the first commit from #397.
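Assuming `BACKOFF_TIME = 0.01`, those values fall directly out of the backoff formula from the diff:

```python
>>> BACKOFF_TIME = 0.01  # seconds, assumed
>>> [BACKOFF_TIME * 2 ** nb_retry for nb_retry in range(5)]
[0.01, 0.02, 0.04, 0.08, 0.16]
```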