add retries for ParquetSerializer.restore_dataframe #401

Closed

Conversation

lr4d (Collaborator) commented Feb 4, 2021

Retries in this setup go from 10 ms to 160 ms (specifically: 0.01, 0.02, 0.04, 0.08, 0.16 seconds).
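As a minimal sketch of that schedule (MAX_NB_RETRIES = 6 is inferred from the five sleep intervals listed above; the constant names follow the diff shown below):

BACKOFF_TIME = 0.01  # seconds, base wait
MAX_NB_RETRIES = 6   # assumed: five sleeps between six attempts

schedule = [BACKOFF_TIME * 2 ** nb_retry for nb_retry in range(MAX_NB_RETRIES - 1)]
print(schedule)  # [0.01, 0.02, 0.04, 0.08, 0.16]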

This supersedes the first commit from #397.

  • Closes #xxxx
  • Changelog entry

Co-authored-by: Nefta Kanilmaz <nefta.kanilmaz@gmail.com>
lr4d requested a review from fjetter February 4, 2021 10:28
lr4d mentioned this pull request Feb 4, 2021
Review threads on kartothek/serialization/_parquet.py (outdated, resolved):
@@ -195,6 +163,94 @@ def store(self, store, key_prefix, df):
return key


def _restore_dataframe(
Collaborator:

just a note regarding git diffs and git blame:

git has a much easier time figuring out the history if you do not change too much. Instead, you're moving this entire code section, i.e. we're effectively losing easy git blame.
This is mostly a style note, but I would prefer to define this as a staticmethod as well, to keep diffs small

Collaborator (author):

Made `_restore_dataframe` a static method and `restore_dataframe` a class method
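A rough sketch of the resulting structure (method bodies are placeholders, not the code from the diff):

class ParquetSerializer:
    @staticmethod
    def _restore_dataframe(store, key):
        # single read attempt; may raise an IOError
        return store.get(key)

    @classmethod
    def restore_dataframe(cls, store, key):
        # retry wrapper that delegates to the static method
        return cls._restore_dataframe(store, key)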

if nb_retry < (MAX_NB_RETRIES - 1):
    time.sleep(BACKOFF_TIME * 2 ** nb_retry)

raise ParquetReadError(
Collaborator:

What's your reasoning behind introducing a dedicated exception class? The information it contains is identical to the log message, and the exception type is not reused anywhere.

Collaborator (author):

The exception type describes the problem, and it being a subtype of IOError is appropriate imo. I'd rather be too specific here than too broad.
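For illustration, the hierarchy being defended here boils down to (a sketch, not the exact definition from the diff):

class ParquetReadError(IOError):
    """Raised when reading a Parquet file fails, possibly transiently."""

Callers that catch IOError keep working, while callers that want to handle exactly this failure can catch the narrower type.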

filter_query: Optional[str] = None,
columns: Optional[Iterable[str]] = None,
predicate_pushdown_to_io: bool = True,
categories: Optional[Iterable[str]] = None,
Contributor:

I think you need stronger guarantees here than Iterable; after all, the first try would consume those iterators, leaving wrong arguments for the second try.

Collaborator:

Probably Sequence is then a better candidate. It is very likely that this is not 100% properly done in many places.

Imho, it's not entirely clear, though. afaik, the difference between Sequence and Iterable is that a sequence must define __len__ and __getitem__, but that would exclude sets.

From your comment I would judge we need something like "iterable but not generator", but I do not know which type that should be.

Contributor:

If you really need to support sets and lists, I think the right type is Collection[str]. OTOH, why not simply use something concrete here that reflects how it's used?

For instance, categories needs to be a list already, as it's passed to an external function expecting a list.
columns can't be a set today because some functions call columns[:] to copy it, which does not work for sets ...

fjetter (Collaborator) commented Feb 5, 2021:

If it is that explicit, I would argue it is even mandatory. Although for [:] we do not need a List but merely a Sequence (e.g. it works on tuples as well).
Using a sequence/iterable instead of an explicit list is a recommendation of mypy since, e.g., Sequence is covariant in its type variables where a List isn't. It also makes it very explicit that the object is not mutated, and allows for more compatible typing.

Some documentation about this can be found here https://mypy.readthedocs.io/en/latest/generics.html#variance-of-generic-types

That all said, I'm fine either way. If mypy doesn't complain, it's good enough for me
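A small, hypothetical example of that variance point (checked by mypy; not code from this PR):

from typing import List, Sequence

class Animal: ...
class Dog(Animal): ...

def feed_sequence(animals: Sequence[Animal]) -> None: ...
def feed_list(animals: List[Animal]) -> None: ...

dogs: List[Dog] = [Dog()]
feed_sequence(dogs)  # ok: Sequence is covariant, so Sequence[Dog] is a Sequence[Animal]
feed_list(dogs)      # mypy error: List is invariant, List[Dog] is not a List[Animal]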

fjetter (Collaborator) commented Feb 5, 2021:

For my own curiosity

from collections.abc import Iterable, Sequence
In [4]: isinstance([1, 2], Sequence)
Out[4]: True

In [5]: isinstance((1, 2), Sequence)
Out[5]: True

In [6]: isinstance((1, 2), Iterable)
Out[6]: True

In [7]: isinstance({1, 2}, Iterable)
Out[7]: True

In [8]: isinstance({1, 2}, Sequence)
Out[8]: False

In [9]: isinstance((x for x in [1, 2]), Iterable)
Out[9]: True

In [10]: isinstance((x for x in [1, 2]), Sequence)
Out[10]: False

generators are not sequences (makes sense, len, getitem, ...)

Collaborator (author):

In my personal opinion, this should be set[str], the logical data type, but that would break the tests and is not really the "paradigm" we're using here anyway. I'd also rather have the type be explicit than rely on duck typing.

I just copied the original function signature; we can probably have this debate somewhere else.

Contributor:

As I said, set is wrong.
Why not just change it to list?

categories: Optional[Iterable[str]] = None,
predicates: Optional[PredicatesType] = None,
date_as_object: bool = False,
) -> pd.DataFrame:
Contributor:

There should be some explanation here as to why there is a retry, and especially why there is a retry on AssertionErrors, which is typically considered bad practice and dangerous: AssertionErrors usually point to logic errors, and logic errors can lead to an invalid program state that persists into the retry. Therefore, the usual argument is to simply abort on assertion errors. So it's worth explaining why we deviate from that here.

Collaborator (author):

I will just change the underlying error to IOError since that's more appropriate than AssertionError

Collaborator (author):

Added a comment and changed the BlockBuffer error to an error type inheriting from IOError

fjetter (Collaborator) commented Feb 5, 2021

I'm fine with the changes here. Still, implementing a retry at this abstraction level really doesn't sit well with me. Can we at least create a GitHub issue with the Parquet AssertionError we've seen? afaik, we do not have any public breadcrumbs at all, and I believe this would help justify the retry if the issue is then linked in the code.

# XXX: We have been seeing weird `IOError`s while reading Parquet files from the Azure Blob Store,
# thus we implement retries at this point. This code should not live forever; it should be removed
# once the underlying cause has been resolved.
for nb_retry in range(MAX_NB_RETRIES):
stephan-hesselmann-by (Collaborator) commented Feb 9, 2021:

Is there a reason you implemented the retry logic yourself? I would suggest using an established library like tenacity. It also has nice features such as being able to override the wait function in tests.

Also, I think you should add tests for the retry logic.
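For comparison, a sketch of what the same behavior could look like with tenacity (constants assumed from the PR description; the single-attempt reader is hypothetical):

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(OSError),  # don't retry other exceptions
    stop=stop_after_attempt(6),              # MAX_NB_RETRIES
    wait=wait_exponential(multiplier=0.01),  # 10 ms, 20 ms, 40 ms, ...
    reraise=True,                            # surface the last error to the caller
)
def restore_dataframe_with_retries(store, key):
    return _restore_dataframe(store, key)    # hypothetical single-attempt read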

Collaborator:

I am very strongly against exposing this as some kind of configuration. Current investigations hint towards a bug in our azure-storage interface and/or the IO buffer interface.

Considering our simple requirements, I am in favour of this approach so that we do not introduce tenacity as a core dependency.

Collaborator:

Fair enough. But especially with the self-baked retries I think tests are important, because the retry now contains untested logic.

Collaborator:

Fair point about the testing. I would suggest testing this by patching the _restore_dataframe method to raise once but succeed on the second try. While this test would be very close to the implementation, I believe that is justified in this case.

cc @NeroCorleone
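A sketch of that test with pytest's monkeypatch (the import path and the restore_dataframe signature are assumptions based on the diff above):

import pandas as pd

from kartothek.serialization import ParquetSerializer  # assumed import path

def test_restore_dataframe_retries_on_transient_error(monkeypatch):
    expected = pd.DataFrame({"a": [1]})
    attempts = []

    def flaky(*args, **kwargs):
        # raise once, then succeed -- the pattern suggested above
        attempts.append(1)
        if len(attempts) == 1:
            raise OSError("transient read failure")
        return expected

    monkeypatch.setattr(ParquetSerializer, "_restore_dataframe", staticmethod(flaky))
    monkeypatch.setattr("time.sleep", lambda seconds: None)  # skip the backoff in tests
    result = ParquetSerializer.restore_dataframe(store=None, key="part-0.parquet")
    assert len(attempts) == 2
    assert result.equals(expected)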

Collaborator:

We expect the following behavior:

  • Retry on OSError up to MAX_NB_RETRIES times, or fewer on success; log on each retry
  • Don't retry on other exceptions
  • Raise ParquetReadError after MAX_NB_RETRIES failed attempts
  • Retries use exponential backoff

I think all of these could be individual tests.
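Sketches for the non-retry and give-up cases from that list (again, import paths and constants are assumptions based on the diff):

import pytest

from kartothek.serialization import ParquetSerializer  # assumed import path
from kartothek.serialization._parquet import (          # assumed module path
    MAX_NB_RETRIES,
    ParquetReadError,
)

def test_no_retry_on_other_exceptions(monkeypatch):
    def broken(*args, **kwargs):
        raise ValueError("not an IO problem")

    monkeypatch.setattr(ParquetSerializer, "_restore_dataframe", staticmethod(broken))
    with pytest.raises(ValueError):
        ParquetSerializer.restore_dataframe(store=None, key="part-0.parquet")

def test_parquet_read_error_after_max_retries(monkeypatch):
    attempts = []

    def always_failing(*args, **kwargs):
        attempts.append(1)
        raise OSError("persistent read failure")

    monkeypatch.setattr(ParquetSerializer, "_restore_dataframe", staticmethod(always_failing))
    monkeypatch.setattr("time.sleep", lambda seconds: None)
    with pytest.raises(ParquetReadError):
        ParquetSerializer.restore_dataframe(store=None, key="part-0.parquet")
    assert len(attempts) == MAX_NB_RETRIES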

Contributor:

My 5 cents:

  • I am all for keeping it simple and stupid and would keep the current retry approach -- especially because we should get rid of it once we can confirm the root cause.
  • Tests +1; the how is unimportant, but the behaviour that Stephan mentioned should probably be captured.

NeroCorleone (Contributor):

@fjetter fyi #407
I'll reference it in the code.
