-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4593: Ensure Modin warns when setting columns via attributes #4621
Conversation
…ttributes Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start! Could you also check that this is working as mentioned in the issue by writing a small test? Also please don't forget to add your name to the release notes in docs/release_notes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add link to issue #4593 in "Resolves:" in PR description, and add note in release_notes
Codecov Report
@@ Coverage Diff @@
## master #4621 +/- ##
==========================================
+ Coverage 85.11% 89.70% +4.58%
==========================================
Files 230 231 +1
Lines 18438 18733 +295
==========================================
+ Hits 15694 16804 +1110
+ Misses 2744 1929 -815
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
modin/pandas/dataframe.py
Outdated
elif key not in dir(self): | ||
if key not in self: | ||
warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the logic here is leading to test failures in CI, and this was actually a bug we encountered before where we were improperly setting the attributes. If the key is in self and not in dir, we need to call __setitem__
and return so that we don't call the object.__setattr__
at the bottom of the call. For example, in the failing test case we have this scenario where we are adding a new column as an attribute:
def test___setattr__not_column():
pandas_df = pandas.DataFrame([1, 2, 3])
modin_df = pd.DataFrame([1, 2, 3])
pandas_df.new_col = [4, 5, 6]
modin_df.new_col = [4, 5, 6]
df_equals(modin_df, pandas_df)
The current code as is will check if key is not in dir(self), which is true and will also see that key is not in self. We will print the warning, but we will also call __setitem__
which is not what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is outdated now. See below.
modin/pandas/dataframe.py
Outdated
elif key in self and key not in dir(self): | ||
elif key not in dir(self) and key not in self and not is_list_like(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems some what of confusing logic to me. I think the way it was expressed the comment above somewhat clearly: if we have an attribute that already exists (column that already exists), we can go ahead and assign the new value. However, if that's not the case, we should promptly let the user know that we can't set new columns as an attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right. The current fix is as simple as replacing isinstance(value, pandas.Series)
with is_list_like(...)
.
Initially we thought that assigning a list to a non-existing column should raise a warning but still add a new column. It then turned out that the correct behavior is to store the list as a new attribute rather than a new column, which simplifies the logic a lot.
@@ -289,6 +290,28 @@ def test___setattr__mutating_column(): | |||
df_equals(modin_df, pandas_df) | |||
assert modin_df.col0.equals(modin_df["col0"]) | |||
|
|||
# Check taht adding a new col via attributes raises warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Check taht adding a new col via attributes raises warning | |
# Check that adding a new col via attributes raises warning |
self.__setitem__(key, value) | ||
# Note: return immediately so we don't keep this `key` as dataframe state. | ||
# `__getattr__` will return the columns not present in `dir(self)`, so we do not need | ||
# to manually track this state in the `dir`. | ||
return | ||
elif isinstance(value, pandas.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add an extra elif
case that checks to see if the value the user is trying to set is_list_like()
here and throw the warning in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think replacement is sufficient rather than having both checks, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacement is sufficient since is_list_like()
will always be True for pandas.Series
.
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
with warnings.catch_warnings(): | ||
warnings.simplefilter("error") | ||
modin_df.col1 = [5] | ||
modin_df.col0 = 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this bit do? Looks like this test is failing here in CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was originally suggested by @RehanSD. I think the idea is to make sure that these statements are not raising warnings. The test does pass locally using pytest. I'm not sure why it doesn't in CI. I removed one of the checks in the last commit, but it's still failing. I'm not yet sure why honestly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I think I might know why. I think the base execution engine is Ray. The test you are failing uses a base execution engine of python so it's failing because it throws a warning that "we are defaulting to pandas". You can try running this locally yourself: pytest -n 2 modin/pandas/test/dataframe/test_iter.py --execution=BaseOnPython
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fixed now by matching on the warning message.
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
assert ( | ||
"new_attr" not in modin_df | ||
), "New attribute was not correctly added to columns." | ||
assert modin_df.new_attr == 7, "Modin attribute value was set incorrectly." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be checking if it's equal to 6 not 7?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
…ng columns or creating a new scalar attribute Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
self.__setitem__(key, value) | ||
# Note: return immediately so we don't keep this `key` as dataframe state. | ||
# `__getattr__` will return the columns not present in `dir(self)`, so we do not need | ||
# to manually track this state in the `dir`. | ||
return | ||
elif isinstance(value, pandas.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacement is sufficient since is_list_like()
will always be True for pandas.Series
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just had one question
with warnings.catch_warnings(): | ||
warnings.filterwarnings( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we catching the warnings and filtering them out? Is this so ensure that only these warnings are thrown here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, only the warnings we care about will cause the test to fail if they were raised (they should not!). Other types of warnings we have no control over. They may or may not be raised (possibly depending on the environment), so we want to ignore them (filter them out).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, makes sense
# and adds the provided list as a new attribute and not a column. | ||
with pytest.warns( | ||
UserWarning, | ||
match="Modin doesn't allow columns to be created via a new attribute name - see", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning should be probably extended. "see ..." what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
with warnings.catch_warnings(): | ||
warnings.filterwarnings( | ||
action="error", | ||
message="Modin doesn't allow columns to be created via a new attribute name - see", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! @modin-project/modin-core please take a look and give a stamp of approval if all looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@helmeleegy, LGTM, thanks!
Signed-off-by: Hazem Elmeleegy hazem@ponder.io
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date