Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#4593: Ensure Modin warns when setting columns via attributes #4621

Merged
merged 13 commits into from
Jul 2, 2022

Conversation

helmeleegy
Copy link
Collaborator

@helmeleegy helmeleegy commented Jun 30, 2022

Signed-off-by: Hazem Elmeleegy hazem@ponder.io

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Modin should match pandas UserWarning for assigning columns via attributes #4593
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

…ttributes

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
@helmeleegy helmeleegy requested a review from a team as a code owner June 30, 2022 00:56
Copy link
Collaborator

@pyrito pyrito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start! Could you also check that this is working as mentioned in the issue by writing a small test? Also please don't forget to add your name to the release notes in docs/release_notes

modin/pandas/dataframe.py Outdated Show resolved Hide resolved
Copy link

@jeffreykennethli jeffreykennethli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add link to issue #4593 in "Resolves:" in PR description, and add note in release_notes

@codecov
Copy link

codecov bot commented Jun 30, 2022

Codecov Report

Merging #4621 (5ea1a05) into master (2de5c67) will increase coverage by 4.58%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4621      +/-   ##
==========================================
+ Coverage   85.11%   89.70%   +4.58%     
==========================================
  Files         230      231       +1     
  Lines       18438    18733     +295     
==========================================
+ Hits        15694    16804    +1110     
+ Misses       2744     1929     -815     
Impacted Files Coverage Δ
modin/pandas/dataframe.py 91.69% <100.00%> (+0.10%) ⬆️
...odin/core/storage_formats/pandas/query_compiler.py 96.16% <0.00%> (-0.03%) ⬇️
modin/pandas/groupby.py 93.68% <0.00%> (ø)
modin/experimental/batch/test/test_pipeline.py 100.00% <0.00%> (ø)
modin/pandas/base.py 95.30% <0.00%> (+0.08%) ⬆️
...mentations/pandas_on_ray/partitioning/partition.py 93.57% <0.00%> (+1.83%) ⬆️
...tations/pandas_on_python/partitioning/partition.py 93.75% <0.00%> (+2.08%) ⬆️
...entations/pandas_on_dask/partitioning/partition.py 91.46% <0.00%> (+2.43%) ⬆️
modin/pandas/__init__.py 69.69% <0.00%> (+3.03%) ⬆️
... and 24 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Comment on lines 2503 to 2505
elif key not in dir(self):
if key not in self:
warnings.warn(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the logic here is leading to test failures in CI, and this was actually a bug we encountered before where we were improperly setting the attributes. If the key is in self and not in dir, we need to call __setitem__ and return so that we don't call the object.__setattr__ at the bottom of the call. For example, in the failing test case we have this scenario where we are adding a new column as an attribute:

def test___setattr__not_column():
    pandas_df = pandas.DataFrame([1, 2, 3])
    modin_df = pd.DataFrame([1, 2, 3])
    pandas_df.new_col = [4, 5, 6]
    modin_df.new_col = [4, 5, 6]
    df_equals(modin_df, pandas_df)

The current code as is will check if key is not in dir(self), which is true and will also see that key is not in self. We will print the warning, but we will also call __setitem__ which is not what we want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is outdated now. See below.

elif key in self and key not in dir(self):
elif key not in dir(self) and key not in self and not is_list_like(value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems some what of confusing logic to me. I think the way it was expressed the comment above somewhat clearly: if we have an attribute that already exists (column that already exists), we can go ahead and assign the new value. However, if that's not the case, we should promptly let the user know that we can't set new columns as an attribute.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right. The current fix is as simple as replacing isinstance(value, pandas.Series) with is_list_like(...).

Initially we thought that assigning a list to a non-existing column should raise a warning but still add a new column. It then turned out that the correct behavior is to store the list as a new attribute rather than a new column, which simplifies the logic a lot.

@@ -289,6 +290,28 @@ def test___setattr__mutating_column():
df_equals(modin_df, pandas_df)
assert modin_df.col0.equals(modin_df["col0"])

# Check taht adding a new col via attributes raises warning
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Check taht adding a new col via attributes raises warning
# Check that adding a new col via attributes raises warning

self.__setitem__(key, value)
# Note: return immediately so we don't keep this `key` as dataframe state.
# `__getattr__` will return the columns not present in `dir(self)`, so we do not need
# to manually track this state in the `dir`.
return
elif isinstance(value, pandas.Series):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add an extra elif case that checks to see if the value the user is trying to set is_list_like() here and throw the warning in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think replacement is sufficient rather than having both checks, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacement is sufficient since is_list_like() will always be True for pandas.Series.

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Comment on lines 300 to 303
with warnings.catch_warnings():
warnings.simplefilter("error")
modin_df.col1 = [5]
modin_df.col0 = 6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this bit do? Looks like this test is failing here in CI

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally suggested by @RehanSD. I think the idea is to make sure that these statements are not raising warnings. The test does pass locally using pytest. I'm not sure why it doesn't in CI. I removed one of the checks in the last commit, but it's still failing. I'm not yet sure why honestly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think I might know why. I think the base execution engine is Ray. The test you are failing uses a base execution engine of python so it's failing because it throws a warning that "we are defaulting to pandas". You can try running this locally yourself: pytest -n 2 modin/pandas/test/dataframe/test_iter.py --execution=BaseOnPython

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fixed now by matching on the warning message.

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
assert (
"new_attr" not in modin_df
), "New attribute was not correctly added to columns."
assert modin_df.new_attr == 7, "Modin attribute value was set incorrectly."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be checking if it's equal to 6 not 7?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
…ng columns or creating a new scalar attribute

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
naren-ponder
naren-ponder previously approved these changes Jul 1, 2022
self.__setitem__(key, value)
# Note: return immediately so we don't keep this `key` as dataframe state.
# `__getattr__` will return the columns not present in `dir(self)`, so we do not need
# to manually track this state in the `dir`.
return
elif isinstance(value, pandas.Series):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacement is sufficient since is_list_like() will always be True for pandas.Series.

pyrito
pyrito previously approved these changes Jul 1, 2022
Copy link
Collaborator

@pyrito pyrito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just had one question

Comment on lines +301 to +302
with warnings.catch_warnings():
warnings.filterwarnings(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we catching the warnings and filtering them out? Is this so ensure that only these warnings are thrown here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only the warnings we care about will cause the test to fail if they were raised (they should not!). Other types of warnings we have no control over. They may or may not be raised (possibly depending on the environment), so we want to ignore them (filter them out).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, makes sense

# and adds the provided list as a new attribute and not a column.
with pytest.warns(
UserWarning,
match="Modin doesn't allow columns to be created via a new attribute name - see",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning should be probably extended. "see ..." what?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

with warnings.catch_warnings():
warnings.filterwarnings(
action="error",
message="Modin doesn't allow columns to be created via a new attribute name - see",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
@helmeleegy helmeleegy dismissed stale reviews from pyrito and naren-ponder via 79f67c2 July 1, 2022 22:58
Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Copy link
Collaborator

@pyrito pyrito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @modin-project/modin-core please take a look and give a stamp of approval if all looks good!

Copy link
Collaborator

@YarShev YarShev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@helmeleegy, LGTM, thanks!

@YarShev YarShev merged commit 82746b9 into modin-project:master Jul 2, 2022
YarShev pushed a commit that referenced this pull request Sep 6, 2022
)

Signed-off-by: Hazem Elmeleegy <hazem@ponder.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Modin should match pandas UserWarning for assigning columns via attributes
5 participants