Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions to compare Column objects with iterable references and to compare DataFrame objects with mapping references #66

Merged
merged 12 commits into from
Jan 24, 2024

Conversation

anmyachev
Copy link
Contributor

@anmyachev anmyachev commented Jan 5, 2024

The changes are aimed at getting rid of the use of the interchange_to_pandas function, so that the tests were implementation independent.

So far the new functions have only been applied to tests\column folder.

@anmyachev anmyachev force-pushed the compare-columns branch 2 times, most recently from 00e34a6 to 91c4f31 Compare January 5, 2024 01:44
@MarcoGorelli
Copy link
Contributor

nice idea! do we want to check the data type too?

@anmyachev anmyachev force-pushed the compare-columns branch 8 times, most recently from d05595f to a88360c Compare January 6, 2024 02:09
@@ -75,6 +75,9 @@ ignore = [
[tool.ruff.isort]
force-single-line = true

[tool.black]
line-length = 90
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To sync with pre-commit.

@@ -14,9 +14,8 @@
import dataframe_api_compat.pandas_standard
import dataframe_api_compat.polars_standard

DType = TypeVar("DType")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks unused, can return if needed.

@anmyachev anmyachev force-pushed the compare-columns branch 2 times, most recently from 4a20813 to 55944ce Compare January 7, 2024 00:20


def test_column_sorted_indices_ascending(library: str) -> None:
df = integer_dataframe_6(library).persist()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted .persist() call in several places, since the same call occurs in new comparison functions, which generates warnings, but due to the repository settings - errors. If this is incorrect, then we need a public way to check the ._is_persisted field, so as not to call the method several times.

pd.testing.assert_frame_equal(result_pd, expected)
expected = {"a": [1, 2, 3], "b": [4, 5, 6], "result": [1.0, 32.0, 729.0]}
expected_dtype = {"a": ns.Int64, "b": ns.Int64, "result": ns.Float64}
compare_dataframe_with_reference(result, expected, expected_dtype) # type: ignore[arg-type]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t know exactly why in some places mypy gives an error that has to be turned off, because it is a false positive. The first thing that catches my eye is that the lists inside the dictionaries have different types, for example int and float (not a homogeneous type).

@@ -104,12 +104,15 @@ def map_pandas_dtype_to_standard_dtype(dtype: Any) -> DType:
return Namespace.Float32()
if dtype == "Float32":
return Namespace.Float32()
if dtype == "bool":
if dtype in ("bool", "boolean"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discovered it by accident while experimenting. It is possible that this is no longer necessary for the current changes.

@@ -35,6 +35,7 @@
"UInt16": "uint16",
"UInt8": "uint8",
"boolean": "bool",
"Float64": "float64",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also discovered by accident, it seems that the float type was missing, but if it was done on purpose, I can try to redo it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i probably just forgot it - let's add float32 too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@anmyachev anmyachev changed the title Add function to compare Column objects with iterable references Add functions to compare Column objects with iterable references and to compare DataFrame objects with mapping references Jan 7, 2024
@anmyachev anmyachev marked this pull request as ready for review January 7, 2024 00:56
@anmyachev
Copy link
Contributor Author

@MarcoGorelli ready for review :)

@anmyachev
Copy link
Contributor Author

@MarcoGorelli friendly ping :)

A little information for context, after I manage to rewrite the tests in a backend-independent manner, I will try to integrate Modin into your repository. Such preliminary changes are necessary to avoid code duplication.

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

sorry it took a while to get to

just got two minor comments, but this is great

@@ -35,6 +35,7 @@
"UInt16": "uint16",
"UInt8": "uint8",
"boolean": "bool",
"Float64": "float64",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i probably just forgot it - let's add float32 too?

Comment on lines 114 to 115
if not hasattr(dtype, "startswith"):
dtype = str(dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to do this in a less hacky way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can try to use name attribute if it exists.

@anmyachev
Copy link
Contributor Author

@MarcoGorelli there are new deprecation warnings from new polars release:

FAILED tests/groupby/aggregate_test.py::test_aggregate[polars-lazy] - DeprecationWarning: `pl.count()` is deprecated. Please use `pl.len()` instead.
FAILED tests/groupby/aggregate_test.py::test_aggregate_only_size[polars-lazy] - DeprecationWarning: `pl.count()` is deprecated. Please use `pl.len()` instead.
FAILED tests/groupby/size_test.py::test_group_by_size[polars-lazy] - DeprecationWarning: `count` is deprecated. It has been renamed to `len`.

What should I do in this case?

@MarcoGorelli
Copy link
Contributor

easiest thing would be to address that in a separate PR, and to set the new polars release as the minimum version (polars it moving quite fast so backwards compatibility is less of a concern there)

@anmyachev
Copy link
Contributor Author

anmyachev commented Jan 23, 2024

Blocked by #68

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@@ -19,7 +21,7 @@ def test_scale_column_pandas() -> None:


@pytest.mark.skipif(
tuple(int(v) for v in pl.__version__.split(".")) < (0, 19, 0),
parse(pl.__version__) < Version("0.19.0"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will help the tests work with release candidates, such as polars==0.20.6rc1

@anmyachev
Copy link
Contributor Author

@MarcoGorelli ready for review

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @anmyachev !

@MarcoGorelli MarcoGorelli merged commit 107969c into data-apis:main Jan 24, 2024
13 checks passed
@anmyachev
Copy link
Contributor Author

thanks for the review @MarcoGorelli!

@anmyachev anmyachev deleted the compare-columns branch January 24, 2024 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants