Add allow_nan_equality option to assert_approx_df_equality #29

mitches-got-glitches · 2021-04-26T16:36:31Z

This pull request solves the issue by making a fairly big change to the API. Now, rather than having two assertion functions for both DataFrame and column comparison, there is just one for each. Both these assertion functions have an optional parameter for precision, meaning the assert_approx_df_equality and assert_approx_column_equality could be removed.

In addition to solving the ticket, I feel like I made a number of improvements to the code base:

Added some useful spacing
Improved the type hinting
Documented the functions with docstrings using the numpy docstring convention
Added CHANGELOG.md to be updated with each release

Actions for @MrPowers :

Review and confirm happy with API changes
Review renaming of exception to RowsNotEqualError and consider any implications
Update the screenshots in the README.md to reflect the new changes

When making a new release:

Bump to a new minor release, and use any other means to notify users of API change
Replace "Unreleased" in the CHANGELOG.md with the new version number and the date released.
Add the Changelog notes to the Release Notes on GitHub once all tagged.

Let me know what you think!

Closes #28

The objective of the issue was to make allow_nan_equality available to assert_approx_df_equality, but on reviewing the code it seemed like a more efficient plan to include the precision option in assert_df_equality and make it optional. This results in only one set of functions needed to be maintained. I've moved the single assert_rows_equality function into row_comparer along with the custom exception which I've renamed. I've made one generic equality function which takes options. All the tests have been updated although there could be scope for a few more new tests - I wanted to consult on that first. This is a fairly large change to the API for which a major version update would be recommended.

Changed the section on assert approx df equality.

I've removed assert_approx_column_equality from the API.

MrPowers

Is there some part where you call dtypes or schema to figure out which columns are floating / decimal point (and need approximate comparison) and which columns need exact equality comparison?

This refactoring looks promising. I need to understand more about if it causes any Python version limitations, backwards incompatibility issues, or performance drains compared to what we currently have. It's certainly more elegant, but this lib is already in production for a variety of users and want to make sure it's not going to be a breaking change for anyone. Thanks!

chispa/number_helpers.py

MrPowers · 2021-05-11T15:56:52Z

chispa/number_helpers.py

+
+    """
+    both_floats = (isinstance(x, float) & isinstance(y, float))


I don't think we need to do this cause we know the types from the schema.

Hi @MrPowers , can you have a look at my draft pull request here to see if I'm going along the right lines with your 3rd objective?

To summarise, rather than the following line in check_equal:

both_floats = (isinstance(x, float) & isinstance(y, float))

I'm doing this:

is_float_type = (dtype_name in ['float', 'double', 'decimal'])

Where dtype gets passed into the function, and is created with the following line:

dtypes = [field.dataType.typeName() for field in df1.schema]

Because we've already compared the schemas, we know that they're equal so we just use df1.schema.

Will using isinstance really have a noticeable effect on the speed of the tests? I feel like it might be the safer option.

I just did some further testing too, and my current solution would not work for pyspark columns with DecimalType, as the python type is decimal.Decimal not float. I'd need to change that if the original implementation is selected.

Any time to give me a steer this week @MrPowers ? As in point me in the right direction for the solution (sorry realised giving a steer might be a localised phrase) - where are you from anyway?

MrPowers · 2021-05-11T18:00:04Z

@mitches-got-glitches - thanks for following up and sorry for the delay.

Here are my main objectives with this PR:

Get your new functionality included in the lib cause it's clearly an improvement
No backwards incompatible changes (this lib follows semantic versioning, so we could technically make backwards incompatible changes cause we haven't made a major release yet, but I'd rather just avoid breaking changes, like Spark does)
Keep these functions as fast as possible (cause Spark tests can get really slow). Don't want to unnecessarily do type checking and slow functions down.
Support all the Python versions that we should reasonably be supporting. @nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Looking forward to driving this towards completion & appreciate your persistence! BTW - I'm a Python noob, so you're teaching me a lot!

nchammas · 2021-05-11T18:10:19Z

@nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.

mitches-got-glitches · 2021-05-12T12:46:48Z

@nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.

Since the Python typing module is available in >=3.5 does this resolve your earlier comment @MrPowers ?

MrPowers · 2021-05-12T13:08:31Z

Looks like Spark 2.4.5 technically supports Python 3.4+ and I'd still like to support the Spark 2 users for a bit longer. That said, seems reasonable to set the Python version as >=3.5 for this project. Python 3.5 was released in September 2015 and think most users will at least be using Python 3.5, even the ones that are still using Spark 2.

Sidenote, when chispa eventually drops support for PySpark 2, we should do a major release (i.e. bump chispa to 1.x).

I will bump the Python version in the project now and add a section to explain this to users.

mitches-got-glitches · 2021-05-12T13:14:22Z

To add another data point, my team is currently using Python 3.6.8 and Spark 2.4.

MrPowers · 2021-05-12T13:33:12Z

@mitches-got-glitches - great, glad we're still supporting PySpark 2.4 😉

Updated the required Python version.

So yes, the typing module question is resolved. We're requiring Python 3.5+ now, so using typing should be just fine 😃

MrPowers · 2021-05-12T14:57:13Z

@mitches-got-glitches - BTW, see here for the chispa download statistics. If this lib didn't have any users yet, then I wouldn't really care about breaking API changes. Since there are a lot of users, that's why I'm also focusing on backwards compatibility. A blessing and a curse I suppose 😉

mitches-got-glitches · 2021-05-12T16:17:54Z

Yeah, those are some good numbers! I think what you've set out to do is exactly right with the change to the readme and intention to launch 1.x. Users can always open a ticket for backwards compatibility if it's really necessary, or even create their own fork. But I agree, try not to get on their bad side ha.

So hopefully, once the thread above is resolved, then I think all that will be required is an update to the screenshots in the readme if you're happy with everything else @MrPowers ?

MrPowers · 2021-05-21T12:08:22Z

@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.

My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?

mitches-got-glitches · 2021-06-08T17:15:14Z

@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.

My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?

Hello, sorry for the slow reply. I think in terms of backwards compatibility there were two things:

The type hinting that requires Python 3.5+ (this was resolved in the comments above)
Backwards capability for the API (retaining assert_approx_df_equality and equivalent assert_approx_column_equality)

I believe this can be done by doing something like the following with a deprecation warning so users know to switch their code base over to assert_df_equality in future versions:

def assert_approx_df_equality(*args, **kwargs):
    assert_df_equality(*args, **kwargs)

Does that look like a good idea?

There are two more outstanding things I need your help on before this can come out of draft status:

Firstly, an answer to the replies to your comment above.
Updating the screenshots in the README.md to reflect the new changes

Are you able to support me with those @MrPowers ?

mitches-got-glitches added 12 commits April 26, 2021 13:59

Made use of blue function in row/column comparer modules

b94d4d5

Spelling fix

e74368e

Update README examples to reflect assert_df_equality changes

5fe032a

Changed the section on assert approx df equality.

Add changes to asserting approx df equality to Changelog

958b19c

Change func name equality to check_equal

34fd432

Change assert_column_equality to accept optional precision param

c3b907c

I've removed assert_approx_column_equality from the API.

Remove assert_approx_df_equality from the API

147bb57

Update README to reflect changes to assert_column_equality

3ff1f18

Update Changelog with assert_column_equality changes

47f900a

Remove test run code at bottom of column_comparer

461c978

Remove some unnecessary comments

0533904

mitches-got-glitches marked this pull request as ready for review April 27, 2021 08:50

mitches-got-glitches marked this pull request as draft April 27, 2021 08:50

MrPowers reviewed May 3, 2021

View reviewed changes

chispa/number_helpers.py Show resolved Hide resolved

mitches-got-glitches mentioned this pull request May 10, 2021

Add support for NaN equality within Arrays #27

Open

MrPowers reviewed May 11, 2021

View reviewed changes

bweisberg mentioned this pull request Dec 17, 2021

DataFramesNotEqualError when dataframes appear identical #37

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add allow_nan_equality option to assert_approx_df_equality #29

Add allow_nan_equality option to assert_approx_df_equality #29

mitches-got-glitches commented Apr 26, 2021

MrPowers left a comment

MrPowers May 11, 2021

mitches-got-glitches May 12, 2021

mitches-got-glitches May 12, 2021

mitches-got-glitches May 17, 2021 •

edited

Loading

MrPowers commented May 11, 2021

nchammas commented May 11, 2021

mitches-got-glitches commented May 12, 2021

MrPowers commented May 12, 2021

mitches-got-glitches commented May 12, 2021

MrPowers commented May 12, 2021

MrPowers commented May 12, 2021

mitches-got-glitches commented May 12, 2021 •

edited

Loading

MrPowers commented May 21, 2021

mitches-got-glitches commented Jun 8, 2021


		"""
		both_floats = (isinstance(x, float) & isinstance(y, float))

Add allow_nan_equality option to assert_approx_df_equality #29

Are you sure you want to change the base?

Add allow_nan_equality option to assert_approx_df_equality #29

Conversation

mitches-got-glitches commented Apr 26, 2021

MrPowers left a comment

Choose a reason for hiding this comment

MrPowers May 11, 2021

Choose a reason for hiding this comment

mitches-got-glitches May 12, 2021

Choose a reason for hiding this comment

mitches-got-glitches May 12, 2021

Choose a reason for hiding this comment

mitches-got-glitches May 17, 2021 • edited Loading

Choose a reason for hiding this comment

MrPowers commented May 11, 2021

nchammas commented May 11, 2021

mitches-got-glitches commented May 12, 2021

MrPowers commented May 12, 2021

mitches-got-glitches commented May 12, 2021

MrPowers commented May 12, 2021

MrPowers commented May 12, 2021

mitches-got-glitches commented May 12, 2021 • edited Loading

MrPowers commented May 21, 2021

mitches-got-glitches commented Jun 8, 2021

mitches-got-glitches May 17, 2021 •

edited

Loading

mitches-got-glitches commented May 12, 2021 •

edited

Loading