Hypothesis tests for roundtrip to & from pandas #3285
Conversation
This looks great! I'll let someone who knows hypothesis better do a full review. Thanks for submitting @takluyver!
This looks good to me! Some comments below with config/performance/test tips, but IMO it could easily be merged as-is too 😄
properties/conftest.py
from hypothesis import settings

# Run for a while - arrays are a bigger search space than usual
settings.register_profile("ci", deadline=None)
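For context, registering a profile only takes effect once it is loaded. A minimal sketch of the full conftest.py, assuming the CI configuration exports HYPOTHESIS_PROFILE=ci (that activation mechanism is an assumption, not shown in this diff):

import os

from hypothesis import settings

# Run for a while - arrays are a bigger search space than usual
settings.register_profile("ci", deadline=None)

# Assumed activation: CI jobs export HYPOTHESIS_PROFILE=ci; local runs
# fall back to Hypothesis's built-in "default" profile.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "default"))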
df.columns.name = "cols"
arr = xr.DataArray(df)
roundtripped = arr.to_pandas()
pd.testing.assert_frame_equal(df, roundtripped)
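For context, a sketch of how this assertion might sit inside a complete property test. The strategy and names here are illustrative, not necessarily the PR's exact code:

import hypothesis.extra.numpy as npst
import pandas as pd
import xarray as xr
from hypothesis import given

# 2-D float arrays with at least one row and one column.
arrays_2d = npst.arrays(
    dtype=npst.floating_dtypes(),
    shape=npst.array_shapes(min_dims=2, max_dims=2, min_side=1, max_side=10),
)

@given(arrays_2d)
def test_roundtrip_dataframe(values):
    df = pd.DataFrame(values)
    # Name both axes so the roundtrip through xarray preserves them.
    df.index.name = "rows"
    df.columns.name = "cols"
    arr = xr.DataArray(df)
    roundtripped = arr.to_pandas()
    pd.testing.assert_frame_equal(df, roundtripped)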
properties/test_pandas_roundtrip.py
n_entries = draw(st.integers(min_value=0, max_value=100))
dims = ("rows",)
vars = {}
for _ in range(n_vars):
This pattern - draw a number, then draw that many elements - is tempting, but tends to be inefficient when Hypothesis tries to minimise any failures.
The alternative, which we recommend, is to generate collections using the st.lists() strategy - that way Hypothesis will be able to operate in terms of elements of the list.
In this case it's probably only worth doing so for either the vars or the entries dimension, keeping the other as-is. If you're keen to do both, it's complicated enough that I'd just fall back on the Hypothesis pandas extension and .map(pd.DataFrame.to_xarray) over the result 😅
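To illustrate the difference (hypothetical strategies, not code from this PR):

import hypothesis.strategies as st

@st.composite
def draw_then_loop(draw):
    # Anti-pattern: Hypothesis can only shrink failures by reducing `n`
    # and re-drawing, so minimal examples are found slowly.
    n = draw(st.integers(min_value=0, max_value=100))
    return [draw(st.floats(allow_nan=False)) for _ in range(n)]

# Preferred: Hypothesis can delete and shrink individual elements directly.
floats_list = st.lists(st.floats(allow_nan=False), min_size=0, max_size=100)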
I'm not sure how to do this, because in both dimensions I want to generate multiple things of the same length - the same number of names and arrays for the vars dimension, and the same number of entries in each array for the entries dimension. If I naively generate lists, they'll have different lengths.
Is it better to generate one such thing with the lists strategy, and then make the others match its length, rather than generating a number to use as the length for all of them? Or is there some overall cleverer way that I'm not seeing?
You could draw indices before the loop, then draw a list of (name, array) tuples. Or in this case you could use st.dictionaries() to, well, generate a list of key-value tuples internally.
The other nice trick would be to draw your index first, and use its length - deleting elements from that will be slightly more efficient than shrinking the n_entries parameter.
Putting it all together, I'd write
idx = draw(pdst.indexes(dtype="u8", min_size=0, max_size=100))
vars_strat = st.dictionaries(
    keys=st.text(),
    values=npst.arrays(dtype=numeric_dtypes, shape=len(idx)).map(
        partial(xr.Variable, ("rows",))
    ),
    min_size=1,
    max_size=3,
)
return xr.Dataset(draw(vars_strat), coords={"rows": idx})
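Expanding that suggestion into a self-contained sketch (the imports, the numeric_dtypes strategy, and the min_size argument are assumptions for illustration; the PR's actual code may differ):

from functools import partial

import hypothesis.extra.numpy as npst
import hypothesis.extra.pandas as pdst
import hypothesis.strategies as st
import pandas as pd
import xarray as xr
from hypothesis import given

# Assumed to be defined in the test module: a strategy for numeric dtypes.
numeric_dtypes = st.one_of(
    npst.unsigned_integer_dtypes(), npst.integer_dtypes(), npst.floating_dtypes()
)

@st.composite
def datasets_1d_vars(draw, min_size=1):
    # Draw the index first; its length fixes the length of every variable,
    # and deleting index elements shrinks better than a length parameter.
    idx = draw(pdst.indexes(dtype="u8", min_size=min_size, max_size=100))
    vars_strat = st.dictionaries(
        keys=st.text(),
        values=npst.arrays(dtype=numeric_dtypes, shape=len(idx)).map(
            partial(xr.Variable, ("rows",))
        ),
        min_size=1,
        max_size=3,
    )
    return xr.Dataset(draw(vars_strat), coords={"rows": idx})

@given(datasets_1d_vars())
def test_roundtrip_dataset(ds):
    df = ds.to_dataframe()
    assert isinstance(df, pd.DataFrame)
    xr.testing.assert_identical(ds, xr.Dataset.from_dataframe(df))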
Thanks, that does look neater!
As in my other PR, one suggested addition causes a test failure, and I've put that in the last commit.
Following suggestions from @Zac-HD
Hello @takluyver! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-10-30 10:01:16 UTC
This seems so close - could we fix the test (maybe that's just a merge of master?) and merge?
Merged master, crossing fingers that fixes it.
Nope. I don't understand the error, though it looks like astropy has had something similar: astropy/astropy#6424. Also, black is now failing on a number of files not affected here.
I think the failures are because hypothesis isn't being installed on all CI environments?
@@ -10,15 +10,10 @@
import hypothesis.extra.numpy as npst
These may need to be guarded too, using pytest.importorskip perhaps? @max-sixty what do you think?
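A minimal sketch of that guard (assuming a module-level skip is acceptable here):

import pytest

# Skip this whole test module, rather than erroring, when Hypothesis
# is not installed in the current CI environment.
pytest.importorskip("hypothesis")

import hypothesis.extra.numpy as npst  # noqa: E402
import hypothesis.strategies as st  # noqa: E402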
Aha, I was being distracted by the other errors around the real one. Let's see if the latest commit helps.
OK, looks like the test failure is now real. Let me know if you want me to comment out the relevant line so the tests pass.
You're right @takluyver. It looks like hypothesis tests are running in the normal test suites. Anyone know offhand why that is? e.g. https://dev.azure.com/xarray/xarray/_build/results?buildId=1284 (that doesn't solve the test failure, though)
If we want to merge a subset of the tests then that's fine. Ofc even better if we can use these tests to find & fix the errors.
In my experience it's better to open an issue, add an xfail decorator to the test, and merge the tests PR. Otherwise the initial PR can take a very long time and no other property-based tests get added. In this case I'd duplicate the test, so there's one which does not allow empty dataframes and one (xfailing) which does. It's also likely that the person who found the bug is not the best person to fix it, and requiring that they do so in order to merge a useful test just disincentivises testing!
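A sketch of that suggested split, reusing the hypothetical datasets_1d_vars strategy sketched earlier in this thread (its min_size argument is an illustrative assumption):

import pytest
import xarray as xr
from hypothesis import given

@given(datasets_1d_vars(min_size=1))
def test_roundtrip_dataset(ds):
    # Strict version: empty frames are excluded, so this should pass.
    xr.testing.assert_identical(ds, xr.Dataset.from_dataframe(ds.to_dataframe()))

@pytest.mark.xfail(reason="empty frames do not roundtrip; see the follow-up issue")
@given(datasets_1d_vars(min_size=0))
def test_roundtrip_dataset_allowing_empty(ds):
    # Permissive version: documents the known failure without blocking the PR.
    xr.testing.assert_identical(ds, xr.Dataset.from_dataframe(ds.to_dataframe()))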
+1 let's do that!
OK, I've xfailed it. |
Opened #3468. Thanks @takluyver
Thanks @takluyver. And @Zac-HD for the feedback; v much agree with your approach |
* upstream/master:
  __dask_tokenize__ (pydata#3446)
  Type check sentinel values (pydata#3472)
  Fix typo in docstring (pydata#3474)
  fix test suite warnings re `drop` (pydata#3460)
  Fix integrate docs (pydata#3469)
  Fix leap year condition in monthly means example (pydata#3464)
  Hypothesis tests for roundtrip to & from pandas (pydata#3285)
  unpin cftime (pydata#3463)
  Cleanup whatsnew (pydata#3462)
  enable xr.ALL_DIMS in xr.dot (pydata#3424)
  Merge stable into master (pydata#3457)
  upgrade black verison to 19.10b0 (pydata#3456)
  Remove outdated code related to compatibility with netcdftime (pydata#3450)
  Remove deprecated behavior from dataset.drop docstring (pydata#3451)
  jupyterlab dark theme (pydata#3443)
  Drop groups associated with nans in group variable (pydata#3406)
  Allow ellipsis (...) in transpose (pydata#3421)
  Another groupby.reduce bugfix. (pydata#3403)
  add icomoon license (pydata#3448)
* upstream/master: (27 commits)
  drop_vars; deprecate drop for variables (pydata#3475)
  uamiv test using only raw uamiv variables (pydata#3485)
  Optimize dask array equality checks. (pydata#3453)
  Propagate indexes in DataArray binary operations. (pydata#3481)
  python 3.8 tests (pydata#3477)
  __dask_tokenize__ (pydata#3446)
  Type check sentinel values (pydata#3472)
  Fix typo in docstring (pydata#3474)
  fix test suite warnings re `drop` (pydata#3460)
  Fix integrate docs (pydata#3469)
  Fix leap year condition in monthly means example (pydata#3464)
  Hypothesis tests for roundtrip to & from pandas (pydata#3285)
  unpin cftime (pydata#3463)
  Cleanup whatsnew (pydata#3462)
  enable xr.ALL_DIMS in xr.dot (pydata#3424)
  Merge stable into master (pydata#3457)
  upgrade black verison to 19.10b0 (pydata#3456)
  Remove outdated code related to compatibility with netcdftime (pydata#3450)
  Remove deprecated behavior from dataset.drop docstring (pydata#3451)
  jupyterlab dark theme (pydata#3443)
  ...
Part of #1846: test roundtripping between xarray DataArray & Dataset and pandas Series & DataFrame.
I haven't particularly tried to hunt down corner cases (e.g. dataframes with 0 columns), in favour of adding tests that currently pass. But these tests probably form a useful platform if you do want to ensure corner cases like that behave nicely - just modify the limits and see what fails.
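As a concrete example of "modify the limits", a hedged sketch of a Series roundtrip test in which loosening min_side to 0 would probe the empty corner case (illustrative code, not necessarily the PR's exact tests):

import hypothesis.extra.numpy as npst
import pandas as pd
import xarray as xr
from hypothesis import given

# Tighten or loosen these bounds to explore corner cases,
# e.g. min_side=0 generates length-0 series.
numeric_arrays_1d = npst.arrays(
    dtype=npst.floating_dtypes(),
    shape=npst.array_shapes(max_dims=1, min_side=1, max_side=100),
)

@given(numeric_arrays_1d)
def test_roundtrip_series(values):
    ser = pd.Series(values)
    # Name the index: xarray would otherwise invent the dim name "dim_0",
    # which would not compare equal on the way back.
    ser.index.name = "rows"
    roundtripped = xr.DataArray(ser).to_pandas()
    pd.testing.assert_series_equal(ser, roundtripped)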