Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#19431] Regression in make_block_same_class (tests failing for new fastparquet release) #19434

Merged
merged 23 commits into from
Jan 29, 2018

Conversation

minggli
Copy link
Contributor

@minggli minggli commented Jan 28, 2018

dtype seems still in use at:
https://github.com/minggli/pandas/blob/2f4fc0790a5e5c51eb80bbfadf4c80f0bb424c56/pandas/core/internals.py#L2911

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Can you add a test for the internal itself? You can take the small example I gave here (#19431 (comment)) and put it in somewhere in tests/internals/test_internals.py (probably best in the TestBlock class)

# warns on the coercion
with catch_warnings(record=True):
check_round_trip(df, fp, expected=df.astype('datetime64[ns]'))
check_round_trip(df, fp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to do this dependent on the fastparquet version, as older versions still return non-tz data (or we could skip the test for older versions)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or probably better: check how it is done for Arrow (test_basic in TestParquetPyarrow), it adds a tz column to the tested dataframe dependent on the version.

@codecov
Copy link

codecov bot commented Jan 28, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@8cbee35). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #19434   +/-   ##
=========================================
  Coverage          ?   91.62%           
=========================================
  Files             ?      150           
  Lines             ?    48724           
  Branches          ?        0           
=========================================
  Hits              ?    44642           
  Misses            ?     4082           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.99% <100%> (?)
#single 41.74% <66.66%> (?)
Impacted Files Coverage Δ
pandas/core/internals.py 95.47% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8cbee35...77422ba. Read the comment docs.

@@ -206,7 +206,7 @@ def array_dtype(self):
"""
return self.dtype

def make_block(self, values, placement=None, ndim=None):
def make_block(self, values, placement=None, ndim=None, dtype=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add this back, but it needs to be deprecated anyhow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then please answer in the original issue how to achieve the same result (see the code example I posted there, to create a new DatetimeTZ block from a numpy array).
But I don't see why this needs to be deprecated. The dtype argument was actually doing something, in contrast to fastpath (which we therefore deprecated)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fp should be using make_block and not make_block_same_class

df = pd.DataFrame({'dt': pd.date_range('20130101', periods=3)})

# fastparquet supports timezone since 0.1.4, and not before
import fastparquet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to do this in a separate test. Have 1 more fp <= 0.1.3 and 1 for fp > 0.1.3 (essentially duplicate the test with 1 working for the old version and 1 for the new)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add 1 build with fp pinned to 0.1.3, the so pin in requirements-2.7.sh

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minggli we can indeed keep a test for the old behaviour (so only run for for older versions), but can you move the new behaviour to test_basic of this class (so we use the same pattern as in the pyarrow test), that is what I meant with my previous issue.

need to add 1 build with fp pinned to 0.1.3, the so pin in requirements-2.7.sh

@jreback We have one build for fastparquet 0.1.1 (that's the reason the tests were failing here in the original PR), so I would say that is enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sure.

@jreback jreback added the IO Parquet parquet, feather label Jan 28, 2018
@jreback
Copy link
Contributor

jreback commented Jan 28, 2018

I am going to merge #19435 before this to fix the CI for now. you should remove the pins (all except the 2.7 one), and this should still pass (after fixes above).

@@ -483,10 +488,12 @@ def test_categorical(self, fp):
check_round_trip(df, fp)

def test_datetime_tz(self, fp):
# doesn't preserve tz
if LooseVersion(fastparquet.__version__) > LooseVersion('0.1.3'):
pytest.skip("timezone not supported for older fp")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rather "timezones supported in newer versions of fp"

@@ -448,6 +448,11 @@ class TestParquetFastParquet(Base):
def test_basic(self, fp, df_full):
df = df_full

# additional supported types for fastparquet >= 0.1.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls follow the existing style. see how we have a fixture for pa_ge_070. you need to make similar for fp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the existing style (see how it is for pyarrow) and what I asked for, so please leave it like this (in this PR).

For the old test we can use a fixture like fp_lt_014 however.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

little confused. But I agree it's better to match existing style in test_basic to avoid unnecessary style variation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a totally incorrect style ATM. So please see what you can fix. skipping INSIDE a test function is not correct, rather a fixture should be used. This must have slipped in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeff, please leave this. I asked @minggli to do this, and it is the current style. You can do a follow-up PR if you want to change this in all places.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix the 2.7 build (in requirements.sh) to use fastparquet=0.1.3



@pytest.fixture
def fp_ge_014():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you are already using this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct. removed.

@@ -448,9 +466,11 @@ class TestParquetFastParquet(Base):
def test_basic(self, fp, df_full):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make 2 tests here, one for < 0.1.4 and 1 for >

@@ -482,14 +502,15 @@ def test_categorical(self, fp):
df = pd.DataFrame({'a': pd.Categorical(list('abc'))})
check_round_trip(df, fp)

def test_datetime_tz(self, fp):
# doesn't preserve tz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same need 2 tests (1 with each fixture)

Copy link
Contributor Author

@minggli minggli Jan 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrow is using similar style. there are already two tests for datetime_tz, one < 0.1.4 one >.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2018

add a whatsnew entry in other api changes for this. (say compat with fp >= 0.1.4)


def make_block_scalar(self, values):
"""
Create a ScalarBlock
"""
return ScalarBlock(values)

def make_block_same_class(self, values, placement=None, ndim=None):
def make_block_same_class(self, values, placement=None, ndim=None,
dtype=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deprecate dtype here (add as a FutureWarning)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -448,9 +466,11 @@ class TestParquetFastParquet(Base):
def test_basic(self, fp, df_full):
df = df_full

# additional supported types for fastparquet
# additional supported types for fastparquet>=0.1.4
if LooseVersion(pyarrow.__version__) >= LooseVersion('0.1.4'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong, you mean fastparquet, yes? This is why the fixtures are FAR preferred

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jeff, I am fine with either however, Joris suggested for now to keep it in line with pyarrow style as per #19434 (comment). I feel it's good to keep it consistent and follow up with another PR for styling for both PyArrow and FP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minggli only the 'pyarrow' -> 'fastparquet' on this line you need to change anyway (copy paste error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose its ok. The problem is, this will be inconsistent with ALL the rest of pandas. You can make a simple change here to fix it. I would appreciate this done here, or ok in a followup . We spend an enormous amount of time making things consistent. adding inconsistent code is not great, even for expediency.

@@ -312,6 +312,7 @@ Other API Changes
- :func:`DatetimeIndex.shift` and :func:`TimedeltaIndex.shift` will now raise ``NullFrequencyError`` (which subclasses ``ValueError``, which was raised in older versions) when the index object frequency is ``None`` (:issue:`19147`)
- Addition and subtraction of ``NaN`` from a :class:`Series` with ``dtype='timedelta64[ns]'`` will raise a ``TypeError` instead of treating the ``NaN`` as ``NaT`` (:issue:`19274`)
- Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`)
- Compatibility in :func:`pandas.date_range` with fastparquet==0.1.4 which now supports timezone e.g. ``tz``='US/Eastern'`` (:issue:`19431`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean read_parquet instead of date_range ?
But, we didn't do anything for this, fastparquet 0.1.4 already works nicely with pandas 0.22, so I don't think this needs a whatsnew notice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, I am not sure how this whatsnew notice proposed relates to API catalogue. Do you want me to remove this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, we are fixing something that we changed in the unreleased version of pandas, so users should not be aware of this. So I would remove this, but let's wait what @jreback says

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should however fix the note in io.rst ( the note in http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet) about fastparquet not supporting timezones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off to bed now. thanks guys. hope CI will work normally after this PR.

@minggli
Copy link
Contributor Author

minggli commented Jan 29, 2018

@jorisvandenbossche @jreback please feedback on this PR?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for the make_block_same_class deprecation? (one that assures the warning is raised, and that it is still working correctly)

if dtype is not None:
# issue 19431 fastparquet is passing this
warnings.warn("dtype argument is deprecated, will be removed "
"in a future release.", FutureWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this a DeprecationWarning instead of FutureWarning, as it is typically only something developers need to see, not users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post an issue on fastparquet to fix this as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let’s leave this as a FutureWarning
it will encourage fp to fix this as it will be visible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the issue I opened yesterday: dask/fastparquet#297

FutureWarning will just be annoying for users in this case, and I am confident fastparquet will change that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it’s changed then we can simply move the bar on fp min version and this is no problem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not keeping private API around for external packages
this is way too much work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gents, which is it? FutureWarning or DeprecationWarning? it seems to me that DeprecationWarning is meant to be used here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls leave it as FutureWarning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -448,9 +457,11 @@ class TestParquetFastParquet(Base):
def test_basic(self, fp, df_full):
df = df_full

# additional supported types for fastparquet
# additional supported types for fastparquet>=0.1.4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can leave out the ">=0.1.4" as the timedelta is not specific for the newer versions

@@ -216,20 +216,25 @@ def make_block(self, values, placement=None, ndim=None):
if ndim is None:
ndim = self.ndim

return make_block(values, placement=placement, ndim=ndim)
return make_block(values, placement=placement, ndim=ndim, dtype=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove dtype here? As fastparquet was not using it (and I don't think anybody else will)
Or otherwise add the same deprecation warning as you did for make_block_same_class

@@ -4537,7 +4537,7 @@ See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and
.. note::

These engines are very similar and should read/write nearly identical parquet format files.
Currently ``pyarrow`` does not support timedelta data, and ``fastparquet`` does not support timezone aware datetimes (they are coerced to UTC).
Currently ``pyarrow`` does not support timedelta data, ``fastparquet`` now supports timezone aware datetimes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to change. you can't say 'now supports' there is no context here. Simply remove the fp statement (or qualify it with fp >1.4)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

with pytest.warns(FutureWarning):
copy = block.make_block_same_class(block.values,
dtype=block.values.dtype)
assert block.dtype == copy.dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need these asserts, just running this is enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -285,6 +285,14 @@ def test_delete(self):
with pytest.raises(Exception):
newb.delete(3)

def test_make_block_same_class(self):
block = create_block('M8[ns, US/Eastern]', [3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def test_make_block_same_class(self):
# issue 19431
block = create_block('M8[ns, US/Eastern]', [3])
with pytest.warns(FutureWarning):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_produces_warning instead. we catch this usage of pytest.warns in the linter and is not standard for pandas codease.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jorisvandenbossche jorisvandenbossche merged commit 272cabb into pandas-dev:master Jan 29, 2018
@jorisvandenbossche
Copy link
Member

@minggli Thanks a lot! (and sorry for the extensive back and forth)

@minggli
Copy link
Contributor Author

minggli commented Jan 29, 2018

@jorisvandenbossche no problem. happy to help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants