ExtensionBlock.take_nd crashes in 1.1.0 #35768

Closed
vmarkovtsev opened this issue Aug 17, 2020 · 7 comments
Labels
Needs Info: Clarification about behavior needed to assess issue

Comments

@vmarkovtsev

vmarkovtsev commented Aug 17, 2020

new_values = self.values.take(indexer, fill_value=fill_value, allow_fill=True)

In my case, self.values is a pd.Series of pd.Timestamp values. Series.take() accepts neither fill_value nor allow_fill, so the kwargs check fails.
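
The mismatch described here can be seen outside the block internals. The following is a minimal sketch, assuming a reasonably recent pandas; the sample timestamps are copied from the debugger output in this report, and the variable names are illustrative only. It calls `take` with the same keyword arguments on a Series and on its underlying array.

```python
import pandas as pd

# Two tz-aware timestamps, matching the dtype shown in the report
# (datetime64[ns, UTC]).
ser = pd.Series(
    pd.to_datetime(["2019-11-01 09:08:16", "2017-01-30 18:04:00"], utc=True),
    name="commit_date",
)

# Series.take only tolerates the numpy-compat keywords ("out", "mode"),
# so allow_fill / fill_value are rejected with a TypeError.
try:
    ser.take([0, 1], allow_fill=True, fill_value=pd.NaT)
    series_error = None
except TypeError as exc:
    series_error = str(exc)

# The underlying extension array does accept them. With allow_fill=True,
# an index of -1 marks a position to be filled with fill_value.
taken = ser.array.take([0, -1], allow_fill=True, fill_value=pd.NaT)

print(series_error)
print(taken)
```

This is why the call only works when `self.values` holds the array itself rather than a Series wrapping it.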

athenian/api/controllers/miners/github/branches.py:34: in extract_branches
    for repo, repo_branches in branches.groupby(Branch.repository_full_name.key, sort=False):
/usr/local/lib/python3.8/dist-packages/pandas/core/groupby/ops.py:133: in get_iterator
    for key, (i, group) in zip(keys, splitter):
/usr/local/lib/python3.8/dist-packages/pandas/core/groupby/ops.py:935: in __iter__
    sdata = self._get_sorted_data()
/usr/local/lib/python3.8/dist-packages/pandas/core/groupby/ops.py:948: in _get_sorted_data
    return self.data.take(self.sort_idx, axis=self.axis)
/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py:3341: in take
    new_data = self._mgr.take(
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py:1414: in take
    return self.reindex_indexer(
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py:1251: in reindex_indexer
    new_blocks = [
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py:1252: in <listcomp>
    blk.take_nd(
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py:1720: in take_nd
    new_values = self.values.take(indexer, fill_value=fill_value, allow_fill=True)
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py:829: in take
    nv.validate_take(tuple(), kwargs)
/usr/local/lib/python3.8/dist-packages/pandas/compat/numpy/function.py:68: in __call__
    validate_kwargs(fname, kwargs, self.defaults)
/usr/local/lib/python3.8/dist-packages/pandas/util/_validators.py:148: in validate_kwargs
    _check_for_invalid_keys(fname, kwargs, compat_args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

fname = 'take', kwargs = {'allow_fill': True, 'fill_value': numpy.datetime64('NaT')}, compat_args = OrderedDict([('out', None), ('mode', 'raise')])

    def _check_for_invalid_keys(fname, kwargs, compat_args):
        """
        Checks whether 'kwargs' contains any keys that are not
        in 'compat_args' and raises a TypeError if there is one.
        """
        # set(dict) --> set of the dictionary's keys
        diff = set(kwargs) - set(compat_args)
    
        if diff:
            bad_arg = list(diff)[0]
>           raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'")
E           TypeError: take() got an unexpected keyword argument 'allow_fill'

/usr/local/lib/python3.8/dist-packages/pandas/util/_validators.py:122: TypeError
(Pdb++) self.values
0   2019-11-01 09:08:16+00:00
1   2017-01-30 18:04:00+00:00
2   2016-12-05 10:59:00+00:00
3   2019-05-16 11:16:00+00:00
Name: commit_date, dtype: datetime64[ns, UTC]

The Series ended up in self.values because of this call:

-> fc.replace(0, pd.NaT, inplace=True)
[50]   /usr/local/lib/python3.8/dist-packages/pandas/core/series.py(4563)replace()
-> return super().replace(
[51]   /usr/local/lib/python3.8/dist-packages/pandas/core/generic.py(6583)replace()
-> return self._update_inplace(result)
[52]   /usr/local/lib/python3.8/dist-packages/pandas/core/generic.py(3955)_update_inplace()
-> self._maybe_update_cacher(verify_is_copy=verify_is_copy)
[53]   /usr/local/lib/python3.8/dist-packages/pandas/core/generic.py(3235)_maybe_update_cacher()
-> ref._maybe_cache_changed(cacher[0], self)
[54]   /usr/local/lib/python3.8/dist-packages/pandas/core/generic.py(3196)_maybe_cache_changed()
-> self._mgr.iset(loc, value)
[55]   /usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py(1066)iset()
-> blk.set(blk_locs, value_getitem(val_locs))
[56] > /usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py(1593)set()
-> self.values = values
@jreback
Contributor

jreback commented Aug 17, 2020

Please show a user-facing example; this is an internal function.

simonjayhawkins added the "Needs Info" label (Clarification about behavior needed to assess issue) on Aug 17, 2020
@vmarkovtsev
Author

Sure.

Grab branches.pickle.gz, then run:

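import pickle
import pandas as pd

# Assumes branches.pickle.gz has already been decompressed to branches.pickle.
with open("branches.pickle", "rb") as fin:
    branches, dt_cols = pickle.load(fin)

for col in dt_cols:
    fc = branches[col]
    # Note: `0 in fc` tests the index labels, not the values.
    if 0 in fc:
        fc.replace(0, pd.NaT, inplace=True)

# Iterating over the groups raises the TypeError shown above.
for repo, repo_branches in branches.groupby("repository_full_name", sort=False):
    print(repo, repo_branches)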
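
The attachment is named branches.pickle.gz while the script opens branches.pickle, so a decompression step is implied. A minimal sketch of reading the compressed file directly with the standard library follows; the helper name `load_gzipped_pickle` is hypothetical, and it assumes the attachment contains the same `(branches, dt_cols)` tuple the script expects.

```python
import gzip
import pickle


def load_gzipped_pickle(path):
    """Unpickle one object from a gzip-compressed file.

    Only use this on files from a trusted source: unpickling can
    execute arbitrary code.
    """
    with gzip.open(path, "rb") as fin:
        return pickle.load(fin)
```

With that helper, the first step of the script would become `branches, dt_cols = load_gzipped_pickle("branches.pickle.gz")`.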

@jbrockmendel
Member

Grab branches.pickle.gz,

@vmarkovtsev can you give an example that we can just copy/paste?

@vmarkovtsev
Author

I can inline that 2KB pickle file as a bytes literal @jbrockmendel. It comes directly from a database. If you are afraid of loading foreign pickles and are not familiar with docker/VMs, I can dump it as CSV.

@jbrockmendel
Member

Pickle safety is a concern, but mainly it's the fact that we have 3500 issues to deal with, so making your issue simple to reproduce increases the odds of it getting looked at in a timely manner. See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@vmarkovtsev
Author

Pfff, I have a simple workaround, so whatever.

@bast0006

bast0006 commented Jan 11, 2021

I've managed to reproduce this. This is related to pull request #37023, and issues #36953 and #35509. The issue was fixed in release 1.1.2, and is present in 1.1.1.

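from io import StringIO

import pandas as pd

csv = """a,b
a,2021-01-01 08:00:00+00:00
a,2021-01-01 08:00:00+00:00
a,2021-01-01 08:00:00+00:00
a,2021-01-01 08:00:00+00:00
a,2021-01-01 08:00:00+00:00"""

df = pd.read_csv(StringIO(csv))

# Parse column b into tz-aware datetimes.
df['b'] = pd.to_datetime(df['b'])

# In-place replace on the tz-aware column; no value matches 0.
df['b'].replace(0, pd.NaT, inplace=True)

# On 1.1.1: TypeError: take() got an unexpected keyword argument 'allow_fill'
print(*df.groupby("a"))
# On 1.1.1, with the groupby line commented out: AttributeError in .info()
print(df, df.info(verbose=True))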

This code reproduces the above exception. If the groupby() call is commented out, another exception is raised in .info():
AttributeError: 'Series' object has no attribute 'reshape'
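
The version range reported in this thread (a crash in 1.1.0 per the issue title, still present in 1.1.1, fixed in 1.1.2) can be turned into a small guard. This is a minimal sketch using only the standard library. The helper names `parse_release` and `is_reported_affected` are hypothetical, and treating releases before 1.1.0 as unaffected is an assumption that the thread does not confirm.

```python
def parse_release(version):
    """Return the leading numeric (major, minor, patch) parts of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        # A piece with no leading digits (e.g. "dev0") counts as 0.
        parts.append(int(digits or 0))
    # Pad short versions such as "1.1" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_reported_affected(version):
    """True for releases this thread reports as affected: 1.1.0 and 1.1.1.

    Assumption: anything below 1.1.0 or at/above 1.1.2 is treated as unaffected.
    """
    return (1, 1, 0) <= parse_release(version) < (1, 1, 2)
```

In a project that pins pandas, `is_reported_affected(pd.__version__)` could be used to decide whether to avoid the in-place `replace` call or to upgrade.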

The timezone piece of the datetime seems required to trigger the bug, strangely enough.
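
One plausible reading of the timezone dependence, offered as an inference rather than something confirmed in this thread: tz-aware datetime columns use a pandas extension dtype (`DatetimeTZDtype`), and the failing method is `ExtensionBlock.take_nd`, whereas tz-naive columns use a plain NumPy `datetime64` dtype. The sketch below, assuming a reasonably recent pandas, only checks that dtype distinction; it does not reproduce the bug.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

stamps = ["2021-01-01 08:00:00", "2021-01-02 08:00:00"]

# Same timestamps, once without and once with a timezone.
naive = pd.Series(pd.to_datetime(stamps))
aware = pd.Series(pd.to_datetime(stamps, utc=True))

# Only the tz-aware column is backed by an extension dtype.
naive_is_ext = is_extension_array_dtype(naive.dtype)
aware_is_ext = is_extension_array_dtype(aware.dtype)

print(naive.dtype, naive_is_ext)
print(aware.dtype, aware_is_ext)
```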
