BUG: pd.to_datetime() raises InvalidIndexError with Null-like arguments #35888

Closed
2 of 3 tasks
blinkseb opened this issue Aug 25, 2020 · 10 comments · Fixed by #45512
Labels
Datetime Datetime data dtype good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@blinkseb

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype='object')
pd.to_datetime(s)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/tools/datetimes.py", line 801, in to_datetime
    result = arg.map(cache_array)
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/series.py", line 3970, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/base.py", line 1131, in _map_values
    indexer = mapper.index.get_indexer(values)
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/indexes/base.py", line 2980, in get_indexer
    raise InvalidIndexError(
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Problem description

pd.to_datetime() crashes with pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects if the input contains a mix of NaT and None.
This issue is similar to #22305, which was fixed a while ago. It only occurs if the input is large enough to trigger the caching mechanism.
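
Since the traceback goes through arg.map(cache_array), one way to check that the failure is tied to the caching path is to turn caching off with the cache parameter of pd.to_datetime. The snippet below is a minimal sketch under the assumption that bypassing the cache avoids the failing lookup; it was not part of the original report.

```python
import pandas as pd

# Same input as the report: 2000 NaT followed by 2000 None, object dtype.
s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype="object")

# cache=False skips the cache of unique values, so the
# arg.map(cache_array) step shown in the traceback is never reached.
result = pd.to_datetime(s, cache=False)

# Every entry should come back as a missing datetime.
print(result.isna().all())
```

This trades the speed-up from caching for robustness, so it is only a workaround for affected versions.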

Expected Output

No crash

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.15-200.fc32.x86_64
Version : #1 SMP Tue Aug 11 16:36:14 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.10.0
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

@blinkseb blinkseb added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2020
@pbjones

pbjones commented Aug 26, 2020

I don't know if it's helpful, but I also hit the same or a similar error today with None and to_datetime. If you are unable to replicate it, I can probably reduce mine to a smaller example (c. 58,000 rows, None mixed with dates).

From my workaround attempts, I found that I can use '' instead of None to get my desired outcome, NaT. If memory serves me right, I couldn't do this previously. So, amending the example above, this produces 4000 NaTs:

    s = pd.Series([pd.NaT] * 2000 + [''] * 2000, dtype='object')
    pd.to_datetime(s)

@TomAugspurger
Contributor

I think the root problem is something like this: we try to get the unique dates and then pass them to an Index.

In [10]: pd.unique(arg)
Out[10]: array([NaT, None], dtype=object)

In [11]: pd.Index(arg)
Out[11]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

But Index infers a datetime dtype and converts None to NaT, so the two distinct unique values become duplicates. We need to standardize the NA values somewhere, though I'm not sure exactly where that should be done. We'd welcome more investigation!
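
To illustrate what standardizing the NA values could look like from the user side, here is a minimal sketch. The helper name normalize_nulls is hypothetical and is not the fix that landed in pandas; it assumes the input holds only scalars and uses pd.isna to map every null-like entry to pd.NaT before conversion.

```python
import pandas as pd


def normalize_nulls(values):
    """Hypothetical helper: map every null-like scalar (None, NaN, NaT) to pd.NaT."""
    return [pd.NaT if pd.isna(v) else v for v in values]


mixed = [pd.NaT] * 2000 + [None] * 2000
cleaned = pd.Series(normalize_nulls(mixed), dtype="object")

# Only one kind of missing value is left, so deduplicating the input
# can no longer produce NaT and None as two separate keys.
result = pd.to_datetime(cleaned)
print(result.isna().all())
```

Non-null entries such as date strings pass through unchanged, so the helper can be applied to real data before calling pd.to_datetime.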

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Sep 9, 2020
@simonjayhawkins simonjayhawkins changed the title BUG: pd.to_datetime() throws with Null-like arguments BUG: pd.to_datetime() raises InvalidIndexError with Null-like arguments Sep 9, 2020
@mroeschke
Member

It appears this works on master now. It could use a test.

In [4]: import pandas as pd
   ...:
   ...: s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype='object')
   ...: pd.to_datetime(s)
Out[4]:
0      NaT
1      NaT
2      NaT
3      NaT
4      NaT
        ..
3995   NaT
3996   NaT
3997   NaT
3998   NaT
3999   NaT
Length: 4000, dtype: datetime64[ns]
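
A regression test for this could look roughly like the sketch below. The test name is hypothetical and this is not the test that was eventually merged. It only checks properties visible in the output above: the length, a datetime dtype, and all values missing.

```python
import pandas as pd


def test_to_datetime_mixed_nat_and_none():
    # Hypothetical test name; the input mirrors the original report.
    s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype="object")

    result = pd.to_datetime(s)

    # The call must not raise, and every entry must be a missing datetime.
    assert len(result) == 4000
    assert pd.api.types.is_datetime64_any_dtype(result)
    assert result.isna().all()


# Run directly when not collected by pytest.
test_to_datetime_mixed_nat_and_none()
```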

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 10, 2021
@ryangilmour
Contributor

ryangilmour commented Aug 12, 2021

My first time contributing - but the PR above should hopefully be a sufficient test. Not sure these tests require an addition to whatsnew?

@ryangilmour
Contributor

take

@ryangilmour
Contributor

ryangilmour commented Aug 20, 2021

Looks like this is actually the same issue as #39882 - which was subsequently fixed by #41006.

There are some tests in the fix in #41006, but my PR #43006 adds some more explicit tests for this issue. Will request a review for these once the automated tests are passing!

@KhajaFasi

Can I work on this?

@ryangilmour
Contributor

Sounds good - I didn't get a chance to work on it after my initial attempt, but my PR (#43006) should already have the necessary changes. If you can get them working with the automated tests, it should be good to go 🤞🏻.

@ashaypatil12

Hey @ryangilmour, greetings. Could you please let me know whether you are still working on this, or whether you need any help?

@ryangilmour
Contributor

Hi @ashaypatil12 - I'm not currently working on this. If you want to make some progress on this feel free to take a look at the PR (#43006) and see if you can get that over the line.

I didn't get a chance to work on it after my initial attempt, but my PR (#43006) should already have the necessary changes. If you can get them working with the automated tests, it should be good to go 🤞🏻

@ryangilmour ryangilmour removed their assignment Jan 19, 2022
@jreback jreback modified the milestones: Contributions Welcome, 1.5 Jan 21, 2022