BUG: pd.to_datetime() raises InvalidIndexError with Null-like arguments #35888

blinkseb · 2020-08-25T07:37:44Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype='object')
pd.to_datetime(s)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/tools/datetimes.py", line 801, in to_datetime
    result = arg.map(cache_array)
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/series.py", line 3970, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/base.py", line 1131, in _map_values
    indexer = mapper.index.get_indexer(values)
  File "/home/sbrochet/venvs/tmp-fa372ee62ee9bef/lib64/python3.8/site-packages/pandas/core/indexes/base.py", line 2980, in get_indexer
    raise InvalidIndexError(
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Problem description

pd.to_datetime() crashes with `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` if the input contains a mix of NaT and None.
This issue is similar to #22305, which was fixed a while ago. It only occurs if the input is large enough to trigger the caching mechanism.
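
Since the traceback shows the failure inside the cache lookup (`arg.map(cache_array)`), one possible way to sidestep it on an affected version is to turn caching off through the `cache` parameter of `pd.to_datetime`. This is a minimal sketch, assuming that `cache=False` avoids the failing path on pandas 1.1.1; that assumption was not verified in this report.

```python
import pandas as pd

# Same input as the report: an object Series mixing NaT and None.
s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype="object")

# cache=False asks to_datetime not to build a cache of unique values,
# so no lookup against a cache index is performed.
result = pd.to_datetime(s, cache=False)

# Every element is expected to come back as a missing datetime.
print(result.isna().all())
```

Disabling the cache may be slower on inputs with many repeated date strings, so this is a workaround rather than a fix.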

Expected Output

No crash

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.15-200.fc32.x86_64
Version : #1 SMP Tue Aug 11 16:36:14 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.10.0
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

pbjones · 2020-08-26T21:53:11Z

I don't know if it's helpful, but I also hit the same or a similar error today with None and to_datetime. If you are unable to replicate it, I can probably reduce my case to a smaller example (c. 58,000 rows, None mixed with dates).

From my workaround attempts, I found that I can use '' instead of None to get my desired outcome, NaT. If memory serves me right, I couldn't do this previously. So, amending the example above, this yields 4000 NaTs:

    s = pd.Series([pd.NaT] * 2000 + [''] * 2000, dtype='object')
    pd.to_datetime(s)

TomAugspurger · 2020-09-04T19:26:00Z

I think the root problem is something like this: We try to get the unique dates and then pass it to an index.

In [10]: pd.unique(arg)
Out[10]: array([NaT, None], dtype=object)

In [11]: pd.Index(arg)
Out[11]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

But Index infers datetime dtype and converts None to NaT, so the two distinct unique values collapse into duplicate index entries. We need to standardize the NA values somewhere; I'm not sure exactly where that should be done, though. We'd welcome more investigation!
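
The idea of standardizing the NA values can be illustrated from the caller's side. This is only a sketch of the concept, not the change that eventually landed in pandas; `normalize_nulls` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def normalize_nulls(values: pd.Series) -> pd.Series:
    """Replace every null-like element (None, NaN, NaT) with pd.NaT.

    Hypothetical helper, not part of pandas.
    """
    cleaned = [pd.NaT if pd.isna(v) else v for v in values]
    return pd.Series(cleaned, index=values.index, dtype="object")


arg = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype="object")
normalized = normalize_nulls(arg)

# After normalization no None markers remain, only NaT.
no_none = all(v is not None for v in normalized)

# With a single kind of null marker, the unique values and the index
# built from them should no longer disagree.
converted = pd.to_datetime(normalized)
```

The helper assumes scalar elements; it walks the Series in Python, so it trades speed for clarity.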

mroeschke · 2021-08-10T04:55:24Z

It appears this works on master now. Could use a test

In [4]: import pandas as pd
   ...:
   ...: s = pd.Series([pd.NaT] * 2000 + [None] * 2000, dtype='object')
   ...: pd.to_datetime(s)
Out[4]:
0      NaT
1      NaT
2      NaT
3      NaT
4      NaT
        ..
3995   NaT
3996   NaT
3997   NaT
3998   NaT
3999   NaT
Length: 4000, dtype: datetime64[ns]
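
A regression check along these lines could cover the case. This is a sketch only: `check_to_datetime_mixed_nulls` is a hypothetical name, and it is not the test that was merged for this issue. It is expected to pass only on pandas versions that include the fix.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_to_datetime_mixed_nulls(size: int = 2000) -> pd.Series:
    """Convert an object Series mixing NaT and None and verify the result."""
    s = pd.Series([pd.NaT] * size + [None] * size, dtype="object")
    result = pd.to_datetime(s)

    # Same length as the input, datetime dtype, and all values missing.
    assert len(result) == 2 * size
    assert is_datetime64_any_dtype(result)
    assert result.isna().all()
    return result


result = check_to_datetime_mixed_nulls()
```

The input is kept at 4000 elements to match the report, since the failure only appeared once the input was large enough for caching to apply.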

ryangilmour · 2021-08-12T23:10:57Z

My first time contributing - but the PR above should hopefully be a sufficient test. Not sure these tests require an addition to whatsnew?

ryangilmour · 2021-08-12T23:15:31Z

take

ryangilmour · 2021-08-20T14:10:28Z

Looks like this is actually the same issue as #39882 - which was subsequently fixed by #41006.

There are some tests in the fix in #41006, but my PR #43006 adds some more explicit tests for this issue. Will request a review for these once the automated tests are passing!

KhajaFasi · 2021-10-06T17:16:55Z

Can I work on this?

ryangilmour · 2021-10-07T08:56:10Z

Sounds good. I didn't get a chance to work on it after my initial attempt, but my PR (#43006) should already have the necessary changes. If you can get them working with the automated tests, it should be good to go 🤞🏻

ashaypatil12 · 2022-01-17T21:41:56Z

Hey @ryangilmour, greetings. Could you please let me know whether you are still working on this, or whether you need any help?

ryangilmour · 2022-01-19T09:08:30Z

Hi @ashaypatil12 - I'm not currently working on this. If you want to make some progress on this feel free to take a look at the PR (#43006) and see if you can get that over the line.

I didn't get a chance to work on it after my initial attempt, but my PR (#43006) should already have the necessary changes. If you can get them working with the automated tests, it should be good to go 🤞🏻

blinkseb added the Bug and Needs Triage labels Aug 25, 2020

TomAugspurger added the Missing-data and Datetime labels and removed the Needs Triage label Sep 4, 2020

simonjayhawkins added this to the Contributions Welcome milestone Sep 9, 2020

simonjayhawkins changed the title ~~BUG: pd.to_datetime() throws with Null-like arguments~~ BUG: pd.to_datetime() raises InvalidIndexError with Null-like arguments Sep 9, 2020

mroeschke added the good first issue and Needs Tests labels and removed the Bug label Aug 10, 2021

ryangilmour mentioned this issue Aug 12, 2021

Testing to_datetime, converting none to NaT #43006

Closed

4 tasks

github-actions bot assigned ryangilmour Aug 12, 2021

ryangilmour removed their assignment Jan 19, 2022

mroeschke mentioned this issue Jan 20, 2022

Test to datetime null to NaT #45512

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.5 Jan 21, 2022

jreback closed this as completed in #45512 Feb 1, 2022
