API: value-dependent behaviour in concat with all-NA data #40893

jorisvandenbossche · 2021-04-12T10:50:11Z

In general, we want to get rid of value-dependent behaviour in concat operations: the resulting dtype of a concat operation should depend only on the input dtypes, and not on the exact content (the exact values) of the inputs.
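As a minimal illustration of what "depends only on the input dtypes" means, the sketch below concatenates an int64 Series with two different float64 Series. The variable names are only illustrative. The result dtype is float64 in both cases, even when the float values happen to be integral:

```python
import pandas as pd

ints = pd.Series([1, 2], dtype="int64")
floats_fractional = pd.Series([1.5], dtype="float64")
floats_integral = pd.Series([3.0], dtype="float64")

# The result dtype follows from the input dtypes (int64 + float64),
# not from whether the float values could be represented as integers.
result_a = pd.concat([ints, floats_fractional])
result_b = pd.concat([ints, floats_integral])

print(result_a.dtype, result_b.dtype)  # float64 float64
```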

This has been discussed on several occasions in the past, eg in #33607 when adding the general EA interface for concat (there is still one value-dependent special case for Categorical involving integer categories / missing values, encoded in core/dtypes/concat.py::cast_to_common_type), or in #39122, which is about the same issue for empty Series/DataFrames.

But one other case (which came up recently in eg #39574 and #39612) is related to all-NA/NaN objects.

For DataFrames, when there is an all-missing column, its dtype gets ignored when determining the result dtype (which, however, requires inspecting the values of the column). Small example:

>>> import numpy as np
>>> import pandas as pd

>>> df_missing = pd.DataFrame({'a': [np.nan]})
>>> df_dt64 = pd.DataFrame({'a': [pd.Timestamp("2021-01-01")]}, dtype="datetime64[ns]")

>>> pd.concat([df_missing, df_dt64])
           a
0        NaT
0 2021-01-01

>>> pd.concat([df_missing, df_dt64]).dtypes
a    datetime64[ns]
dtype: object

This can be useful, as you can get such object/float dtype columns depending on how those "empty" all-NaN DataFrames are created (eg when constructing a DataFrame with a given index/columns but without data, or by reindexing the rows of an actually empty DataFrame, or reindexing the columns of a non-empty DataFrame).
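The three ways of ending up with such an all-NaN column can be sketched as follows. The variable names are illustrative, and the dtypes printed may depend on the pandas version, so treat this as a way to inspect the behaviour rather than a statement of it:

```python
import pandas as pd

# 1. Index and columns are given, but no data.
no_data = pd.DataFrame(index=[0, 1], columns=["a"])

# 2. Reindexing the rows of an empty DataFrame.
empty = pd.DataFrame(columns=["a"])
reindexed_rows = empty.reindex([0, 1])

# 3. Reindexing the columns of a non-empty DataFrame,
#    which introduces a new column "a" filled with NaN.
non_empty = pd.DataFrame({"b": [1, 2]})
reindexed_cols = non_empty.reindex(columns=["a", "b"])

for name, df in [
    ("no_data", no_data),
    ("reindexed_rows", reindexed_rows),
    ("reindexed_cols", reindexed_cols),
]:
    # Print the dtype of column "a" and whether it is entirely missing.
    print(name, df["a"].dtype, bool(df["a"].isna().all()))
```

In each case column "a" carries no actual data, yet it still has a concrete dtype that takes part in dtype resolution during concat.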

However, it does introduce annoying value-dependent behaviour, and is also not very consistent throughout pandas. For example, Series does not check for this, and will actually result in object dtype:

>>> pd.concat([df_missing['a'], df_dt64['a']])
0                    NaN
0    2021-01-01 00:00:00
Name: a, dtype: object

Further, this is also not consistent across data types. For example, we don't check for all-NA for the new nullable dtypes.

For ArrayManager, I didn't yet implement any special case value-dependent behaviour (#39612, so on this aspect it diverges from the BlockManager behaviour), as it would be good to first decide on the desired behaviour long term.
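To make the two candidate rules concrete, here is a rough sketch. All three functions are hypothetical helpers written for this illustration, not pandas internals, and `common_dtype` is a deliberately simplified stand-in for the real dtype promotion logic (identical dtypes are kept, anything else becomes object):

```python
import numpy as np
import pandas as pd


def common_dtype(dtypes):
    """Simplified promotion: keep a shared dtype, otherwise fall back to object."""
    unique = set(dtypes)
    return unique.pop() if len(unique) == 1 else np.dtype(object)


def value_independent_dtype(columns):
    # Only the dtypes are consulted; the values are never inspected.
    return common_dtype([col.dtype for col in columns])


def value_dependent_dtype(columns):
    # All-NA columns are dropped first, which requires looking at the values.
    kept = [col for col in columns if not col.isna().all()]
    return common_dtype([col.dtype for col in (kept or columns)])


missing = pd.Series([np.nan])                     # float64, all NaN
dt64 = pd.Series([pd.Timestamp("2021-01-01")])    # datetime64

print(value_independent_dtype([missing, dt64]))
print(value_dependent_dtype([missing, dt64]))
```

Under these simplified rules the value-independent variant yields object, while the value-dependent variant ignores the all-NaN column and yields the datetime64 dtype, mirroring the Series versus DataFrame difference described above.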

jorisvandenbossche added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Dtype Conversions (Unexpected or buggy dtype conversions), and API Design labels Apr 12, 2021

jorisvandenbossche mentioned this issue Apr 12, 2021

[ArrayManager] REF: Implement concat with reindexing #39612

Merged

mroeschke added the Needs Discussion (Requires discussion from core team before further action) label Aug 19, 2021

jbrockmendel mentioned this issue Mar 9, 2022

BUG: concat with empty dataframe with columns passed and nonempty dataframe coerces dtype to object #45637

Closed

jorisvandenbossche mentioned this issue Jun 17, 2022

REGR: revert behaviour change for concat with empty/all-NaN data #47372

Merged

jbrockmendel mentioned this issue Oct 30, 2022

API: Breaking Changes in 3.0 (without deprecations) #44823

Open

jbrockmendel mentioned this issue Apr 12, 2023

DEPR: concat ignoring all-NA columns #52613

Merged

mroeschke closed this as completed in #52613 Apr 18, 2023
