0.23.1 concat drops the name of the merge axis when not aligned #21629

Closed
pdemarti opened this issue Jun 25, 2018 · 7 comments
Labels
Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@pdemarti

pdemarti commented Jun 25, 2018

Code Sample, a copy-pastable example if possible

When the columns are aligned, there is no problem: the columns in the result have the correct name ('ID' here).

import pandas as pd

pd.concat([
    pd.DataFrame([[0, 1]], index=['r0'], columns=pd.Index(['a', 'b'], name='ID')),
    pd.DataFrame([[2, 3]], index=['r1'], columns=pd.Index(['a', 'b'], name='ID')),
], sort=True)

# out:
# ID  a  b
# r0  0  1
# r1  2  3

However, when the columns are not aligned, the name disappears:

pd.concat([
    pd.DataFrame([[0, 1]], index=['r0'], columns=pd.Index(['a', 'b'], name='ID')),
    pd.DataFrame([[2, 3]], index=['r1'], columns=pd.Index(['a', 'c'], name='ID')),
], sort=True)

# out:
#     a    b    c      <-- notice how the columns have lost their name ('ID').
# r0  0  1.0  NaN
# r1  2  NaN  3.0

Problem description

When concatenating DataFrames, I expect the non-concatenating axis (the columns axis, in the examples above) to keep its name(s).

An interesting question arises if the instances of the non-concatenating axis are not only misaligned but also have different names. In that case, we could use the majority value or drop the name altogether (None). In our code, we use names = collections.Counter([df.axes[nc_axis].names for df in objs]).most_common(1)[0][0].
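
For illustration, the majority-name idea could look roughly like the sketch below. The helper name `majority_axis_names` and the third frame (with the differing name 'other') are made up for this example; converting `names` to a tuple before counting is an assumption made so the values are plainly hashable, and the names are assigned back with `Index.set_names` after the concat.

```python
import collections

import pandas as pd


def majority_axis_names(objs, nc_axis=1):
    """Return the most common `names` of the non-concatenating axis (hypothetical helper)."""
    # Count each frame's axis names; tuples are used so the keys are hashable.
    counts = collections.Counter(tuple(df.axes[nc_axis].names) for df in objs)
    return list(counts.most_common(1)[0][0])


frames = [
    pd.DataFrame([[0, 1]], index=['r0'], columns=pd.Index(['a', 'b'], name='ID')),
    pd.DataFrame([[2, 3]], index=['r1'], columns=pd.Index(['a', 'c'], name='ID')),
    # Illustrative frame whose columns carry a different name.
    pd.DataFrame([[4, 5]], index=['r2'], columns=pd.Index(['a', 'd'], name='other')),
]

result = pd.concat(frames, sort=True)
# Assign the majority names explicitly, regardless of what concat kept.
result.columns = result.columns.set_names(majority_axis_names(frames))
```

Here 'ID' appears on two of the three frames, so it is the name that ends up on the result's columns.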

Expected Output

pd.concat([
    pd.DataFrame([[0, 1]], index=['r0'], columns=pd.Index(['a', 'b'], name='ID')),
    pd.DataFrame([[2, 3]], index=['r1'], columns=pd.Index(['a', 'c'], name='ID')),
], sort=True)

# out:
# ID  a    b    c      <-- name 'ID' should be retained, since there is no ambiguity
# r0  0  1.0  NaN
# r1  2  NaN  3.0


Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.5
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None
@WillAyd
Member

WillAyd commented Jun 26, 2018

I think there's a typo in your first example. That said, what is the use case for this? It seems kind of counter-intuitive to me to have two different Index objects with the same name. You could just as easily assign that name after the concat.
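
As a rough sketch of that workaround, the name can be re-assigned with `rename_axis` once the concat is done. The frames are the ones from the report; the variable names are only illustrative.

```python
import pandas as pd

frames = [
    pd.DataFrame([[0, 1]], index=['r0'], columns=pd.Index(['a', 'b'], name='ID')),
    pd.DataFrame([[2, 3]], index=['r1'], columns=pd.Index(['a', 'c'], name='ID')),
]

# Set the columns-axis name explicitly after concatenating,
# so the result does not depend on whether concat preserved it.
result = pd.concat(frames, sort=True).rename_axis('ID', axis='columns')
```

This requires the caller to know the name in advance, which is the limitation the report is about.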

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jun 26, 2018
@jorisvandenbossche jorisvandenbossche removed the Needs Info Clarification about behavior needed to assess issue label Jun 26, 2018
@jorisvandenbossche
Member

Fixed the typo.

what is the use case for this? Seems kind of counter-intuitive to me to have two different Index objects with the same name.

The use case is simply to keep the name when it is identical across the frames. There can be many reasons why the DataFrames you want to concat ended up misaligned.

However, I don't know to what extent we have prior art in pandas with regard to keeping the name or not if the indexes are not identical.

At least union seems to keep it:

In [24]: pd.Index(['a', 'b'], name='ID').union(pd.Index(['a', 'c'], name='ID'))
Out[24]: Index(['a', 'b', 'c'], dtype='object', name='ID')

which seems to me an indication that concat could also keep the name?

@jorisvandenbossche jorisvandenbossche added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 26, 2018
@pdemarti
Author

Thanks for fixing the typo.

There are many use cases. For us, the most prevalent one is when we deal with large multivariate time-series. We split them by time (the Index) for easier storage and update (typically the last few slices are most frequently updated). The columns are in the thousands, and typically their intersection is at least 99% of their union. When we concatenate these frames, we would like the name of the axis to remain.

That said, I was under the false impression that the behavior had changed in 0.23, but that is not the case (I checked many versions from 0.15.0 to 0.22.0). The reason I thought so is that, in previous versions, our code was different: it built the index union itself, reindexed all frames, and only then concatenated (this was faster). As @jorisvandenbossche pointed out, index.union() keeps the index name.

We had to change that part in response to the way 0.23 concat now handles a misaligned non-concatenating index. I still believe the behavior should be to retain the name of the index during concat, either by taking the first one (as index0.union(index1).union(index2)... does) or by taking the majority name (or the single name if they are all the same, and None otherwise).
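
The earlier approach described above (build the union of the columns, reindex every frame, and only then concat) might be sketched as follows. This is a reconstruction under assumptions, not our actual code: it relies on `Index.union` keeping the name when both operands share it, as in the `Out[24]` example earlier in the thread, and the variable names are illustrative.

```python
import functools

import pandas as pd

frames = [
    pd.DataFrame([[0, 1]], index=['r0'], columns=pd.Index(['a', 'b'], name='ID')),
    pd.DataFrame([[2, 3]], index=['r1'], columns=pd.Index(['a', 'c'], name='ID')),
]

# Fold the column indexes together; the shared name 'ID' survives the union.
columns = functools.reduce(
    lambda left, right: left.union(right),
    (df.columns for df in frames),
)

# Align every frame on the common columns first, then concatenate.
aligned = [df.reindex(columns=columns) for df in frames]
result = pd.concat(aligned)
```

Because the frames are already aligned when they reach concat, this follows the aligned case from the report, where the name is retained.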

@FANGOD
Contributor

FANGOD commented Aug 25, 2018

If the indexes of the DataFrames differ, one could copy the index name of the first DataFrame to the concatenated result. This is feasible whether it is the indexes or the index names that differ. Of course, it doesn't solve the problem fundamentally.

@dsm054
Contributor

dsm054 commented Nov 13, 2018

Is this the same as #13475? I was working on a PR for that one and it seems to handle this case as well.

@0anton

0anton commented Jun 15, 2019

Same here on pandas 0.24.2.

@phofl
Member

phofl commented Sep 11, 2020

Closing as a duplicate of #13475; this was fixed by that PR.

@phofl phofl closed this as completed Sep 11, 2020
@phofl phofl added Duplicate Report Duplicate issue or pull request and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 11, 2020
@phofl phofl modified the milestones: Contributions Welcome, No action Sep 11, 2020
@phofl phofl added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Sep 11, 2020