BUG: column labels converted to string in merge #46885

ericman93 · 2022-04-27T20:45:37Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

class Column:
    def __init__(self, name):
        self.name = name

col = Column(name='col')
df1 = pd.DataFrame({col: [1], 'X': [2]})
df2 = pd.DataFrame({col: [1], 'Y': [3]})

merged = pd.merge(left=df1, right=df2, left_index=True, right_index=True)

# Fails: the first label is now a str, not a Column instance
assert not isinstance(merged.columns.tolist()[0], str)

Issue Description

The column labels of the merged DataFrame are converted to strings, because a suffix is appended to the label that appears in both frames:

> merged.columns.tolist()
['<__main__.Column object at 0x7f41edd52d50>_x',
 'X',
 '<__main__.Column object at 0x7f41edd52d50>_y',
 'Y']
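The conversion can be reproduced without pandas. This is a minimal sketch, assuming the suffix is attached through string formatting; `add_suffix` is a hypothetical helper for illustration, not a pandas function.

```python
class Column:
    def __init__(self, name):
        self.name = name


def add_suffix(label, suffix):
    # Hypothetical helper: string formatting calls str(label), so any
    # non-string label becomes a plain str once a suffix is attached.
    return f"{label}{suffix}"


col = Column(name="col")
renamed = add_suffix(col, "_x")

print(type(renamed).__name__)  # str
print(renamed.endswith("_x"))  # True
```

Because `Column` defines neither `__str__` nor `__repr__`, the resulting string embeds the default object repr, which matches the labels shown above.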

Expected Behavior

I would expect merge to keep the column label of type __main__.Column and not convert it to a string.
Regarding the duplication, IMO it's OK to have two identical column labels and let users decide how to handle them.
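Until the behavior changes, one possible workaround is to avoid the overlap so that no suffix is applied. The sketch below assumes the right-hand copy of the shared column is not needed and drops it before merging; the name `right_only` is illustrative only.

```python
import pandas as pd


class Column:
    def __init__(self, name):
        self.name = name


col = Column(name="col")
df1 = pd.DataFrame({col: [1], "X": [2]})
df2 = pd.DataFrame({col: [1], "Y": [3]})

# Keep only the right-hand columns that do not already exist on the left,
# so merge finds no overlapping labels and has nothing to suffix.
right_only = [c for c in df2.columns if c not in df1.columns]
merged = pd.merge(df1, df2[right_only], left_index=True, right_index=True)

labels = merged.columns.tolist()
print(labels[0] is col)  # the original Column object is kept as the label
```

This does not give the duplicate-label result requested above; it only keeps the label from being stringified.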

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.8.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.2.0
Version : Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


ericman93 · 2022-04-27T20:45:57Z

I fixed this bug in #46879.

rhshadrach · 2022-05-16T20:43:11Z

I'm -1 on having duplicate columns in the result. Currently

df1 = pd.DataFrame({'col': [1], 'X': [2]})
df2 = pd.DataFrame({'col': [1], 'Y': [3]})
merged = pd.merge(left=df1, right=df2, left_index=True, right_index=True, suffixes=(None, None))

raises "ValueError: columns overlap but no suffix specified: Index(['col'], dtype='object')". Allowing duplicate columns only if they are not strings (or not integers or not ...?) is surprising to me.
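For reference, the current behavior with string labels can be checked with a small runnable version of the snippet. It is a sketch that only captures the error message; the variable `message` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col": [1], "X": [2]})
df2 = pd.DataFrame({"col": [1], "Y": [3]})

try:
    # Both suffixes are None, so the overlapping label cannot be disambiguated.
    pd.merge(df1, df2, left_index=True, right_index=True, suffixes=(None, None))
except ValueError as err:
    message = str(err)
else:
    message = None

print(message)
```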

simonjayhawkins · 2022-05-17T16:12:54Z

> I'm -1 on having duplicate columns in the result.

since pandas 1.2 we have a new mechanism to disallow duplicate column labels.

rhshadrach · 2022-05-17T20:57:16Z

@simonjayhawkins - not sure I follow. For this proposed feature, under what condition(s) on the column name dtypes does the snippet I posted raise or allow duplicates?

simonjayhawkins · 2022-05-18T08:52:47Z

needs discussion for this scenario, comment was more holistic.

going forward, it should not need to be a decision (or personal preference) on whether a method returns duplicates. duplicate column labels are a documented pandas feature (https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html#duplicate-labels) and therefore all methods should support them, work correctly with them and correctly propagate them.

and since pandas 1.2 (https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html#optionally-disallow-duplicate-labels), there is a mechanism for disallowing duplicate column labels, which means that allowing/disallowing duplicate column labels should not need to be incorporated into the api design of individual methods.

> I'm -1 on having duplicate columns in the result.

sure. users not wanting duplicate column labels are accommodated and will be able to use .set_flags(allows_duplicate_labels=False) going forward.

rhshadrach · 2022-05-24T22:01:35Z

@simonjayhawkins

> users not wanting duplicate column labels are accommodated and will be able to use .set_flags(allows_duplicate_labels=False) going forward.

I do not think this is correct. Index labels and column labels are not the same. Duplicate index labels occur often in frames that I work with, yet I never allow duplicate column labels because they are almost impossible to work with. I am not able to use this setting because it forbids both duplicate index and column labels, and I don't think my usage/experience is niche.
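The point that the flag covers both axes can be checked directly. A minimal sketch, with illustrative frame names, that tries to set the flag on one frame with duplicate index labels and one with duplicate column labels:

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

dup_index = pd.DataFrame({"a": [1, 2]}, index=["x", "x"])
dup_columns = pd.DataFrame([[1, 2]], columns=["a", "a"])

rejected = {}
for name, frame in {"index": dup_index, "columns": dup_columns}.items():
    try:
        # The flag applies to the whole object, not to a single axis.
        frame.set_flags(allows_duplicate_labels=False)
    except DuplicateLabelError:
        rejected[name] = True
    else:
        rejected[name] = False

print(rejected)  # {'index': True, 'columns': True}
```

Both frames are rejected, so a frame that legitimately has duplicate index labels cannot use the flag to guard only its columns.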

ericman93 added the Bug and Needs Triage labels Apr 27, 2022

ericman93 mentioned this issue Apr 27, 2022 in Merge nonstring columns #46879 (Closed)

simonjayhawkins added the Reshaping label and removed the Needs Triage label May 4, 2022

simonjayhawkins changed the title from "BUG: complex type columns converted to string in merge" to "BUG: column labels converted to string in merge" May 4, 2022

simonjayhawkins added this to the Contributions Welcome milestone May 4, 2022

jreback modified the milestones: Contributions Welcome, 1.5 May 20, 2022

mroeschke removed this from the 1.5 milestone Aug 15, 2022
