BUG: index.name not preserved in concat in case of unequal object index #13475

Closed
dllahr opened this issue Jun 17, 2016 · 20 comments · Fixed by #35338
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@dllahr

dllahr commented Jun 17, 2016

xref #13742 for addl cases.

In [23]: df1 = pd.DataFrame({'a':[1,2]}, index=pd.Index(['a', 'b'], name='idx'))

In [24]: df2 = pd.DataFrame({'b':[2,3]}, index=pd.Index(['b', 'c'], name='idx'))

In [26]: pd.concat([df1, df2], axis=1)
Out[26]:
     a    b
a  1.0  NaN
b  2.0  2.0
c  NaN  3.0

In [27]: print pd.concat([df1, df2], axis=1).index.name
None

So the issue seems to be specific to string (object) indexes that are not equal: when the indexes of the two frames are equal (no NaNs are introduced), the name is kept, and it is also kept when using numerical indexes; see #13475 (comment).


When I use the concat function with input dataframes that have index.name assigned, sometimes the resulting dataframe has the index.name assigned, sometimes it does not.

I ran the code below from the python interpreter, using a conda environment with pandas-0.18.1

I don't see any odd or extra characters around the "pert_well" column in either of the files.

Code Sample, a copy-pastable example if possible

import pandas
from io import StringIO  # on Python 2: from StringIO import StringIO

a_data = """x_amount_mg x_annotation    x_mmoles_per_liter  mfc_plate_name  x_avg_mol_weight    x_volume_ul pert_mfc_desc   pert_iname  x_purity    pert_id_vendor  pert_well   pert_vehicle    pert_mfc_id x_smiles    x_mg_per_ml pert_dose_unit  pert_dose   pert_id pert_plate  pert_type
0.04784 ACCEPT  10.0    B-REPO-01-B64-101   405.4084    11  Taltirelin  Taltirelin  86.52   HY-B0596    C18 DMSO    BRD-K93869735-001-01-1  CN1C(=O)C[C@H](NC1=O)C(=O)N[C@@H](Cc1cnc[nH]1)C(=O)N1CCC[C@H]1C(N)=O    4.054084    um  20.0    BRD-K93869735   PMEL008 trt_cp"""

b_data = """pert_well   pert_2_type pert_2_id   pert_2_mfc_id   pert_2_mfc_desc pert_2_id_vendor    pert_2_iname    pert_2_dose pert_2_dose_unit    pert_2_vehicle  pert_3_type pert_3_id   pert_3_mfc_id  pert_3_mfc_desc pert_3_id_vendor    pert_3_iname    pert_3_dose pert_3_dose_unit    pert_3_vehicle
A01 ctl_vehicle DMSO    DMSO    DMSO    -666    DMSO    -666    -666    -666    ctl_untrt   CMAP-000    -666    UnTrt   -666    -666    -666    -666    -666"""

d_data = """x_amount_mg x_annotation    x_mmoles_per_liter  mfc_plate_name  x_avg_mol_weight    x_volume_ul pert_mfc_desc   pert_iname  x_purity    pert_id_vendor  pert_well   pert_vehicle    pert_mfc_id x_smiles    x_mg_per_ml pert_dose_unit  pert_dose   pert_id pert_plate  pert_type
0.0 -666    -666    B-REPO-01-B64-107   -666    0   -666    -666    -666    -666    A01 -666    -666    -666    -666    -666    -666    CMAP-000    PMEL001 ctl_untrt"""

a = pandas.read_csv(StringIO(a_data), sep="\t", index_col="pert_well")
b = pandas.read_csv(StringIO(b_data), sep="\t", index_col="pert_well")
c = pandas.concat([a,b], axis=1)
c.index

d = pandas.read_csv(StringIO(d_data), sep="\t", index_col="pert_well")
e = pandas.concat([d,b], axis=1)
e.index

results:

Index([u'A01', u'A02', u'A03', u'A04', u'A05', u'A06', u'A07', u'A08', u'A09',
       u'A10',
       ...
       u'P15', u'P16', u'P17', u'P18', u'P19', u'P20', u'P21', u'P22', u'P23',
       u'P24'],
      dtype='object', length=384)

Index([u'A01', u'A02', u'A03', u'A04', u'A05', u'A06', u'A07', u'A08', u'A09',
       u'A10',
       ...
       u'P15', u'P16', u'P17', u'P18', u'P19', u'P20', u'P21', u'P22', u'P23',
       u'P24'],
      dtype='object', name=u'pert_well', length=384)

Expected Output

c.index.name should be "pert_well"

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-573.7.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

PMEL_input_files_for_pandas_issue.zip

@jreback
Contributor

jreback commented Jun 17, 2016

pls make this copy pastable rather than using actual files

@dllahr
Author

dllahr commented Jun 17, 2016

Do you want me to paste in tab-delimited text?

@dllahr
Author

dllahr commented Jun 17, 2016

Done.

@jreback
Contributor

jreback commented Jun 17, 2016

@dllahr pls edit so I can literally copy and paste it,
e.g. something like

import pandas as pd
from StringIO import StringIO
data = """
......
"""
df = pd.read_csv(StringIO(data))

then one can simply copy and paste. This is most useful for reproductions, but also to facilitate checking if this is even an existing bug.

@vkcelik

vkcelik commented Jun 21, 2016

I think I have the same problem. Should be reproducible with this code:

import pandas as pd
try:
  from io import StringIO
except ImportError:
  from StringIO import StringIO

data1 = "Cores\tServer\n20,000\tS000\n-20,000\tS003\n16,000\tS140\n2,000\tS148\n2,000\tS149\n"

data2 = "Cores\tServer\n20,000\tS103\n16,000\tS140\n2,000\tS148\n2,000\tS149\n4,000\tS150\n"

df1 = pd.read_csv(StringIO(data1), sep='\t', index_col=['Server'], decimal=',')
df2 = pd.read_csv(StringIO(data2), sep='\t', index_col=['Server'], decimal=',')

df1.rename(columns=lambda x: x + '_1', inplace=True)
df2.rename(columns=lambda x: x + '_2', inplace=True)

joined = pd.concat([df1, df2], axis=1)

print(df1)
print(df2)
print(joined)
print(joined.index.name)
print(joined.index.names)

I expected the output of print(joined.index.names) to be ['Server'], but it is [None].

Can anybody reproduce this? Is this expected behavior, and if so why?

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: None
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8

@dllahr
Author

dllahr commented Jul 12, 2016

Can you remove the can't Repro label, because the new comment has the formatting you require?

@jreback
Contributor

jreback commented Jul 12, 2016

@dllahr how so? need an example that one can simply, literally copy-paste. Don't use files.
Instead:

data = """
filedata you need
"""
df = pd.read_csv(StringIO(data))
.....

@jreback
Contributor

jreback commented Jul 12, 2016

@dllahr pls put this actual example in and replace the top part of the PR in that case. It's too confusing.

@dllahr
Author

dllahr commented Jul 12, 2016

Well there's the comment by vkcelik before I did anything.

@dllahr
Author

dllahr commented Jul 12, 2016

How about now?

@jreback
Contributor

jreback commented Jul 13, 2016

The tabs don't reproduce when copy-pasting, and the example should be much simpler. The idea is to make this into a test. If it's too much effort, it won't happen.
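For reference, a test along the lines described here might look like the sketch below. The test name is hypothetical and this is not the test that was eventually added to pandas; the frames are built directly, so no tab-delimited text is involved. It is expected to fail on affected versions such as 0.18.1 and to pass on versions that include the fix referenced in the issue header (#35338).

```python
import pandas as pd


def test_concat_axis1_preserves_index_name_unequal_object_index():
    # Hypothetical test name; the frames mirror the simplified example in this issue.
    df1 = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["a", "b"], name="idx"))
    df2 = pd.DataFrame({"b": [2, 3]}, index=pd.Index(["b", "c"], name="idx"))

    result = pd.concat([df1, df2], axis=1)

    # The row labels are the union of both indexes ...
    assert list(result.index) == ["a", "b", "c"]
    # ... and the shared index name should survive the concat.
    assert result.index.name == "idx"
```

Run it with pytest, or simply call the function directly.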

@dllahr
Author

dllahr commented Jul 13, 2016

Awesome. If the problem is related to parsing the tab-delimited input, then how are you going to reproduce it? I've given you more than enough information to reproduce this bug; if you choose to ignore it, that's your business. I'll keep that in mind next time someone asks me about Pandas.

@jreback
Contributor

jreback commented Jul 13, 2016

@dllahr that's quite a poor attitude

@dllahr
Author

dllahr commented Jul 13, 2016

I agree you are displaying a poor attitude.

@shoyer
Member

shoyer commented Jul 13, 2016

@dllahr please see http://stackoverflow.com/help/mcve

Tools like pandas are successful because of community contributions. We get a lot of bug reports, so every bit of help making an issue easier to reproduce helps.

@jorisvandenbossche
Member

@dllahr As an example, I tried to reproduce this with a simple example:

In [23]: df1 = pd.DataFrame({'a':[1,2]}, index=pd.Index(['a', 'b'], name='idx'))

In [24]: df2 = pd.DataFrame({'b':[2,3]}, index=pd.Index(['b', 'c'], name='idx'))

In [26]: pd.concat([df1, df2], axis=1)
Out[26]:
     a    b
a  1.0  NaN
b  2.0  2.0
c  NaN  3.0

In [27]: print pd.concat([df1, df2], axis=1).index.name
None

So the issue seems to be with a string index that is not equal, as when the index of the two frames is equal (no NaNs are introduced), the name is kept:

In [29]: df2.index = pd.Index(['a', 'b'], name='idx')

In [30]: pd.concat([df1, df2], axis=1)
Out[30]:
     a  b
idx
a    1  2
b    2  3

In [31]: print pd.concat([df1, df2], axis=1).index.name
idx

and when using a numerical index, the name is also kept:

In [32]: df1.index = pd.Index([0, 1], name='idx')

In [33]: df2.index = pd.Index([1, 2], name='idx')

In [34]: pd.concat([df1, df2], axis=1)
Out[34]:
       a    b
idx
0    1.0  NaN
1    2.0  2.0
2    NaN  3.0

In [35]: print pd.concat([df1, df2], axis=1).index.name
idx
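The three cases above can also be re-run as one Python 3 script. This is only a sketch: the helper `concat_index_name` is a made-up name for illustration, and no expected output is given for the unequal object index because that result depends on whether the installed pandas includes the fix.

```python
import pandas as pd


def concat_index_name(left_labels, right_labels, name="idx"):
    """Concat two one-column frames along axis=1 and return the result's index name."""
    df1 = pd.DataFrame({"a": [1, 2]}, index=pd.Index(left_labels, name=name))
    df2 = pd.DataFrame({"b": [2, 3]}, index=pd.Index(right_labels, name=name))
    return pd.concat([df1, df2], axis=1).index.name


# Label pairs taken from the transcript above.
cases = {
    "unequal object index": (["a", "b"], ["b", "c"]),
    "equal object index": (["a", "b"], ["a", "b"]),
    "unequal numeric index": ([0, 1], [1, 2]),
}

names = {label: concat_index_name(*labels) for label, labels in cases.items()}
for label, found in names.items():
    print(f"{label}: index.name = {found!r}")
```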

@jorisvandenbossche jorisvandenbossche added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Can't Repro labels Jul 13, 2016
@jorisvandenbossche jorisvandenbossche changed the title index.name not assigned sometimes when using concat BUG: index.name not preserved in concat in case of unequal object index Jul 13, 2016
@jreback jreback added this to the Next Major Release milestone Jul 13, 2016
@dllahr
Author

dllahr commented Jul 13, 2016

@jorisvandenbossche thank you for finding a much better example.

@shoyer Thank you for that, but I provided the minimal example that I could deduce given my time constraints and abilities. If I could have provided a simpler example, I would have. As some feedback for you (collectively), given this "experience" I'm pretty sure next time I just won't bother reporting any bug or issue I find.

@TomAugspurger
Contributor

Apologies if any of that came off as brusque @dllahr, we're trying to manage the crazy number of issues efficiently. We need to leverage your first-hand experience with the bug as much as possible to diagnose the issue. Thanks for the report!

@0anton

0anton commented Jun 15, 2019

experiencing the same bug on pandas 0.24.2
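For anyone stuck on an affected version, one possible workaround is to put the name back after the concat. The sketch below is an assumption about how that could be done, not something proposed in this thread; `concat_keep_index_name` is a hypothetical helper that only restores the name when every input frame shares the same index name.

```python
import pandas as pd


def concat_keep_index_name(frames, **kwargs):
    """Hypothetical helper: concat along axis=1 and restore a shared index name."""
    frames = list(frames)
    result = pd.concat(frames, axis=1, **kwargs)
    names = {frame.index.name for frame in frames}
    # Only restore the name when it was dropped and all inputs agree on it.
    if result.index.name is None and len(names) == 1:
        result.index = result.index.rename(names.pop())
    return result


df1 = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["a", "b"], name="idx"))
df2 = pd.DataFrame({"b": [2, 3]}, index=pd.Index(["b", "c"], name="idx"))

joined = concat_keep_index_name([df1, df2])
print(joined.index.name)  # idx
```

On versions where concat already preserves the name, the helper leaves the result untouched.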

@iamlemec
Contributor

I believe I have a 2 line fix to union_indexes that takes care of this. Should I submit a PR or just paste the diff here?
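The diff itself is not shown in this comment, so the following is not the actual patch. It is only a sketch of the general idea of a name-preserving union, written against the public `Index.union` and `Index.rename` methods; `union_keep_common_name` is a hypothetical function, not part of pandas.

```python
from functools import reduce

import pandas as pd


def union_keep_common_name(indexes):
    """Hypothetical sketch: union several indexes, keeping the name only if all share it."""
    indexes = list(indexes)
    combined = reduce(lambda left, right: left.union(right), indexes)
    names = {index.name for index in indexes}
    common = names.pop() if len(names) == 1 else None
    return combined.rename(common)


left = pd.Index(["a", "b"], name="idx")
right = pd.Index(["b", "c"], name="idx")

combined = union_keep_common_name([left, right])
print(combined.name)  # idx
```

If the inputs disagree on the name, the sketch falls back to `None` rather than picking one arbitrarily.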

@jreback jreback modified the milestones: Contributions Welcome, 1.2 Aug 6, 2020
@jreback jreback added Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses labels Aug 7, 2020