Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading from Excel mangles columns #10523

Closed
atheyjohnc opened this issue Jul 7, 2015 · 6 comments
Closed

Reading from Excel mangles columns #10523

atheyjohnc opened this issue Jul 7, 2015 · 6 comments
Labels
Enhancement IO Excel read_excel, to_excel

Comments

@atheyjohnc
Copy link

As far as I can tell, both methods of reading an Excel sheet (through ExcelFile and read_excel) do not have the option to avoid mangling duplicate columns, which exists for read_csv (#3468).

I have an Excel file that looks like this:

  foo  foo  bar  bar  baz  baz
    A    B    A    B    A    B
  123  456  789   12  345  678
  901  234  567  890  123  456
  789   12  345  678  901  234

I initially encountered this problem while trying to find a workaround for not being able to specify a multirow-header (#4679). Reading in this Excel file with header = 0 (default) mangles the column names:

>>> df = pd.read_excel("dupe_cols.xlsx", header = 0)
>>> print df
   foo foo.1  bar bar.1  baz baz.1
0    A     B    A     B    A     B
1  123   456  789    12  345   678
2  901   234  567   890  123   456
3  789    12  345   678  901   234

Unlike read_csv, read_excel does not have the option to avoid mangling duplicate columns (using ExcelFile.parse works the same as far as I can see). Specifying header = None in read_excel and then assigning the column names to the first row will effectively allow you to avoid mangling the column names. You could also read in the Excel sheet with header = None, save a .csv with header = False and index = False, then read that csv and specify mangle_dupe_cols = False to get the dataframe you want. (Incidentally, this also allows you to specify multiple rows as the header, which is the behavior I was originally trying to emulate.)

@jreback
Copy link
Contributor

jreback commented Jul 7, 2015

can you post a small example that can be used as a tests in the top section (e.g. create a frame, write to excel), show the arguments used to read back

@jreback jreback added this to the Next Major Release milestone Jul 7, 2015
@gfyoung gfyoung modified the milestones: Contributions Welcome, No action Nov 13, 2018
@gfyoung
Copy link
Member

gfyoung commented Nov 13, 2018

Because we pass in **kwds from read_excel to TextFileReader, we actually do accept mangle_dupe_cols now. 🎉 We just need to surface it explicitly.

Otherwise, this issue is superseded by #13262.

@geminizb
Copy link

I'm afraid ExcelFile.parse is not changed yet

@dhimmel
Copy link
Contributor

dhimmel commented Apr 6, 2020

With pandas 1.0.3 I got Setting mangle_dupe_cols=False is not supported yet.

~/anaconda3/lib/python3.6/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    332         convert_float=convert_float,
    333         mangle_dupe_cols=mangle_dupe_cols,
--> 334         **kwds,
    335     )
    336 

~/anaconda3/lib/python3.6/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    886             convert_float=convert_float,
    887             mangle_dupe_cols=mangle_dupe_cols,
--> 888             **kwds,
    889         )
    890 

~/anaconda3/lib/python3.6/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, dtype, true_values, false_values, skiprows, nrows, na_values, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    510                     usecols=usecols,
    511                     mangle_dupe_cols=mangle_dupe_cols,
--> 512                     **kwds,
    513                 )
    514 

~/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in TextParser(*args, **kwds)
   2199     """
   2200     kwds["engine"] = "python"
-> 2201     return TextFileReader(*args, **kwds)
   2202 
   2203 

~/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    865         self._currow = 0
    866 
--> 867         options = self._get_options_with_defaults(engine)
    868 
    869         self.chunksize = options.pop("chunksize", None)

~/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _get_options_with_defaults(self, engine)
    893             # see gh-12935
    894             if argname == "mangle_dupe_cols" and not value:
--> 895                 raise ValueError("Setting mangle_dupe_cols=False is not supported yet")
    896             else:
    897                 options[argname] = value

ValueError: Setting mangle_dupe_cols=False is not supported yet

@talatccan
Copy link

I got ValueError: Setting mangle_dupe_cols=False is not supported yet for pandas version 1.0.4

Code:

df = pd.read_excel(DATA_PATH, mangle_dupe_cols=False)

Error Details:

 File "site-packages\pandas\io\excel\_base.py", line 334, in read_excel
    **kwds,
  File "site-packages\pandas\io\excel\_base.py", line 888, in parse
    **kwds,
  File "site-packages\pandas\io\excel\_base.py", line 512, in parse
    **kwds,
  File "site-packages\pandas\io\parsers.py", line 2201, in TextParser
    return TextFileReader(*args, **kwds)
  File "site-packages\pandas\io\parsers.py", line 867, in __init__
    options = self._get_options_with_defaults(engine)
  File "site-packages\pandas\io\parsers.py", line 895, in _get_options_with_defaults
    raise ValueError("Setting mangle_dupe_cols=False is not supported yet")
ValueError: Setting mangle_dupe_cols=False is not supported yet

@ronald8192
Copy link

Using pandas-1.2.0, got an error: ValueError: Setting mangle_dupe_cols=False is not supported yet

>>> pd.read_excel('test.xlsx',mangle_dupe_cols=False)
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 344, in read_excel
    data = io.parse(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1153, in parse
    return self._reader.parse(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 532, in parse
    parser = TextParser(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 2224, in TextParser
    return TextFileReader(*args, **kwds)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 801, in __init__
    options = self._get_options_with_defaults(engine)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 829, in _get_options_with_defaults
    raise ValueError("Setting mangle_dupe_cols=False is not supported yet")
ValueError: Setting mangle_dupe_cols=False is not supported yet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

7 participants