Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Auto Detect engine in read_excel #35416

Closed
2 of 3 tasks
fzumstein opened this issue Jul 26, 2020 · 8 comments · Fixed by #38710
Closed
2 of 3 tasks

ENH: Auto Detect engine in read_excel #35416

fzumstein opened this issue Jul 26, 2020 · 8 comments · Fixed by #38710
Assignees
Labels
Enhancement IO Excel read_excel, to_excel
Milestone

Comments

@fzumstein
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

This works:

import pandas as pd
df = pd.read_excel('Book2.xlsb', engine='pyxlsb')

This doesn't:

import pandas as pd
df = pd.read_excel('Book2.xlsb')

Problem description

All supported Excel formats (xls, xlsx, xlsm) have a default engine based on the extension, so you can simply do:

df = pd.read_excel('Book2.xlsx')

For xlsb, when you do df = pd.read_excel('Book2.xlsb'), you get this error:

XLRDError: Excel 2007 xlsb file; not supported

Expected Output

No error, i.e. it should figure out that for xlsb extensions, pyxlsb is the default engine.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 1.0.5
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 49.2.0.post20200714
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : None

@fzumstein fzumstein added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 26, 2020
@fzumstein
Copy link
Author

Attached a sample file to test.
Book2.xlsb.zip

@simonjayhawkins simonjayhawkins added Enhancement IO Excel read_excel, to_excel and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 27, 2020
@simonjayhawkins
Copy link
Member

Thanks @fzumstein for the report. I'm not sure of the history here, but the release note when the xlsb functionality was added is

:func:read_excel now can read binary Excel (.xlsb) files by passing engine='pyxlsb'. For more details and example usage, see the :ref:Binary Excel files documentation <io.xlsb>. Closes :issue:8540.

so I assume that this was a conscious design choice and therefore labelling as an enhancement.

@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Jul 27, 2020
@fzumstein
Copy link
Author

fzumstein commented Jul 27, 2020

Hi @simonjayhawkins, thanks for your reply and I am not familiar with the history either, it just hit me as an inconsistency.

@WillAyd
Copy link
Member

WillAyd commented Jul 27, 2020

I don't think the readers infer the appropriate engine based off of the file extension; we historically only ever used xlrd but in the past year or longer have added a lot more without adding auto detection

I think would take a PR to auto detect if there's a reasonable way to do it

@WillAyd WillAyd changed the title BUG: read_excel with xlsb files require the engine to be set explicitly ENH: Auto Detect engine in read_excel Jul 27, 2020
@fzumstein
Copy link
Author

@WillAyd I see - what about just refactoring _get_default_writer(ext) into something like _get_default_engine(ext, mode)? Would possibly fit into this open refactor: #28547

@WillAyd
Copy link
Member

WillAyd commented Jul 27, 2020

I think its' separate from that issue number but the code you've linked to is a great reference. We do it on the writer side, so if you would like to implement something on the reader side and push a PR would definitely be appreciated!

@roberthdevries
Copy link
Contributor

Related to #34252, as that ticket mentions the fact that the settings that are used to auto-detect the engine do not work. Which makes sense as they are not used at all.

@lithomas1
Copy link
Member

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants