ENH: Auto Detect engine in read_excel #35416

fzumstein · 2020-07-26T21:26:31Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

This works:

import pandas as pd
df = pd.read_excel('Book2.xlsb', engine='pyxlsb')

This doesn't:

import pandas as pd
df = pd.read_excel('Book2.xlsb')

Problem description

All supported Excel formats (xls, xlsx, xlsm) have a default engine based on the extension, so you can simply do:

df = pd.read_excel('Book2.xlsx')

For xlsb, when you do df = pd.read_excel('Book2.xlsb'), you get this error:

XLRDError: Excel 2007 xlsb file; not supported

Expected Output

No error, i.e. it should figure out that for xlsb extensions, pyxlsb is the default engine.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None

pandas : 1.0.5
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 49.2.0.post20200714
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : None

The text was updated successfully, but these errors were encountered:

fzumstein · 2020-07-26T21:27:26Z

Attached a sample file to test.
Book2.xlsb.zip

simonjayhawkins · 2020-07-27T15:36:00Z

Thanks @fzumstein for the report. I'm not sure of the history here, but the release note when the xlsb functionality was added is

:func:read_excel now can read binary Excel (.xlsb) files by passing engine='pyxlsb'. For more details and example usage, see the :ref:Binary Excel files documentation <io.xlsb>. Closes :issue:8540.

so I assume that this was a conscious design choice and therefore labelling as an enhancement.

fzumstein · 2020-07-27T16:01:56Z

Hi @simonjayhawkins, thanks for your reply and I am not familiar with the history either, it just hit me as an inconsistency.

WillAyd · 2020-07-27T17:28:51Z

I don't think the readers infer the appropriate engine based off of the file extension; we historically only ever used xlrd but in the past year or longer have added a lot more without adding auto detection

I think would take a PR to auto detect if there's a reasonable way to do it

fzumstein · 2020-07-27T18:02:57Z

@WillAyd I see - what about just refactoring _get_default_writer(ext) into something like _get_default_engine(ext, mode)? Would possibly fit into this open refactor: #28547

WillAyd · 2020-07-27T18:05:04Z

I think its' separate from that issue number but the code you've linked to is a great reference. We do it on the writer side, so if you would like to implement something on the reader side and push a PR would definitely be appreciated!

roberthdevries · 2020-08-23T15:30:25Z

Related to #34252, as that ticket mentions the fact that the settings that are used to auto-detect the engine do not work. Which makes sense as they are not used at all.

lithomas1 · 2020-12-26T15:22:44Z

take

fzumstein added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 26, 2020

simonjayhawkins added Enhancement IO Excel read_excel, to_excel and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 27, 2020

simonjayhawkins added this to the Contributions Welcome milestone Jul 27, 2020

WillAyd changed the title ~~BUG: read_excel with xlsb files require the engine to be set explicitly~~ ENH: Auto Detect engine in read_excel Jul 27, 2020

mzeitlin11 mentioned this issue Dec 25, 2020

BUG: read_excel does not use pyxlsb for xlsb files when engine is None #38667

Closed

github-actions bot assigned lithomas1 Dec 26, 2020

rhshadrach mentioned this issue Dec 27, 2020

ENH: Add xlsb auto detection to read_excel and respect default options #38710

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.3 Jan 1, 2021

jreback closed this as completed in #38710 Jan 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Auto Detect engine in read_excel #35416

ENH: Auto Detect engine in read_excel #35416

fzumstein commented Jul 26, 2020

INSTALLED VERSIONS

fzumstein commented Jul 26, 2020

simonjayhawkins commented Jul 27, 2020

fzumstein commented Jul 27, 2020 •

edited

Loading

WillAyd commented Jul 27, 2020

fzumstein commented Jul 27, 2020

WillAyd commented Jul 27, 2020

roberthdevries commented Aug 23, 2020

lithomas1 commented Dec 26, 2020

ENH: Auto Detect engine in read_excel #35416

ENH: Auto Detect engine in read_excel #35416

Comments

fzumstein commented Jul 26, 2020

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

fzumstein commented Jul 26, 2020

simonjayhawkins commented Jul 27, 2020

fzumstein commented Jul 27, 2020 • edited Loading

WillAyd commented Jul 27, 2020

fzumstein commented Jul 27, 2020

WillAyd commented Jul 27, 2020

roberthdevries commented Aug 23, 2020

lithomas1 commented Dec 26, 2020

Output of `pd.show_versions()`

fzumstein commented Jul 27, 2020 •

edited

Loading