shift default excel read engine from xlrd to openpyxl #38424

leo4183 · 2020-12-12T14:14:14Z

since v2.0.0, xlrd no longer supports excel files other than ".xls". manually specify pd.read_excel engine (to openpyxl and etc) is a bit annoying (otherwise pandas would complain about the missing xlrd module). is it possible to shift the default engine to sth more commonly used (eg. openpyxl )

rhshadrach · 2020-12-12T14:55:57Z

Thanks for raising this, I was not aware of the recent changes to xlrd. In pandas 1.2, if you have openpyxl installed it will be used as the default for reading non-xls files. See https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.2.0.html for more info.

With the change to xlrd, I think the messages in the whatsnew and potentially warning messages need to be updated. cc @jorisvandenbossche @jreback

jorisvandenbossche · 2020-12-12T15:12:49Z

@rhshadrach see #28547 (comment) and my answer below. I think we indeed should check for this version in pandas, and update the error message (and whatsnew message) to make it clearer.

cjw296 · 2020-12-12T15:16:13Z

I'm sorry for causing you folk extra work :-(

simonjayhawkins · 2020-12-13T12:55:37Z

@rhshadrach see #28547 (comment) and my answer below. I think we indeed should check for this version in pandas, and update the error message (and whatsnew message) to make it clearer.

added the blocker label.

rhshadrach · 2020-12-13T15:24:56Z

@jorisvandenbossche My understanding is that, other than ods-streams, only xlrd can read buffer-like values (i.e. non-path) of path_or_buffer in read_excel. As such, one thought is (roughly) to define:

is_xlrd_only = "(non-ods buffer) or (xls-type extension) or (xlrd.Book instance)"

Then:

Emit a FutureWarning if:
- engine was passed as None;
- not is_xlrd_only;
- openpyxl is not installed (as would use openpyxl); and
- and xlrd_version < 2
Emit a ValueError anytime (even if engine was not passed as None):
- engine was determined to be "xlrd";
- not is_xlrd_only; and
- xlrd_version >= 2

One case I want to highlight is when read_excel is passed a buffer to a non-xls/ods format. With the logic above, we'd be passing this onto xlrd without warning. I don't know if there is a good way to tell if the buffer can handled by xlrd.

jorisvandenbossche · 2020-12-13T15:53:29Z

only xlrd can read buffer-like values (i.e. non-path)

But only if it is a buffer of a file type it does support? Eg with xlrd >= 2.0, it won't read a buffer-like of an xlsx file?

Emit a ValueError anytime (even if engine was not passed as None): ....

Yes, that sounds correct, because for those conditions, xlrd (>= 2.0) would error anyway (I think), and so we want to raise the error, so we can give it a pandas-specific error message.

One case I want to highlight is when read_excel is passed a buffer to a non-xls/ods format. With the logic above, we'd be passing this onto xlrd without warning. I don't know if there is a good way to tell if the buffer can handled by xlrd.

Yeah, see @cjw296's comment here: #28547 (comment). He mentions an inspect_format which might help here. That could maybe be used if xlrd is installed, and if it's version >= 2.0, so we can check if the buffer is readable by xlrd or not.

rhshadrach · 2020-12-13T15:57:44Z

Thanks @jorisvandenbossche, I had missed the inspect_format part. I think that solves the part where I had reservations.

cjw296 · 2020-12-13T17:33:31Z

@rhshadrach - I'm the maintainer of xlrd, so I wanted to be clear on the intention here: xlrd should only be used for .xls files, please can you emit a warning whenever it is used for anything else, no matter what version of xlrd is installed?

It absolutely should not be used for opening .xlsx files, or any other form of zip/xml files: there are security vulnerabilities lurking, it will break on Python 3.9 and it will fail at parsing some files which openpyxl will do a much better job of.

Please, I'm begging you, do not put in hacks which try and suggest users can get away with "sticking with xlrd 1.2" or whatever...

jorisvandenbossche · 2020-12-13T17:53:09Z

please can you emit a warning whenever it is used for anything else?

That's already the case on master / on 1.2rc, see #35029

The whatsnew notice about this (the first warning box in the 1.2 release notes) can be seen here: https://pandas.pydata.org/docs/dev/whatsnew/v1.2.0.html
Specific feedback on the messaging is certainly welcome.

jorisvandenbossche · 2020-12-13T17:58:35Z

Actually, re-reading that, I suppose we do not raise a warning when you manually specify engine="xlrd" but are reading a non-xls file. This is not the default (the default of engine=None will either use openpyxl if installed, or otherwise use xlrd with warning), and in the past we always used xlrd automatically in this case, so I don't expect many users to manually specify this, but that's still a case where we should warn as well, then.

cjw296 · 2020-12-13T18:24:49Z

I've thrown in #38456 for my preferred wording related to this, and also how I think pandas should behave, but to summarise here:

if .xls, then use xlrd.
if engine='xlrd' explicitly specified, just let it through. A nice to have here would be to warn if it's being used for a non-'.xls' file.
in other cases, use a more appropriate library or throw an exception that the file cannot be safely parsed.

FWIW, you may well want to copy the inspection code and the one constant it needs into pandas, for which you have my explicit permission. That would allow you to dispatch bytes streams using the same logic I describe above.

jorisvandenbossche · 2020-12-13T18:58:47Z

how I think pandas should behave, but to summarise here:

As far as I understand it, that's exactly how pandas now behaves (except for the potential warning if engine="xlrd" is specified explicitly but used for non-xls files)

cjw296 · 2020-12-13T19:24:53Z

As far as I understand it, that's exactly how pandas now behaves (except for the potential warning if engine="xlrd" is specified explicitly but used for non-xls files)

I'm afraid not:

this means that if the file has no extension, or bytes rather than a path are passed, xlrd will be preferred.
this defaults to 'xlrd', whereas I would prefer it raise an "unsupported exception".

That second one might be made more palatable by using the code I linked to in order to pick a better engine, in the case where you're trying to read a byte stream or from a file that has an incorrect or not-present file extension.

jorisvandenbossche · 2020-12-13T19:49:49Z

this means that if the file has no extension, or bytes rather than a path are passed, xlrd will be preferred.

Yes, that's the one case that @rhshadrach raised above as well (#38424 (comment)), and for which we gladly use the code in xlrd to better detect whether it is actually a xls file or not.

this defaults to 'xlrd', whereas I would prefer it raise an "unsupported exception".

As I also just commented in #38456 about this: in pandas we decided to do this change with a deprecation warning. So we will raise an exception in the future when xlrd gets being used for non-xls files; but for now only a warning.

I should have been clearer, and say "that's exactly how pandas now intents to behave" (as far as I understand it), so taking into account that we first do a warning before raising an exception, and except for the corner cases being discussed in this issue (for buffer objects we don't yet check the exact file type + also when specifying engine="xlrd" explicitly, we should probably warn if used with non-xls files)

cjw296 · 2020-12-13T20:20:28Z

Just to echo what I said in the PR comments:

In an ideal world, I would prevent anyone using xlrd < 2 from now onwards. It's not a good idea, even for .xls, and it's a positively dangerous idea for .xlsx.
Moving to pandas 1.2 would require a user to upgrade pandas, at the very least. If a user is already upgrading one package, it would be a perfect time to upgrade to xlrd 2 and/or openpyxl.

If I worked up a PR to do the guessing xlrd.inspect_format does, but in the pandas codebase, in return for warning whenever anyone uses xlrd < 2 and raising an exception, rather than using xlrd as the default, would it be merged for the pandas 1.2 release?

jorisvandenbossche · 2020-12-15T09:12:48Z

@jreback said (#38456 (review)):

we still have the question of what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl?

do we have testing with > 2 xlrd?

jorisvandenbossche · 2020-12-15T09:41:12Z

what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl?

Indeed, as mentioned above, I think we should check for this case, so we can raise a custom, pandas specific error message that is more informative for the pandas case (pandas supports other formats like xlsx, the error message from xlrd could give the impression that we don't)

But so the main question we need to answer is: do we want to allow users to still use xlrd (< 2.0) to read non-xls files, with warning that this is deprecated? (more or less what is now implemented in master) Or do we directly want to disallow this and raise an exception?

@cjw296 (xlrd maintainer) argues for directly disallowing, given the potential security issues.
And the reason we currently only deprecate it, is to give users a smoother upgrade experience: updating pandas will start raising a warning they should no longer use xlrd (only if openpyxl is not installed, otherwise that will already be used automatically), but not break their code.

I am personally leaning towards keeping the warning (but we could make the message stronger about the risks of further using it), but happy to go with the majority.

cc @WillAyd @roberthdevries @bashtage

cjw296 · 2020-12-16T13:10:56Z

I've opened #38522, so let's follow up there.

jorisvandenbossche · 2020-12-16T13:44:23Z

Let's keep the actual discussion on what general behaviour we want here (it's already scattered enough), specific code review can of course happen in #38522

cjw296 · 2020-12-16T13:49:26Z

From #38522: I've gone ahead and implemented the content-based engine inference I suggested, and I believe the outcome is now better all round:

.xlsb files will now pick the correct engine rather than falling through and returning an obscure error from xlrd
engines will now be correctly picked, even if the file is mis-named. (this happens more than many might realise: I wonder how many web sites are still serving up html tables at urls with .xls extensions because Excel will open this)
should a file be indecipherable, we'll still try based on file extension (there are some truly ancient excel files where this applies, but I couldn't find a .xls to add as a test...)
you'll now always get simple, clear optional dependency exceptions telling you which library you need to install.
you can still explicitly specify xlrd as the engine if you want to use if really want to stick on xlrd 1.2 and experience the bugs and security risks associated with it.
the code is, well, in my opinion at least, a lot simpler and clearer.

jorisvandenbossche · 2020-12-16T19:33:57Z

@cjw296 Thanks a lot for that! That's a nice clean-up / improvement in any case, regardless of whether we decide to fallback from openpyxl to xlrd or not (in case openpyxl is not installed)

I am hoping that some more people chime in here with their thoughts on this. As long as that doesn't happen, I don't want to decide changing the behaviour we added in the #35029 (if openpyxl is not installed, still continue to use xlrd to backcompat, but with warning about it being deprecated and the security issues).

jreback · 2020-12-16T22:35:50Z

about this comment

But so the main question we need to answer is: do we want to allow users to still use xlrd (< 2.0) to read non-xls files, with warning that this is deprecated? (more or less what is now implemented in master) Or do we directly want to disallow this and raise an exception?

I would say absolutely yes. just because you updated pandas doesn't mean we need to break because you don't update xlrd. these are two separate things and we don't hard break like this. While I appreciate @cjw296 desire to not support xlrd anymore which is totally fine. There are a great many intallations out there and its not great to just 'break them'. We are already deprecating this, so I don't think much to do (beyond handling the situation if xlrd >2 is installed, which its clear, we must break and point to openpyxl).

cjw296 · 2020-12-19T07:42:27Z

Just to echo my comment on #38571: I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2.

This isn't a "hard break", no code changes are required for users, simply installing one more package at a time when you're already upgrading other packages.

It's probably worth noting too that bending over backwards to support a broken and insecure version of a package is going to leave pandas with more convoluted code in this area going forwards..

gfyoung · 2020-12-22T22:41:54Z

This is a tough call, but strictly from a package maintenance perspective, I'm inclined to take the more conservative approach that @jreback is advocating. Although it is extremely belated, an API change like this still risks downstream breakage, which is likely to cause more chaos and headache.

It's probably worth noting too that bending over backwards to support a broken and insecure version of a package is going to leave pandas with more convoluted code in this area going forwards.

Perhaps, but moments like those also present themselves as opportunities to get people to move over to openpyxl. Rather than acquiesce and try to fix with xlrd, challenge them first to use the preferred engine and then evaluate afterwards.

#38522 (comment)

Largely agree with what @jorisvandenbossche is saying, but I would also add that the "lukewarm" response @cjw296 you are receiving isn't because we are not interested in making the switch. We are trying to execute it carefully to minimize issues. The fact we were unaware of it for so long is unfortunate, but from a strategic POV, it also is a sunk cost at this point.

Deprecations in pandas are tracked and are eventually rectified in future releases. This deprecation cycle is not that different from others, and it will get done once we merge the PR's into master.

leo4183 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 12, 2020

rhshadrach added Deprecate Functionality to remove in pandas Docs and removed Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 12, 2020

rhshadrach added this to the 1.2 milestone Dec 12, 2020

rhshadrach added the IO Excel read_excel, to_excel label Dec 12, 2020

rhshadrach self-assigned this Dec 12, 2020

Besler mentioned this issue Dec 12, 2020

Move from xlrd to openpyxl Bonelab/Bonelab#15

Merged

simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Dec 13, 2020

cjw296 mentioned this issue Dec 13, 2020

DOC: update wording about when xlrd engine can be used #38456

Merged

berland mentioned this issue Dec 14, 2020

XLRDError from most recent xlrd equinor/fmu-tools#88

Closed

asishm mentioned this issue Dec 15, 2020

ENH: Default Excel engine only compatible with XLS files -> replace by new one #38498

Closed

simonjayhawkins pinned this issue Dec 15, 2020

topper-123 unpinned this issue Dec 16, 2020

cjw296 mentioned this issue Dec 16, 2020

Better inference of spreadsheet formats. #38522

Closed

rhshadrach mentioned this issue Dec 18, 2020

DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571

Merged

5 tasks

jreback closed this as completed in #38571 Dec 23, 2020

dgdekoning mentioned this issue Jan 7, 2021

Replace use of xlrd with openpyxl brightway-lca/brightway2-io#86

Closed

phofl mentioned this issue Jan 19, 2021

BUG: read_excel with openpyxl results in empty data frame #39001

Closed

1 task

georgerichardson mentioned this issue Nov 30, 2021

11 fetch preprocess gdp nestauk/industrial_taxonomy#16

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

shift default excel read engine from xlrd to openpyxl #38424

shift default excel read engine from xlrd to openpyxl #38424

leo4183 commented Dec 12, 2020

rhshadrach commented Dec 12, 2020

jorisvandenbossche commented Dec 12, 2020

cjw296 commented Dec 12, 2020

simonjayhawkins commented Dec 13, 2020

rhshadrach commented Dec 13, 2020 •

edited

Loading

jorisvandenbossche commented Dec 13, 2020 •

edited

Loading

rhshadrach commented Dec 13, 2020

cjw296 commented Dec 13, 2020 •

edited

Loading

jorisvandenbossche commented Dec 13, 2020

jorisvandenbossche commented Dec 13, 2020

cjw296 commented Dec 13, 2020

jorisvandenbossche commented Dec 13, 2020

cjw296 commented Dec 13, 2020

jorisvandenbossche commented Dec 13, 2020

cjw296 commented Dec 13, 2020

jorisvandenbossche commented Dec 15, 2020

jorisvandenbossche commented Dec 15, 2020 •

edited

Loading

cjw296 commented Dec 16, 2020 •

edited

Loading

jorisvandenbossche commented Dec 16, 2020

cjw296 commented Dec 16, 2020 •

edited

Loading

jorisvandenbossche commented Dec 16, 2020

jreback commented Dec 16, 2020

cjw296 commented Dec 19, 2020 •

edited

Loading

gfyoung commented Dec 22, 2020 •

edited

Loading

shift default excel read engine from xlrd to openpyxl #38424

shift default excel read engine from xlrd to openpyxl #38424

Comments

leo4183 commented Dec 12, 2020

rhshadrach commented Dec 12, 2020

jorisvandenbossche commented Dec 12, 2020

cjw296 commented Dec 12, 2020

simonjayhawkins commented Dec 13, 2020

rhshadrach commented Dec 13, 2020 • edited Loading

jorisvandenbossche commented Dec 13, 2020 • edited Loading

rhshadrach commented Dec 13, 2020

cjw296 commented Dec 13, 2020 • edited Loading

jorisvandenbossche commented Dec 13, 2020

jorisvandenbossche commented Dec 13, 2020

cjw296 commented Dec 13, 2020

jorisvandenbossche commented Dec 13, 2020

cjw296 commented Dec 13, 2020

jorisvandenbossche commented Dec 13, 2020

cjw296 commented Dec 13, 2020

jorisvandenbossche commented Dec 15, 2020

jorisvandenbossche commented Dec 15, 2020 • edited Loading

cjw296 commented Dec 16, 2020 • edited Loading

jorisvandenbossche commented Dec 16, 2020

cjw296 commented Dec 16, 2020 • edited Loading

jorisvandenbossche commented Dec 16, 2020

jreback commented Dec 16, 2020

cjw296 commented Dec 19, 2020 • edited Loading

gfyoung commented Dec 22, 2020 • edited Loading

rhshadrach commented Dec 13, 2020 •

edited

Loading

jorisvandenbossche commented Dec 13, 2020 •

edited

Loading

cjw296 commented Dec 13, 2020 •

edited

Loading

jorisvandenbossche commented Dec 15, 2020 •

edited

Loading

cjw296 commented Dec 16, 2020 •

edited

Loading

cjw296 commented Dec 16, 2020 •

edited

Loading

cjw296 commented Dec 19, 2020 •

edited

Loading

gfyoung commented Dec 22, 2020 •

edited

Loading