shift default excel read engine from xlrd to openpyxl #38424
Comments
Thanks for raising this; I was not aware of the recent changes to xlrd. In pandas 1.2, if you have openpyxl installed, it will be used as the default for reading non-xls files. See https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.2.0.html for more info. With the change to xlrd, I think the messages in the whatsnew, and potentially the warning messages, need to be updated. cc @jorisvandenbossche @jreback
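For illustration, the default described in this comment (openpyxl for non-xls files when it is installed, xlrd otherwise) could be sketched roughly as follows. This is not the pandas implementation; `default_engine` and its `is_installed` parameter are hypothetical names used only here.

```python
import importlib.util


def default_engine(suffix, is_installed=None):
    """Pick a reader engine name for a file suffix (hypothetical helper).

    Assumption: mirrors the behaviour described for pandas 1.2, where
    openpyxl is preferred for non-xls files when it is importable.
    """
    if is_installed is None:
        # Check importability without actually importing the package.
        def is_installed(name):
            return importlib.util.find_spec(name) is not None

    if suffix.lower() == ".xls":
        # Legacy binary workbooks stay with xlrd.
        return "xlrd"
    if is_installed("openpyxl"):
        return "openpyxl"
    # Fallback discussed in this thread: xlrd (only works with xlrd < 2.0).
    return "xlrd"
```

The `is_installed` hook exists only so the choice can be exercised without changing the local environment.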
@rhshadrach see #28547 (comment) and my answer below. I think we indeed should check for this version in pandas, and update the error message (and whatsnew message) to make it clearer. |
I'm sorry for causing you folk extra work :-( |
added the blocker label. |
@jorisvandenbossche My understanding is that, other than ods-streams, only xlrd can read buffer-like values (i.e. non-path) of
Then:
One case I want to highlight is when read_excel is passed a buffer to a non-xls/ods format. With the logic above, we'd be passing this on to xlrd without warning. I don't know if there is a good way to tell if the buffer can be handled by xlrd.
But only if it is a buffer of a file type it does support? Eg with xlrd >= 2.0, it won't read a buffer-like of an xlsx file?
Yes, that sounds correct, because for those conditions,
Yeah, see @cjw296's comment here: #28547 (comment). He mentions an |
Thanks @jorisvandenbossche, I had missed the inspect_format part. I think that solves the part where I had reservations. |
@rhshadrach - I'm the maintainer of xlrd, so I wanted to be clear on the intention here: xlrd should only be used for It absolutely should not be used for opening Please, I'm begging you, do not put in hacks which try and suggest users can get away with "sticking with xlrd 1.2" or whatever... |
That's already the case on master / on 1.2rc, see #35029. The whatsnew notice about this (the first warning box in the 1.2 release notes) can be seen here: https://pandas.pydata.org/docs/dev/whatsnew/v1.2.0.html
Actually, re-reading that, I suppose we do not raise a warning when you manually specify |
I've thrown in #38456 for my preferred wording related to this, and also how I think pandas should behave, but to summarise here:
FWIW, you may well want to copy the inspection code and the one constant it needs into pandas, for which you have my explicit permission. That would allow you to dispatch bytes streams using the same logic I describe above. |
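A rough sketch of what content-based dispatch for byte streams might look like. This is not xlrd's inspection code; `sniff_format` is a hypothetical name, and the archive member names checked below are assumptions about how the formats are typically laid out.

```python
import io
import zipfile

# Header of an OLE2 compound document, the container used by .xls files.
# Other OLE2 files (e.g. old .doc) share it, so "xls" here is only a guess.
XLS_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Local-file header that starts a zip archive (xlsx, xlsb and ods are zips).
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_format(stream):
    """Guess the spreadsheet format of a seekable binary stream.

    Returns "xls", "xlsx", "xlsb", "ods", or None when nothing matches.
    The stream position is restored before returning.
    """
    start = stream.tell()
    head = stream.read(8)
    stream.seek(start)

    if head.startswith(XLS_SIGNATURE):
        return "xls"
    if not head.startswith(ZIP_SIGNATURE):
        return None

    # ZipFile does not close a file object that was passed in.
    with zipfile.ZipFile(stream) as archive:
        names = set(archive.namelist())
    stream.seek(start)

    if "xl/workbook.xml" in names:
        return "xlsx"
    if "xl/workbook.bin" in names:
        return "xlsb"
    if "content.xml" in names:
        return "ods"
    return None


# Usage: an in-memory zip with the member an xlsx workbook is assumed to have.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
buffer.seek(0)
detected = sniff_format(buffer)
```

With a result like this, a caller could pick an engine for a buffer the same way it would for a file path with a trustworthy extension.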
As far as I understand it, that's exactly how pandas now behaves (except for the potential warning if |
I'm afraid not:
That second one might be made more palatable by using the code I linked to in order to pick a better engine, in the case where you're trying to read a byte stream or from a file that has an incorrect or not-present file extension. |
Yes, that's the one case that @rhshadrach raised above as well (#38424 (comment)), and for which we gladly use the code in xlrd to better detect whether it is actually a xls file or not.
As I also just commented in #38456 about this: in pandas we decided to make this change with a deprecation warning. So we will raise an exception in the future when xlrd is used for non-xls files, but for now only a warning. I should have been clearer and said "that's exactly how pandas now intends to behave" (as far as I understand it), taking into account that we first warn before raising an exception, and excepting the corner cases being discussed in this issue (for buffer objects we don't yet check the exact file type + also when specifying
Just to echo what I said in the PR comments:
If I worked up a PR to do the guessing |
@jreback said (#38456 (review)):
Indeed, as mentioned above, I think we should check for this case, so we can raise a custom, pandas-specific error message that is more informative for the pandas case (pandas supports other formats like xlsx; the error message from xlrd could give the impression that we don't). So the main question we need to answer is: do we want to allow users to still use xlrd (< 2.0) to read non-xls files, with a warning that this is deprecated (more or less what is now implemented in master)? Or do we want to disallow this directly and raise an exception? @cjw296 (xlrd maintainer) argues for directly disallowing, given the potential security issues. I am personally leaning towards keeping the warning (but we could make the message stronger about the risks of continuing to use it), but I am happy to go with the majority.
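The two options in this comment can be made concrete with a small sketch. It assumes the version string and detected format are already known; `check_xlrd_usage` is a hypothetical helper, and the messages are illustrative rather than the wording pandas uses.

```python
import warnings


def check_xlrd_usage(xlrd_version, file_format):
    """Decide what to do before handing a file to xlrd (hypothetical helper).

    - xls files: always fine.
    - non-xls with xlrd >= 2.0: raise, pointing the user at openpyxl.
    - non-xls with xlrd < 2.0: allow for now, but warn that it is deprecated.
    """
    if file_format == "xls":
        return

    major = int(xlrd_version.split(".")[0])
    if major >= 2:
        raise ValueError(
            f"xlrd {xlrd_version} only reads .xls files; "
            f"install openpyxl to read {file_format} files."
        )

    warnings.warn(
        f"Reading {file_format} files with xlrd is deprecated and may be "
        "unsafe; install openpyxl instead.",
        FutureWarning,
        stacklevel=2,
    )
```

Under the stricter proposal, the final `warnings.warn` call would simply become a second `raise`.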
I've opened #38522, so let's follow up there. |
Let's keep the actual discussion on what general behaviour we want here (it's already scattered enough), specific code review can of course happen in #38522 |
From #38522: I've gone ahead and implemented the content-based engine inference I suggested, and I believe the outcome is now better all round:
@cjw296 Thanks a lot for that! That's a nice clean-up / improvement in any case, regardless of whether we decide to fall back from openpyxl to xlrd or not (in case openpyxl is not installed). I am hoping that some more people chime in here with their thoughts on this. As long as that doesn't happen, I don't want to change the behaviour we added in #35029 (if openpyxl is not installed, still continue to use xlrd for backward compatibility, but with a warning about it being deprecated and the security issues).
about this comment
I would say absolutely yes. Just because you updated pandas doesn't mean we need to break because you don't update xlrd. These are two separate things and we don't hard break like this. While I appreciate @cjw296's desire to not support xlrd anymore, which is totally fine, there are a great many installations out there and it's not great to just 'break them'. We are already deprecating this, so I don't think there is much to do (beyond handling the situation if xlrd >= 2.0 is installed, where it's clear we must break and point to openpyxl).
Just to echo my comment on #38571: I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2. This isn't a "hard break": no code changes are required for users, who simply install one more package at a time when they're already upgrading other packages. It's probably worth noting too that bending over backwards to support a broken and insecure version of a package is going to leave pandas with more convoluted code in this area going forwards.
This is a tough call, but strictly from a package maintenance perspective, I'm inclined to take the more conservative approach that @jreback is advocating. Although it is extremely belated, an API change like this still risks downstream breakage, which is likely to cause more chaos and headache.
Perhaps, but moments like those also present themselves as opportunities to get people to move over to Largely agree with what @jorisvandenbossche is saying, but I would also add that the "lukewarm" response @cjw296 you are receiving isn't because we are not interested in making the switch. We are trying to execute it carefully to minimize issues. The fact we were unaware of it for so long is unfortunate, but from a strategic POV, it also is a sunk cost at this point. Deprecations in |
Since v2.0.0, xlrd no longer supports Excel files other than ".xls". Manually specifying the pd.read_excel engine (openpyxl, etc.) is a bit annoying (otherwise pandas complains about the missing xlrd module). Is it possible to shift the default engine to something more commonly used (e.g. openpyxl)?
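For reference, the manual workaround mentioned here amounts to passing `engine=` to `pd.read_excel`. A minimal user-side sketch follows; `choose_engine`, `read_workbook`, and the suffix mapping are hypothetical conveniences, not part of pandas.

```python
from pathlib import Path

# Assumed mapping from file suffix to a pandas engine name.
ENGINE_BY_SUFFIX = {
    ".xls": "xlrd",
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xlsb": "pyxlsb",
    ".ods": "odf",
}


def choose_engine(path):
    """Return the engine name for a path, based only on its suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return ENGINE_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no engine known for suffix {suffix!r}") from None


def read_workbook(path, **kwargs):
    """Read a workbook with an explicitly chosen engine."""
    import pandas as pd  # imported lazily; requires pandas plus the engine package

    return pd.read_excel(path, engine=choose_engine(path), **kwargs)
```

This relies on the file extension being correct, which is exactly the limitation that content-based detection is meant to address.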