Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: update wording about when xlrd engine can be used #38456

Merged
merged 8 commits into from
Dec 23, 2020

Conversation

cjw296
Copy link
Contributor

@cjw296 cjw296 commented Dec 13, 2020

extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd`` will
be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not made the code changes related to this line or the change on line 121 (these changes are echoed in the section around line ~900), but that's what I think the approach should be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, so that will be the approach in the future. But in pandas we decided to first raise a warning about it (in case the user only upgraded pandas, and not xlrd).
So for now we should keep this sentence (but maybe we should add something like "... and this ability will be removed in a future version" to make this more explicit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I respect the choice of approach, I can't really condone it. As I said above: I have concerns around security with xlrd<2, and so anything that can be done to stop people using it would be a very good thing.

If a user is upgrading such that they have pandas 1.2, that would be a very good time for them to also upgrade to xlrd 2 and whatever the latest openpyxl release is...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not made the code changes related to this..., but that's what I think the approach should be.

Just so I understand the intentions here, are you planning to make those code changes (taking into account @jorisvandenbossche request)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an open question, see #38424 (comment)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for those edits!

when `defusedxml <https://github.com/tiran/defusedxml/>`_ is installed, and
is anticipated to be unusable in the future.
anything but ``.xls`` files when specifying ``engine="xlrd"`` to ``pd.read_excel`` will raise
an exception.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's only the case with the latest xlrd, though, I think?
(while many users will still have xlrd < 2.0)

Copy link
Contributor Author

@cjw296 cjw296 Dec 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but if I could, I would totally enforce the use of only xlrd >= 2.0 now that it is out.
To be explicit: that comes from concerns I have, particularly around security, about code that hasn't been maintained for years being used to interact with formats (zip and xml, both of which are used in xlsx) that have known security issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjw296 yes, but we cann't force anyone to update (and in all likely hood they will simply update pandas and continue using the current version of xlrd).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback - I mean, in terms of implementation, you can: raise an exception if xlrd.__version__ < '2'. Given that people will only hit this when they're already upgrading at least one package, ie: pandas, this feels reasonable to me, but pandas is not my project, so while I'd be disappointed at the security risk you're prepared to expose your users to, I'd have to accept it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so while I'd be disappointed at the security risk you're prepared to expose your users to, I'd have to accept it.

believe me users are way more exposed to security risks that this. its hard to immediately just do something which causes current code to break. yes I get your point, but the reverse is true too. people update pandas and not other packages and expect things to work. we generally like to warn if at all possible first.

extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd`` will
be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, so that will be the approach in the future. But in pandas we decided to first raise a warning about it (in case the user only upgraded pandas, and not xlrd).
So for now we should keep this sentence (but maybe we should add something like "... and this ability will be removed in a future version" to make this more explicit)

@jorisvandenbossche jorisvandenbossche changed the title adjust docs about xlrd changes to match my intentions DOC: update wording about when xlrd engine can be used Dec 13, 2020
@jorisvandenbossche jorisvandenbossche added the IO Excel read_excel, to_excel label Dec 13, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Dec 13, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes look fine to me. but don't we still have the question of what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl?

do we have testing with > 2 xlrd?

when `defusedxml <https://github.com/tiran/defusedxml/>`_ is installed, and
is anticipated to be unusable in the future.
anything but ``.xls`` files when specifying ``engine="xlrd"`` to ``pd.read_excel`` will raise
an exception.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjw296 yes, but we cann't force anyone to update (and in all likely hood they will simply update pandas and continue using the current version of xlrd).

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 15, 2020

these changes look fine to me. but don't we still have the question of what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl?

Unfortunately, the discussion around this is a little fragmented, I think #38424 (comment) addresses this. If I worked up a PR for the code I suggest there, would that be accepted? Also, would it be okay to add it as a separate commit to this PR or would you prefer a separate one?

@jorisvandenbossche
Copy link
Member

Let's keep the actual discussion about what behaviour we want in #38424 (since most happened there). @cjw296 would you be fine to focus this PR to only better document the current situation? (i.e. the change to clarify that xlrd only supports xls files)

@jreback
Copy link
Contributor

jreback commented Dec 16, 2020

I think this update is fine (and for 1.2), unless objections for THIS PR, @jorisvandenbossche @simonjayhawkins ok merging (as I think this is the only outstanding issue for 1.2)

@rhshadrach
Copy link
Member

@jreback - This PR changes the documented behavior but does not change the code; it will make the code and docs out of sync.

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 16, 2020

@jreback - ideally, I'd like to land #38522, then update this PR with whatever its outcome is, then land that.

@jorisvandenbossche
Copy link
Member

@jreback To be clear, this PR is not ready to merge, as long as we don't decide on the discussion points (please respond in #38424 ), as the current state of this PR does not reflect the situation as it is in master.

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 16, 2020

@jorisvandenbossche - I think #38522 is the place to bottom things out now.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pending discussion

@jorisvandenbossche
Copy link
Member

I removed the subset of the changes that doesn't match with the actual behaviour in master, so now it should be good to go
(I know this is also somewhat included in #38571, but I think it is nice to attribute it to @cjw296 )

@jorisvandenbossche
Copy link
Member

@cjw296 thanks for the update. I just removed the .. warning because it was already in a warning box (otherwise we could get a box in a box ..); and I copied the full warning box also to the user guide (so it's not only in the whatsnew, but also in the main documentation)

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 23, 2020

Ah sorry, yes, that was a mistake on my part. I do feel you've watered down the warning though: I feel people should be warned that opening xlsx with xlrd is explicitly unsupported, and to upgrade rather than logging issues about that.

@jorisvandenbossche
Copy link
Member

Is it about the "no longer" that I added? Happy to remove that again

@jorisvandenbossche
Copy link
Member

(for the rest I kept your sentence intact. But of course it's now within a larger warning box, instead of only that sentence, so therefore maybe a bit less standing out)

Thus, it is strongly encouraged to install ``openpyxl`` to read Excel 2007+
(``.xlsx``) files.
Please do not report issues when using ``xlrd`` to read ``.xlsx`` files.
This is no longer supported, switch to using ``openpyxl`` instead.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer these two lines to be further up, around 2938, and in their own paragraph. I mean, I'd prefer them to be bright red and bold, maybe flashing too, but I doubt you'd agree to that ;-)

would result in using the ``xlrd`` engine in many cases. If
`openpyxl <https://openpyxl.readthedocs.io/en/stable/>`_ is installed,
Previously, the default argument ``engine=None`` to :func:`~pandas.read_excel`
would result in using the ``xlrd`` engine in many cases, also for new
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"including" might be better than "also for" here.

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 23, 2020

@jorisvandenbossche - not really sure why you're doing stuff on this PR, #38522 has been closed in favour of the, quite frankly, inferior #38571. I'm kinda done with helping pandas, this has been an extremely negative experience for me, I would welcome a discussion in private, but I couldn't find your email address anywhere. Mine is chris@python.org if you fancy a chat.

would result in using the ``xlrd`` engine in many cases, also for new
would result in using the ``xlrd`` engine in many cases, including for new
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be pedantic, but the change I suggested was "also for" -> "including", not "also" -> "including".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, sorry, read to quickly

@jorisvandenbossche
Copy link
Member

not really sure why you're doing stuff on this PR, #38522 has been closed

This PR is not tied to #38522 or #38571. The reason I am pushing to this PR is because I think there were valuable improvements here to the docs, which I wanted to get merged. But for that, the subset that didn't match the actual behaviour in master needed to be removed. So that's why I started pushing that (thinking you would be reluctant to do it yourself)

@jorisvandenbossche jorisvandenbossche merged commit bd3cd37 into pandas-dev:master Dec 23, 2020
@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.2.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 23, 2020
simonjayhawkins pushed a commit that referenced this pull request Dec 23, 2020
… used (#38660)

Co-authored-by: Chris Withers <chris@simplistix.co.uk>
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants