Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: update wording about when xlrd engine can be used #38456

Merged
merged 8 commits into from
Dec 23, 2020
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 6 additions & 9 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,21 +10,18 @@ including other versions of pandas.

.. warning::

The packages `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ for reading excel
files and `xlwt <https://xlwt.readthedocs.io/en/latest/>`_ for
writing excel files are no longer maintained. These are the only engines in pandas
that support the xls format.
The `xlwt <https://xlwt.readthedocs.io/en/latest/>`_ package for writing old-style ``.xls``
excel files is no longer maintained.
The `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ package is now only for reading
old-style ``.xls`` files.

Previously, the default argument ``engine=None`` to ``pd.read_excel``
would result in using the ``xlrd`` engine in many cases. If
`openpyxl <https://openpyxl.readthedocs.io/en/stable/>`_ is installed,
many of these cases will now default to using the ``openpyxl`` engine.
See the :func:`read_excel` documentation for more details. Attempting to read
``.xls`` files or specifying ``engine="xlrd"`` to ``pd.read_excel`` will not
raise a warning. However users should be aware that ``xlrd`` is already
broken with certain package configurations, for example with Python 3.9
when `defusedxml <https://github.com/tiran/defusedxml/>`_ is installed, and
is anticipated to be unusable in the future.
anything but ``.xls`` files when specifying ``engine="xlrd"`` to ``pd.read_excel`` will raise
an exception.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's only the case with the latest xlrd, though, I think?
(while many users will still have xlrd < 2.0)

Copy link
Contributor Author

@cjw296 cjw296 Dec 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but if I could, I would totally enforce the use of only xlrd >= 2.0 now that it is out.
To be explicit: that comes from concerns I have, particularly around security, about code that hasn't been maintained for years being used to interact with formats (zip and xml, both of which are used in xlsx) that have known security issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjw296 yes, but we cann't force anyone to update (and in all likely hood they will simply update pandas and continue using the current version of xlrd).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback - I mean, in terms of implementation, you can: raise an exception if xlrd.__version__ < '2'. Given that people will only hit this when they're already upgrading at least one package, ie: pandas, this feels reasonable to me, but pandas is not my project, so while I'd be disappointed at the security risk you're prepared to expose your users to, I'd have to accept it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so while I'd be disappointed at the security risk you're prepared to expose your users to, I'd have to accept it.

believe me users are way more exposed to security risks that this. its hard to immediately just do something which causes current code to break. yes I get your point, but the reverse is true too. people update pandas and not other packages and expect things to work. we generally like to warn if at all possible first.


Attempting to use the the ``xlwt`` engine will raise a ``FutureWarning``
unless the option :attr:`io.excel.xls.writer` is set to ``"xlwt"``.
Expand Down
26 changes: 12 additions & 14 deletions pandas/io/excel/_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,25 +105,24 @@
Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".
Engine compatibility :

- "xlrd" supports most old/new Excel file formats.
- "xlrd" supports old-style Excel files (.xls).
- "openpyxl" supports newer Excel file formats.
- "odf" supports OpenDocument file formats (.odf, .ods, .odt).
- "pyxlsb" supports Binary Excel files.

.. versionchanged:: 1.2.0
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
is no longer maintained, and is not supported with
python >= 3.9. When ``engine=None``, the following logic will be
used to determine the engine.
now only supports old-style ``.xls`` files.
When ``engine=None``, the following logic will be
used to determine the engine:

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
then `odf <https://pypi.org/project/odfpy/>`_ will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the
- Otherwise if the file has the
extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd`` will
be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not made the code changes related to this line or the change on line 121 (these changes are echoed in the section around line ~900), but that's what I think the approach should be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, so that will be the approach in the future. But in pandas we decided to first raise a warning about it (in case the user only upgraded pandas, and not xlrd).
So for now we should keep this sentence (but maybe we should add something like "... and this ability will be removed in a future version" to make this more explicit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I respect the choice of approach, I can't really condone it. As I said above: I have concerns around security with xlrd<2, and so anything that can be done to stop people using it would be a very good thing.

If a user is upgrading such that they have pandas 1.2, that would be a very good time for them to also upgrade to xlrd 2 and whatever the latest openpyxl release is...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not made the code changes related to this..., but that's what I think the approach should be.

Just so I understand the intentions here, are you planning to make those code changes (taking into account @jorisvandenbossche request)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an open question, see #38424 (comment)


Specifying ``engine="xlrd"`` will continue to be allowed for the
indefinite future.
Expand Down Expand Up @@ -920,7 +919,7 @@ class ExcelFile:
"""
Class for parsing tabular excel sheets into DataFrame objects.

Uses xlrd engine by default. See read_excel for more documentation
See read_excel for more documentation

Parameters
----------
Expand All @@ -933,26 +932,25 @@ class ExcelFile:
Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``
Engine compatibility :

- ``xlrd`` supports most old/new Excel file formats.
- ``xlrd`` old-style Excel files (.xls).
- ``openpyxl`` supports newer Excel file formats.
- ``odf`` supports OpenDocument file formats (.odf, .ods, .odt).
- ``pyxlsb`` supports Binary Excel files.

.. versionchanged:: 1.2.0

The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
is no longer maintained, and is not supported with
python >= 3.9. When ``engine=None``, the following logic will be
used to determine the engine.
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
now only supports old-style ``.xls`` files.
When ``engine=None``, the following logic will be
used to determine the engine:

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
then `odf <https://pypi.org/project/odfpy/>`_ will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the
- Otherwise if the file has the
extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd``
will be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.

Specifying ``engine="xlrd"`` will continue to be allowed for the
indefinite future.
Expand Down