Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Deprecate using xlrd engine for read_excel #35029

Merged
merged 16 commits into from
Dec 1, 2020

Conversation

roberthdevries
Copy link
Contributor

This is MR #29375 but rebased to master

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Jun 29, 2020
pandas/io/excel/_base.py Show resolved Hide resolved
@@ -852,6 +853,14 @@ def __init__(self, path_or_buffer, engine=None):
ext = os.path.splitext(str(path_or_buffer))[-1]
if ext == ".ods":
engine = "odf"

if engine == "xlrd":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning should not be issued when the parameter engine="xlrd" is passed explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if the engine is deprecated, I would expect that all uses should be discouraged. Explicit or implicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xlrd is the only thing that will read legacy .xls files unfortunately, so I don't think we need to outright remove all usage of it but want the default to switch to openpyxl

Copy link
Contributor Author

@roberthdevries roberthdevries Jun 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are we already switching to openpyxl for everything other than .xls files (except of course .ods files and maybe .xlsb files)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea xlsx and xlsm files (the former I would hope is what the vast majority of people read nowadays)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now changed the default engine to openpyxl and added a check to use xlrd for .xls files
This required quite some changes to the tests and a work-around for a rounding error in openpyxl.
See https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493

@roberthdevries roberthdevries force-pushed the fix-28547-deprecate-xlrd branch 2 times, most recently from 9e6474d to 40fbf53 Compare July 1, 2020 07:35
@@ -844,14 +844,24 @@ class ExcelFile:

def __init__(self, path_or_buffer, engine=None):
if engine is None:
engine = "xlrd"
engine = "openpyxl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this actually changes the engine; I think the first step is to provide the warning that you have below that by default in the future we will switch to using openpyxl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making an exception for .xls files which have to use xlrd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that means that a warning shall only be produced for .xlsx and .xlsm files that use xlrd?
And regarding the other remark about switching the engines, that was what you asked for a couple of comments back?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right - warn first then change over time

Copy link
Contributor Author

@roberthdevries roberthdevries Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the change to use the openpyxl engine as the default has to be reverted? Or it is just that the xlrd engine is going to be removed in the future altogether and with that the support for .xls files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd Should I revert the change to make openpyxl the default engine?
And only warn when using xlrd in combination with .xlsx or .xlsm files?

@roberthdevries roberthdevries force-pushed the fix-28547-deprecate-xlrd branch 2 times, most recently from 8ed0652 to fad02a5 Compare July 3, 2020 20:23
@WillAyd
Copy link
Member

WillAyd commented Jul 3, 2020 via email

@jreback
Copy link
Contributor

jreback commented Jul 10, 2020

@roberthdevries canyou update to comments and merge master

@simonjayhawkins
Copy link
Member

@roberthdevries can you address comments?

@roberthdevries
Copy link
Contributor Author

Not until I am back from vacation in three weeks.

@roberthdevries
Copy link
Contributor Author

I have addressed the remaining comment from @WillAyd to only warn about the pending deprecation.

@@ -143,6 +143,7 @@ See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for mor
Deprecations
~~~~~~~~~~~~
- Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
- :func:`read_excel` "xlrd" engine is deprecated for all file types that can be handled by "openpyxl" because "xlrd" is no longer maintained (:issue:`28547`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reword for the benefit end users.

i.e. the default engine for read_excel is changing the the future

maybe say something like openpyxl is the recommended engine as xlrd is no longer maintained

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 521 to 525
try:
# workaround for inaccurate timestamp notation in excel
return datetime.fromtimestamp(round(cell.value.timestamp()))
except (AttributeError, OSError):
return cell.value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a work-around for a bug in openpyxl (https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493), but is only apparent when you do a round trip save to xlsx and read back xlsx using openpyxl.
As this is not tested in any unit test, this can be removed. Agreed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If adding code we should have a test. so either need to add test or can remove.

my preference would be to have this in a separate PR, so should raise pandas issue for this if removing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, remove it here and make a separate PR that includes a regression test.

@roberthdevries roberthdevries force-pushed the fix-28547-deprecate-xlrd branch from f740ed7 to 081ecf8 Compare August 26, 2020 09:02
@@ -144,6 +144,8 @@ Deprecations
~~~~~~~~~~~~
- Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
- Deprecated parameter ``dtype`` in :~meth:`Index.copy` on method all index classes. Use the :meth:`Index.astype` method instead for changing dtype(:issue:`35853`)
- :func:`read_excel` "xlrd" engine is deprecated. The recommended engine is "openpyxl" for "xlsx" and "xlsm" files, because "xlrd" is no longer maintained (:issue:`28547`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updating. it's the ``read_excel "xlrd" engine is deprecated bit that I wanted removed

IIUC the xlrd is not deprecated. it's only that that default engine used will change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it deprecated for use with xlsx files? IOW in the future, will only openpyxl be supported for xlsx? Sounds reasonable to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, that's not my understanding of the discussion in #28547

from #28547 (comment)

Considering that I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former and it does offer some functionality the latter doesn't (namely reading .xls files) but should at the very least start moving towards the latter

Indeed, the discussion has not explicitly mentioned disallowing xlrd for formats that openpyxl supports. but if the xlrd engine is not removed, we should decide now whether we would restrict it's use.

Copy link
Contributor

@bashtage bashtage Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My read of this is to only keep xlrd for xls where it is required, and to deprecate where it is not. In the long run, if xlrd breaks and no one takes over its maintenance, then we will either have to vendor xlrd or remove support for xls. In either case minimizing use of xlrd seems like a good idea to me.

I suppose to me the only point of deprecating is to start a path to removal of a feature. If there is not going to be removal in the future, then why bother with deprecation?

One could always discourage xlrd + xlsx with a noisy FutureWarning telling users that xlrd is unmaintained and they should install openpyxl for reading xlsx files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd is very specific regarding this issue. He states that we should warn first, then change the default (and maybe even remove the xlrd engine for xlsx and xlsm files altogether).
But not at this point, only a deprecation warning is asked, to notify users that this engine is no longer the preferred engine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify we should only be changing the default reader to openpyxl. I think it's fine to keep xlrd around as a YMMV situation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I just reverted the changes that made openpyxl the default. I am now very confused. I thought that this change was just about warning people about pending deprecation of the xlrd reader, not to switch them over already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See your comments of July 1 and 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where the wording is confusing, but yes we only warn now and change in the future. We always manage user-facing changes that way

engine = "odf"

elif engine == "xlrd" and ext in ("xlsx", "xlsm"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning should be in the if engine is None branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? This will mean that that people also get a warning when they ask for the default (which is still xlrd), instead of when they explicitly ask for xlrd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea so the point of it is that people who want to suppress the warnings will get a head start and explicitly request engine="openpyxl", which is a good thing to sniff out any bugs

@luk-f-a
Copy link

luk-f-a commented Aug 26, 2020

just because I didn't see it mentioned above, I thought I'd mention that openpyxl is 10x slower. So it's a good thing that the default is not being changed, because otherwise a lot of user would see an impact after upgrading.

@erfannariman
Copy link
Member

just because I didn't see it mentioned above, I thought I'd mention that openpyxl is 10x slower. So it's a good thing that the default is not being changed, because otherwise a lot of user would see an impact after upgrading.

Interesting, what did you base this on?

@luk-f-a
Copy link

luk-f-a commented Sep 17, 2020

Interesting, what did you base this on?

Testing based on files I'm working on. After seeing the deprecation notice, we switched our code, and found out our CI was taking one hour longer (!). We traced it back to openpyxl. Fortunately we were aware we had done that change. If pandas had changed the default silently, it would have taken us a long time to figure out what was going on.

@simonjayhawkins
Copy link
Member

@roberthdevries This branch has conflicts that must be resolved. can you also address @WillAyd comments. #35029 (review)

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@rhshadrach
Copy link
Member

@jreback @jorisvandenbossche

Changes made. I've adjusted the tests only as necessary to either (i) test for openpyxl as being the default or (ii) specified engine="xlrd" if openpyxl would fail and either it seemed the test was specifically for xlrd or there was no clear way to resolve the test otherwise.

I don't know how to write a test for the FutureWarning - in testing environment, I think openpyxl will always be installed and so it will never be raised. As such, I've simply removed the tests for the FutureWarning for now.

As mentioned above, I won't be available until 16:00 EST (21:00 UTC). Anyone is of course welcomed to push this over the finish line.

@jreback
Copy link
Contributor

jreback commented Nov 30, 2020

we don't have either installed in the numpy dev environment so can certainly add some basic tests that run (likely we skip everything if we don't have either installed)

@simonjayhawkins simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Nov 30, 2020
@rhshadrach
Copy link
Member

Linux py37_minimum_versions failure is pandas/tests/resample/test_deprecated.py:98, unrelated.
macOS py37 failed to start, trying again.
All other builds passed.

@rhshadrach
Copy link
Member

/azp run

@azure-pipelines
Copy link
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

@rhshadrach
Copy link
Member

@jreback tests pass; two failures below on Linux py38_np_dev are unrelated and passed on the previous run.

FAILED pandas/tests/series/indexing/test_indexing.py::test_loc_setitem_2d_to_1d_raises
FAILED pandas/tests/util/test_show_versions.py::test_show_versions - Assertio...

@simonjayhawkins
Copy link
Member

@jreback tests pass; two failures below on Linux py38_np_dev are unrelated and passed on the previous run.

FAILED pandas/tests/series/indexing/test_indexing.py::test_loc_setitem_2d_to_1d_raises
FAILED pandas/tests/util/test_show_versions.py::test_show_versions - Assertio...

these tests are failing on other PRs too.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some formatting considerations. ping when pushed as ok for merge. cc @jorisvandenbossche

.. warning::

Previously, the default argument ``engine=None`` to ``pd.read_excel``
would result in using the xlrd engine in many cases. The engine xlrd is no longer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double back-tick on xlrd (alt can put a link to xlrd itself, e.g. https://xlrd.readthedocs.io/en/latest/)

following logic is now used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backtick xlrd / odf (only put the docs link on L14)

- "xlrd" supports most old/new Excel file formats.
- "openpyxl" supports newer Excel file formats.
- "odf" supports OpenDocument file formats (.odf, .ods, .odt).
- "pyxlsb" supports Binary Excel files.

.. versionchanged:: 1.2.0
The engine xlrd is no longer maintained, and is not supported with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think need a blank line here to render (make this section the same as in the whatsnew as per formatting)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, for versionchanged this is OK (rst .. ;-))

maintained, and is not supported with python >= 3.9. When ``engine=None``, the
following logic is now used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
- Otherwise if openpyxl is installed, then openpyxl will be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python >= 3.9. When ``engine=None``, the following logic will be
used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously as much of the formatting you can do here as well

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment on the whatsnew, but it's perfectly fine to only address this later after the RC as well, it's not a blocker

- "xlrd" supports most old/new Excel file formats.
- "openpyxl" supports newer Excel file formats.
- "odf" supports OpenDocument file formats (.odf, .ods, .odt).
- "pyxlsb" supports Binary Excel files.

.. versionchanged:: 1.2.0
The engine xlrd is no longer maintained, and is not supported with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, for versionchanged this is OK (rst .. ;-))

Comment on lines 18 to 21
- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
- Otherwise if openpyxl is installed, then openpyxl will be used.
- Otherwise xlrd will be used and a ``FutureWarning`` will be raised.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe rearrange this list: the most important piece of information we want to convey here is that for xlsx files the default changed from xlrd to openpyxl, if installed. So I would also put that on top of the list (or keep it just to this for the whatsnew, as the other items didn't change. The full list is still in the actual docs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually look at the extension or the file format when determining the engine in various cases, so it isn't just changing for xlsx files, right? What do you think of this:

Previously, the default argument ``engine=None`` to ``pd.read_excel``
would result in using the `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ engine
in many cases. The engine ``xlrd`` is no longer
maintained, and is not supported with python >= 3.9. If 
`openpyxl <https://pypi.org/project/openpyxl/>`_  is installed, many of these 
cases will now default to using the ``openpyxl`` engine. See the
:func:`read_excel` docs for more details.

@jreback
Copy link
Contributor

jreback commented Dec 1, 2020

Small comment on the whatsnew, but it's perfectly fine to only address this later after the RC as well, it's not a blocker

ideally we do this now, because these docs are important (and small change)

@jreback jreback merged commit b3a3932 into pandas-dev:master Dec 1, 2020
@jreback
Copy link
Contributor

jreback commented Dec 1, 2020

thanks @rhshadrach and @roberthdevries for this, very nice!

we may need some tweeks during the rc but can be later

@rhshadrach
Copy link
Member

Thanks @jreback. Happy to support if issues arise.

@rhshadrach rhshadrach mentioned this pull request Dec 4, 2020
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Dec 20, 2020
xlrd 1.2 fails if defusedxml (needed for odf) is installed

Bug: pandas-dev/pandas#35029
Bug-Debian: https://bugs.debian.org/976620
Origin: upstream b3a3932af6aafaa2fd41f17e9b7995643e5f92eb
Author: Robert de Vries, Rebecca N. Palmer <rebecca_palmer@zoho.com>
Forwarded: not-needed


Gbp-Pq: Name xlrd_976620.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Jan 15, 2021
xlrd 1.2 fails if defusedxml (needed for odf) is installed

Bug: pandas-dev/pandas#35029
Bug-Debian: https://bugs.debian.org/976620
Origin: upstream b3a3932af6aafaa2fd41f17e9b7995643e5f92eb
Author: Robert de Vries, Rebecca N. Palmer <rebecca_palmer@zoho.com>
Forwarded: not-needed


Gbp-Pq: Name xlrd_976620.patch
@roberthdevries roberthdevries deleted the fix-28547-deprecate-xlrd branch March 14, 2022 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deprecate using xlrd engine in favor of openpyxl