DEPR: Deprecate using xlrd engine for read_excel #35029
Thanks @roberthdevries
Can you use the method described here https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#testing-warnings to test for warnings?
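For context, the linked section of the contributing guide recommends the `tm.assert_produces_warning` context manager. A minimal standard-library sketch of the same idea follows; `legacy_call` and `collect_warnings` are hypothetical names used only for illustration and are not part of pandas.

```python
import warnings


def legacy_call():
    """Hypothetical function that emits the warning under test."""
    warnings.warn("this default will change", FutureWarning, stacklevel=2)
    return "xlrd"


def collect_warnings(func, *args, **kwargs):
    """Run func and return (result, list of warning categories raised)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" keeps the warning registry from hiding repeated warnings.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [w.category for w in caught]


result, categories = collect_warnings(legacy_call)
assert FutureWarning in categories
```

Inside the pandas test suite the helper from `pandas._testing` would be the natural choice; the version above only shows what such a check verifies.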
pandas/io/excel/_base.py
Outdated
@@ -852,6 +853,14 @@ def __init__(self, path_or_buffer, engine=None):
ext = os.path.splitext(str(path_or_buffer))[-1]
if ext == ".ods":
    engine = "odf"

if engine == "xlrd":
The warning should not be issued when the parameter engine="xlrd" is passed explicitly.
Hmm, if the engine is deprecated, I would expect that all uses should be discouraged. Explicit or implicit.
xlrd is the only thing that will read legacy .xls files unfortunately, so I don't think we need to outright remove all usage of it but want the default to switch to openpyxl
So are we already switching to openpyxl for everything other than .xls files (except of course .ods files and maybe .xlsb files)?
Yea xlsx and xlsm files (the former I would hope is what the vast majority of people read nowadays)
I have now changed the default engine to openpyxl and added a check to use xlrd for .xls files
This required quite some changes to the tests and a work-around for a rounding error in openpyxl.
See https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493
9e6474d
to
40fbf53
Compare
pandas/io/excel/_base.py
Outdated
@@ -844,14 +844,24 @@ class ExcelFile:

def __init__(self, path_or_buffer, engine=None):
    if engine is None:
-        engine = "xlrd"
+        engine = "openpyxl"
So this actually changes the engine; I think the first step is to provide the warning that you have below that by default in the future we will switch to using openpyxl
Making an exception for .xls files which have to use xlrd
So that means that a warning shall only be produced for .xlsx and .xlsm files that use xlrd?
And regarding the other remark about switching the engines, that was what you asked for a couple of comments back?
That's right - warn first then change over time
So the change to use the openpyxl engine as the default has to be reverted? Or is it just that the xlrd engine is going to be removed altogether in the future, and with it the support for .xls files?
@WillAyd Should I revert the change to make openpyxl the default engine?
And only warn when using xlrd in combination with .xlsx or .xlsm files?
8ed0652
to
fad02a5
Compare
Yea, we should warn first; we can actually change the default in 2.0
On Jul 3, 2020, at 1:25 PM, Robert de Vries ***@***.***> wrote:
@roberthdevries commented on this pull request.
In pandas/io/excel/_base.py:
> @@ -844,14 +844,24 @@ class ExcelFile:
def __init__(self, path_or_buffer, engine=None):
if engine is None:
- engine = "xlrd"
+ engine = "openpyxl"
@WillAyd Should I revert the change to make openpyxl the default engine?
And only warn when using xlrd in combination with .xlsx or .xlsm files?
@roberthdevries can you update per the comments and merge master?
@roberthdevries can you address the comments?
Not until I am back from vacation in three weeks.
fad02a5
to
45e8193
Compare
I have addressed the remaining comment from @WillAyd to only warn about the pending deprecation.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -143,6 +143,7 @@ See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for mor
Deprecations
~~~~~~~~~~~~
- Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
- :func:`read_excel` "xlrd" engine is deprecated for all file types that can be handled by "openpyxl" because "xlrd" is no longer maintained (:issue:`28547`).
Can you reword this for the benefit of end users, i.e. the default engine for read_excel is changing in the future?
Maybe say something like: openpyxl is the recommended engine, as xlrd is no longer maintained.
done
pandas/io/excel/_openpyxl.py
Outdated
try:
    # workaround for inaccurate timestamp notation in excel
    return datetime.fromtimestamp(round(cell.value.timestamp()))
except (AttributeError, OSError):
    return cell.value
why is this changing?
This was a work-around for a bug in openpyxl (https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493), but is only apparent when you do a round trip save to xlsx and read back xlsx using openpyxl.
As this is not tested in any unit test, this can be removed. Agreed?
If adding code we should have a test. so either need to add test or can remove.
my preference would be to have this in a separate PR, so should raise pandas issue for this if removing.
Fair enough, remove it here and make a separate PR that includes a regression test.
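To make the effect of the workaround discussed in this thread concrete, here is a small standalone sketch of the same rounding trick. `round_to_second` is a hypothetical name, and the function takes a plain value rather than an openpyxl cell, so it illustrates the technique only and is not the code that was in the PR.

```python
from datetime import datetime


def round_to_second(value):
    """Round a datetime to the nearest second; pass anything else through."""
    try:
        # timestamp() yields float seconds; rounding drops sub-second noise.
        return datetime.fromtimestamp(round(value.timestamp()))
    except (AttributeError, OSError):
        # Not a datetime (no .timestamp()), or outside the platform's range.
        return value


almost_next_second = datetime(2020, 1, 1, 12, 0, 0, 999999)
print(round_to_second(almost_next_second))
```

A regression test for the follow-up PR could assert that a timestamp written to .xlsx and read back compares equal after this kind of rounding.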
f740ed7
to
081ecf8
Compare
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -144,6 +144,8 @@ Deprecations
~~~~~~~~~~~~
- Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
- Deprecated parameter ``dtype`` in :~meth:`Index.copy` on method all index classes. Use the :meth:`Index.astype` method instead for changing dtype(:issue:`35853`)
- :func:`read_excel` "xlrd" engine is deprecated. The recommended engine is "openpyxl" for "xlsx" and "xlsm" files, because "xlrd" is no longer maintained (:issue:`28547`).
Thanks for updating. It's the 'read_excel "xlrd" engine is deprecated' bit that I wanted removed.
IIUC xlrd is not deprecated; it's only that the default engine used will change.
Is it deprecated for use with xlsx files? IOW in the future, will only openpyxl be supported for xlsx? Sounds reasonable to me.
hmm, that's not my understanding of the discussion in #28547
from #28547 (comment)
Considering that I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former and it does offer some functionality the latter doesn't (namely reading .xls files) but should at the very least start moving towards the latter
Indeed, the discussion has not explicitly mentioned disallowing xlrd for formats that openpyxl supports. But if the xlrd engine is not removed, we should decide now whether we would restrict its use.
My read of this is to only keep xlrd for xls where it is required, and to deprecate where it is not. In the long run, if xlrd breaks and no one takes over its maintenance, then we will either have to vendor xlrd or remove support for xls. In either case minimizing use of xlrd seems like a good idea to me.
I suppose to me the only point of deprecating is to start a path to removal of a feature. If there is not going to be removal in the future, then why bother with deprecation?
One could always discourage xlrd + xlsx with a noisy FutureWarning telling users that xlrd is unmaintained and they should install openpyxl for reading xlsx files.
@WillAyd is very specific regarding this issue. He states that we should warn first, then change the default (and maybe even remove the xlrd engine for xlsx and xlsm files altogether).
But not at this point, only a deprecation warning is asked, to notify users that this engine is no longer the preferred engine.
Just to clarify we should only be changing the default reader to openpyxl. I think it's fine to keep xlrd around as a YMMV situation
Hmm, I just reverted the changes that made openpyxl the default. I am now very confused. I thought that this change was just about warning people about pending deprecation of the xlrd reader, not to switch them over already.
See your comments of July 1 and 3.
I see where the wording is confusing, but yes we only warn now and change in the future. We always manage user-facing changes that way
pandas/io/excel/_base.py
Outdated
engine = "odf"

elif engine == "xlrd" and ext in ("xlsx", "xlsm"):
This warning should be in the if engine is None branch
Are you sure? This will mean that people also get a warning when they ask for the default (which is still xlrd), instead of when they explicitly ask for xlrd.
Yea so the point of it is that people who want to suppress the warnings will get a head start and explicitly request engine="openpyxl", which is a good thing to sniff out any bugs
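The behaviour agreed on in this thread (warn only when the caller relies on the default, and stay silent for an explicit engine) could be sketched roughly as below. `choose_engine` and the warning text are hypothetical and only illustrate the control flow; they are not the code that was merged.

```python
import warnings


def choose_engine(engine=None):
    """Hypothetical helper: warn only when the caller relied on the default."""
    if engine is None:
        warnings.warn(
            "The default engine will change from xlrd to openpyxl in a "
            "future version; pass engine='openpyxl' to opt in now.",
            FutureWarning,
            stacklevel=2,
        )
        return "xlrd"  # the default itself is unchanged for now
    return engine  # explicit requests, including "xlrd", stay silent
```

With this shape, users who pass `engine="openpyxl"` silence the warning and exercise the new code path early, which is the head start described above.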
just because I didn't see it mentioned above, I thought I'd mention that
Interesting, what did you base this on?
Testing based on files I'm working on. After seeing the deprecation notice, we switched our code, and found out our CI was taking one hour longer (!). We traced it back to
@roberthdevries This branch has conflicts that must be resolved. Can you also address @WillAyd's comments? #35029 (review)
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
Changes made. I've adjusted the tests only as necessary to either (i) test for openpyxl as being the default or (ii) specify engine="xlrd" if openpyxl would fail and either it seemed the test was specifically for xlrd or there was no clear way to resolve the test otherwise. I don't know how to write a test for the FutureWarning - in the testing environment, I think openpyxl will always be installed and so it will never be raised. As such, I've simply removed the tests for the FutureWarning for now. As mentioned above, I won't be available until 16:00 EST (21:00 UTC). Anyone is of course welcome to push this over the finish line.
we don't have either installed in the numpy dev environment so can certainly add some basic tests that run (likely we skip everything if we don't have either installed)
Linux py37_minimum_versions failure is
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@jreback tests pass; two failures below on Linux py38_np_dev are unrelated and passed on the previous run.
these tests are failing on other PRs too.
just some formatting considerations. ping when pushed as ok for merge. cc @jorisvandenbossche
doc/source/whatsnew/v1.2.0.rst
Outdated
.. warning::

   Previously, the default argument ``engine=None`` to ``pd.read_excel``
   would result in using the xlrd engine in many cases. The engine xlrd is no longer
double back-tick on xlrd (alt can put a link to xlrd itself, e.g. https://xlrd.readthedocs.io/en/latest/)
doc/source/whatsnew/v1.2.0.rst
Outdated
following logic is now used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
double backtick xlrd / odf (only put the docs link on L14)
pandas/io/excel/_base.py
Outdated
- "xlrd" supports most old/new Excel file formats.
- "openpyxl" supports newer Excel file formats.
- "odf" supports OpenDocument file formats (.odf, .ods, .odt).
- "pyxlsb" supports Binary Excel files.

.. versionchanged:: 1.2.0
    The engine xlrd is no longer maintained, and is not supported with
I think need a blank line here to render (make this section the same as in the whatsnew as per formatting)
No, for versionchanged this is OK (rst .. ;-))
doc/source/whatsnew/v1.2.0.rst
Outdated
maintained, and is not supported with python >= 3.9. When ``engine=None``, the
following logic is now used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
link for odf: https://pypi.org/project/odfpy/
doc/source/whatsnew/v1.2.0.rst
Outdated

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
- Otherwise if openpyxl is installed, then openpyxl will be used.
python >= 3.9. When ``engine=None``, the following logic will be
used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
obviously as much of the formatting you can do here as well
Small comment on the whatsnew, but it's perfectly fine to only address this later after the RC as well, it's not a blocker
doc/source/whatsnew/v1.2.0.rst
Outdated
- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
- Otherwise if openpyxl is installed, then openpyxl will be used.
- Otherwise xlrd will be used and a ``FutureWarning`` will be raised.
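The resolution order in the list above can be read as a short decision function. The sketch below is an illustration under stated assumptions, not the pandas implementation: `resolve_engine` and its `openpyxl_installed` flag are hypothetical, a "bytes stream" is approximated by `bytes` or `io.BufferedIOBase`, and the xlrd Book check is omitted so the example needs no third-party imports.

```python
import io
import os
import warnings


def resolve_engine(path_or_buffer, engine=None, openpyxl_installed=True):
    """Hypothetical helper mirroring the order in the whatsnew list above."""
    if engine is not None:
        return engine  # an explicit choice is returned unchanged

    ext = os.path.splitext(str(path_or_buffer))[-1].lower()
    if ext in (".odf", ".ods", ".odt"):
        return "odf"
    # Assumption: a "bytes stream" means raw bytes or a buffered binary object.
    if isinstance(path_or_buffer, (bytes, io.BufferedIOBase)) or ext == ".xls":
        return "xlrd"
    if openpyxl_installed:
        return "openpyxl"
    warnings.warn(
        "Falling back to xlrd; install openpyxl to avoid this warning.",
        FutureWarning,
        stacklevel=2,
    )
    return "xlrd"
```

Passing availability in as a flag, instead of probing for the package inside the function, keeps the fallback branch testable even in an environment where openpyxl is installed.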
I would maybe rearrange this list: the most important piece of information we want to convey here is that for xlsx files the default changed from xlrd to openpyxl, if installed. So I would also put that on top of the list (or keep it just to this for the whatsnew, as the other items didn't change. The full list is still in the actual docs).
We don't actually look at the extension or the file format when determining the engine in various cases, so it isn't just changing for xlsx files, right? What do you think of this:
Previously, the default argument ``engine=None`` to ``pd.read_excel``
would result in using the `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ engine
in many cases. The engine ``xlrd`` is no longer
maintained, and is not supported with python >= 3.9. If
`openpyxl <https://pypi.org/project/openpyxl/>`_ is installed, many of these
cases will now default to using the ``openpyxl`` engine. See the
:func:`read_excel` docs for more details.
ideally we do this now, because these docs are important (and small change)
thanks @rhshadrach and @roberthdevries for this, very nice! we may need some tweaks during the rc but can be later
Thanks @jreback. Happy to support if issues arise.
xlrd 1.2 fails if defusedxml (needed for odf) is installed Bug: pandas-dev/pandas#35029 Bug-Debian: https://bugs.debian.org/976620 Origin: upstream b3a3932af6aafaa2fd41f17e9b7995643e5f92eb Author: Robert de Vries, Rebecca N. Palmer <rebecca_palmer@zoho.com> Forwarded: not-needed Gbp-Pq: Name xlrd_976620.patch
xlrd engine in favor of openpyxl #28547
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This is MR #29375 but rebased to master