Deprecate using `xlrd` engine in favor of openpyxl #28547

WillAyd · 2019-09-20T03:25:00Z

xlrd is unmaintained and the previous maintainer has asked us to move towards openpyxl. xlrd works now, but might have some issues when Python 3.9 or later gets released and changes some elements of the XML parser, as default usage right now throws a PendingDeprecationWarning.

Considering that, I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former, and it does offer some functionality the latter doesn't (namely reading .xls files), but we should at the very least start moving towards the latter.


arpit1997 · 2019-09-20T08:21:27Z

@WillAyd Should we start with adding openpyxl as an engine in pandas.read_excel?

Happy to contribute a PR for it.

WillAyd · 2019-09-20T12:50:51Z

It is already available; we just need to make it the default over time. So we want to raise a FutureWarning when the user doesn't explicitly provide an engine, saying that the default will change to openpyxl in a future release.
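As a rough sketch of that idea (not the code pandas ended up with), a reader could warn only when the caller leaves `engine` unset. The helper name `resolve_engine` and the warning text are hypothetical and exist only for illustration.

```python
import warnings


def resolve_engine(engine=None):
    """Return the engine to use, warning if the caller relied on the default.

    Hypothetical helper for illustration; not part of pandas.
    """
    if engine is None:
        warnings.warn(
            "The default Excel engine will change from 'xlrd' to 'openpyxl' "
            "in a future release; pass engine= explicitly to silence this.",
            FutureWarning,
            stacklevel=2,
        )
        return "xlrd"  # the current default stays in place during the deprecation
    return engine
```

Callers who pass `engine="xlrd"` or `engine="openpyxl"` explicitly would see no warning under this scheme.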

jorisvandenbossche · 2019-09-21T13:51:48Z

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

153957 · 2019-09-23T12:40:20Z

The docstrings already require some updating as they currently indicate 'xlrd' is the only option for 'engine'.

153957 · 2019-09-23T19:57:52Z

> Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

It would require installing a different optional package, so a version with deprecation messages/future warnings would be useful.

WillAyd · 2019-09-23T20:02:25Z

> Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

> Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format. From experience, even ".xlsx" formats aren't as standardized as you'd hope. The openpyxl reader is pretty new, so I guess we will see through a proper deprecation cycle what differences, if any, arise.

jorisvandenbossche · 2019-09-23T21:25:09Z

> Just to add - xlrd is AFAIK the only library that can read the legacy .xls format.

And to be clear, for those the default would stay xlrd, and this would not be deprecated, right?

So the question about "switching the default" was only for xlsx files.

WillAyd · 2019-09-23T21:30:58Z

Up for debate but for that I think we want to push people towards reading .xlsx files. xlrd is not maintained any more and might break with Python 3.9, so would want to get ahead of that as much as possible

Hiyorimi · 2019-10-09T10:55:16Z

So what is the decision?

Dump xlrd, disregarding .xls support, and replace it with openpyxl?

WillAyd · 2019-10-09T14:53:14Z

We need to deprecate using xlrd by default. I think it's fine to do for all extensions, including .xls - interested in a PR?

GallowayJ · 2019-10-17T02:21:30Z

Hi @WillAyd I'm interested in working on this. Is the decision to just raise a FutureWarning for all extensions when no engine is explicitly provided by the user?

WillAyd · 2019-10-17T02:48:02Z

@GallowayJ great - that would be much appreciated! Yes I think let's start with that and see how it looks

GallowayJ · 2019-10-17T03:00:26Z

Okay thanks! I'll get cracking!

Kathakali123 · 2019-10-17T17:45:51Z

@GallowayJ hey, are you working on it?

GallowayJ · 2019-10-17T18:06:50Z

@Kathakali123 Yeah

roberthdevries · 2020-06-27T14:21:20Z

take

simonjayhawkins · 2020-07-22T11:13:15Z

@TomAugspurger @jreback FYI removing from 1.1 milestone. linked PR isn't milestoned.

fiendish · 2020-11-18T18:17:42Z

Given that people using pandas are often not in control of the data they receive, would it be possible for pandas-dev to patch xlrd's broken use of getiterator?

jreback · 2020-11-18T19:25:29Z

we don’t maintain xlrd at all

I suppose a monkey patch from the community that makes it work would be OK

fiendish · 2020-11-18T22:08:34Z

> we don’t maintain xlrd at all

Yeah, nobody does :( , so even though this bug is absolutely trivial to fix there's nowhere to submit the patch -> python-excel/xlrd@python-excel:f8371f0...fiendish:e995456

(TBH, ElementTree.iter was introduced in python 2.7, which is the oldest version that xlrd claims to support anyway. I'm not even sure why it bothers looking at getiterator at all)
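For reference, the spelling change under discussion is small. A minimal standard-library sketch follows; the XML snippet and the `cell_values` helper are made up for illustration and are not xlrd's code.

```python
import xml.etree.ElementTree as ET

# Made-up XML, loosely shaped like rows of cells.
SHEET_XML = "<sheet><row><c>1</c><c>2</c></row><row><c>3</c></row></sheet>"


def cell_values(xml_text):
    """Collect the text of every <c> element, in document order."""
    root = ET.fromstring(xml_text)
    # Element.iter() is the supported spelling; the older
    # Element.getiterator() was deprecated and then removed in Python 3.9.
    return [cell.text for cell in root.iter("c")]


print(cell_values(SHEET_XML))
```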

jreback · 2020-11-18T23:30:33Z

you can try to monkey patch - if it works would be willing to consider a patch

cjw296 · 2020-11-29T18:46:04Z

A year down the line, it's time to see this change made. It's disappointing to see it dropped from milestones when it needlessly results in pain for people trying to read modern Excel files.

For .xlsx, xlrd absolutely positively should not be used, and I say that as the main maintainer of xlrd over the last decade plus.

What proportion of users are still reading data from .xls files (as opposed to .xlsx)? While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd. Never mind the dangers of using the .xls pseudo-standard that have caused some quite high profile problems of late.

fiendish · 2020-11-29T19:22:59Z

> What proportion of users are still reading data from .xls files (as opposed to .xlsx)?

Sadly some of us don't get to pick and choose what data files we work with. That's definitely not your problem to solve, but it is mine so I have to try to defend keeping xlrd alive here. Even the latest version of Excel for mac calls xls a "Common Format".

> While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd.

If you want to transfer ownership of the repository to me so that I can make a two line change via s/getiterator/iter (or the more nostalgic patch I linked earlier), I'm happy to make that change and no other changes just to stop the only available option for reading xls files from getting forced into the bin for a terrible reason (I mean the deprecation of getiterator, not your choice to stop maintaining). It seems reasonable to do on the premise that ElementTree.iter has existed since Python 2.7.

But I don't need anyone to "step up and properly maintain" it. I just need it to not stop being an option entirely. If push comes to shove, if someone (including me) can't monkey patch around xlrd's forced obsolescence inside pandas, I can at least keep using my own patched version of xlrd as long as Pandas doesn't work towards dropping xlrd entirely.

fzumstein · 2020-12-06T21:53:34Z

@fiendish you shouldn't need to use a patched version of xlrd when working with the legacy xls format as the issue you fixed on your branch was only an issue when using xlrd with the new xlsx format.

fiendish · 2020-12-06T22:14:11Z

@fzumstein In general you may be right. Our case is a little weird. Some of our files actually come into the code without any file extension at all (don't ask, lol, I'm working on addressing that), so it was nice to have one engine that did both by peeking the contents instead of going by file extension. Relying on being able to tell the difference works only until it doesn't. Maybe a pythonic solution is to try one and then fall back to the other...or maybe pandas should rip off the peek code to choose the engine that way, or maybe I should (ugh. isn't the point of libraries so that I don't have to learn how to peek inside the file to tell what it is?)...but xlrd has worked great and fixing it to be 3.9 compliant was so easy. If xlrd hadn't stopped working in 3.9 would this issue have even been opened at all in the first place?
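Peeking at the contents can be fairly small. Here is a sketch, assuming the usual container signatures: legacy .xls files are OLE2 compound documents and .xlsx files are ZIP archives. The function name `sniff_excel_container` is hypothetical, and a ZIP signature alone does not prove the file is a workbook, so the result is only a hint.

```python
import io

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound file (legacy .xls)
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header (.xlsx, but also other formats)


def sniff_excel_container(stream):
    """Guess 'xls', 'zip' or None from the first bytes of a binary stream."""
    pos = stream.tell()
    head = stream.read(8)
    stream.seek(pos)  # leave the stream where we found it
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return None


# Works on any seekable binary stream, e.g. an in-memory buffer.
print(sniff_excel_container(io.BytesIO(ZIP_MAGIC + b"...")))
```

With a file on disk, the same call would be made on a handle opened in `"rb"` mode, and the answer used to pick an engine instead of trusting the extension.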

cjw296 · 2020-12-11T10:31:00Z

@WillAyd - so, a couple of private comments and @fiendish's comment above have made me realise that xlrd still needs to stick around, BUT, only for .xls files:

https://groups.google.com/g/python-excel/c/IRa8IWq_4zk/m/Af8-hrRnAgAJ

What I'd suggest is making xlrd the default engine for .xls files and openpyxl the default for .xlsx files.
If it helps, I've abstracted out the code in xlrd that figures out what the type of a spreadsheet file is here:

https://xlrd.readthedocs.io/en/latest/api.html#xlrd.inspect_format

jorisvandenbossche · 2020-12-11T10:53:42Z

@cjw296 Thanks a lot for that!

> What I'd suggest is making xlrd the default engine for .xls files and openpyxl the default for .xlsx files.

I think that's exactly what we ended up doing in #35029. Explicitly specifying engine="xlrd" for anything else than xls files is deprecated, and when reading xlsx files we now default to openpyxl (if it is installed, at least. If only xlrd is installed, we also raise a deprecation warning that people should install openpyxl instead)
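Roughly, the rule described above can be sketched as follows. This illustrates the behaviour only and is not the code from #35029. The name `default_engine` and its `openpyxl_installed` flag are hypothetical, the choice of `FutureWarning` is an assumption, and the sketch only distinguishes .xls from everything else.

```python
import os
import warnings


def default_engine(path, openpyxl_installed=True):
    """Pick a reader engine from the file extension (illustrative only)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".xls":
        return "xlrd"  # legacy format: xlrd stays the default
    # Simplification: every other extension is treated like .xlsx here.
    if openpyxl_installed:
        return "openpyxl"
    # Only xlrd is available: keep working, but nudge towards openpyxl.
    warnings.warn(
        "Reading this file with xlrd is deprecated; install openpyxl instead.",
        FutureWarning,
        stacklevel=2,
    )
    return "xlrd"
```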

jorisvandenbossche · 2020-12-11T10:54:26Z

And additional note, this change will shortly be released in pandas 1.2.0 (the RC is out since yesterday)

cjw296 · 2020-12-11T11:04:44Z

@jorisvandenbossche - great! Just to be super explicit: if you attempt to open anything other than a .xls file with xlrd 2.0.0+, you'll get an exception.

jorisvandenbossche · 2020-12-11T11:22:45Z

Ah, good to know. I think that's certainly fine (it's faster than doing a deprecation first .. but I think many pandas users will already upgrade pandas, without necessarily upgrading xlrd, so our deprecation will still be useful).
But maybe we should then update pandas to check for this (case of xlrd >= 2.0), to raise a more informative error message (otherwise if you have eg only xlrd installed, from the xlrd error message it might seem that pandas does not support a certain format, while it's only xlrd that doesn't support it)
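Such a check could look something like the sketch below. It assumes only what is stated above, namely that xlrd 2.0.0+ refuses anything other than .xls. The function name `check_xlrd_supports`, the error type, and the message wording are all hypothetical.

```python
def check_xlrd_supports(ext, xlrd_version):
    """Raise a descriptive error if this xlrd version cannot read `ext`.

    `ext` is a file extension such as ".xlsx"; `xlrd_version` is a version
    string such as "2.0.1". Hypothetical helper, not pandas code.
    """
    major = int(xlrd_version.split(".")[0])
    if major >= 2 and ext.lower() != ".xls":
        raise ValueError(
            f"xlrd {xlrd_version} only reads .xls files; "
            f"install openpyxl to read {ext} files."
        )


check_xlrd_supports(".xls", "2.0.1")   # fine: .xls is still supported
check_xlrd_supports(".xlsx", "1.2.0")  # fine: older xlrd still reads .xlsx
```

Raising from the calling library rather than letting the xlrd exception surface makes it clear that the format itself is supported and only the installed reader is the problem.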

fiendish · 2020-12-11T21:23:27Z

@cjw296 Thank you for doing this! The abstracted detector will definitely help too.

cjw296 · 2020-12-12T14:55:50Z

@jorisvandenbossche - I'd forgotten how depressing it could be to interact with users of xlrd, so yes, please could you add super simple and explicit instructions, perhaps along the lines of my stack overflow answer.

I mean, I don't know whether to laugh and assume @LinqLover was deliberately trolling me or cry at the bitter irony of the commit they chose to comment on...

pandas-dev/pandas#28547 (comment)

WillAyd added the Deprecate (Functionality to remove in pandas), good first issue, and IO Excel (read_excel, to_excel) labels Sep 20, 2019

WillAyd added this to the Contributions Welcome milestone Sep 20, 2019

cruzzoe added a commit to cruzzoe/pandas that referenced this issue Nov 2, 2019

pandas-dev#28547 - deprecate xlrd - FutureWarnings

bd59d77

cruzzoe added a commit to cruzzoe/pandas that referenced this issue Nov 2, 2019

pandas-dev#28547 - deprecate xlrd - FutureWarnings

5c32f95

cruzzoe mentioned this issue Nov 2, 2019

Deprecate using xlrd engine #29375

Closed

5 tasks

cruzzoe added a commit to cruzzoe/pandas that referenced this issue Jan 26, 2020

Add in deprecation msg for (pandas-dev#28547)

3073f5a

cruzzoe added a commit to cruzzoe/pandas that referenced this issue Jan 26, 2020

fixup! Add in deprecation msg for (pandas-dev#28547)

a1bb870

aloukina mentioned this issue Jan 30, 2020

Deprecation warning for xlsx EducationalTestingService/rsmtool#345

Closed

JS3xton mentioned this issue May 6, 2020

Migrate pandas read_excel engine from xlrd to openpyxl. taborlab/FlowCal#325

Merged

github-actions bot assigned roberthdevries Jun 27, 2020

roberthdevries mentioned this issue Jun 27, 2020

DEPR: Deprecate using xlrd engine for read_excel #35029

Merged

5 tasks

roberthdevries pushed a commit to roberthdevries/pandas that referenced this issue Jun 29, 2020

Add in deprecation msg for (pandas-dev#28547)

d14175c

jreback modified the milestones: Contributions Welcome, 1.1 Jul 2, 2020

simonjayhawkins removed this from the 1.1 milestone Jul 22, 2020

fzumstein mentioned this issue Jul 27, 2020

ENH: Auto Detect engine in read_excel #35416

Closed

3 tasks

jorisvandenbossche added this to the 1.2 milestone Oct 27, 2020

jorisvandenbossche mentioned this issue Oct 27, 2020

xlrd is required for handle excel in pandas but it is EOL and yield warnings on py37 #37438

Closed

twoertwein mentioned this issue Nov 13, 2020

getiterator deprecated in Python 3.9; failure to call pd.read_excel() #37795

Closed

jreback closed this as completed in #35029 Dec 1, 2020

jorisvandenbossche mentioned this issue Dec 12, 2020

shift default excel read engine from xlrd to openpyxl #38424

Closed

jorisvandenbossche mentioned this issue Dec 22, 2020

Better inference of spreadsheet formats. #38522

Closed

dgdekoning mentioned this issue Jan 12, 2021

Doing absolutely the wrong thing until upstream is fixed LCA-ActivityBrowser/activity-browser#497

Merged

dgdekoning added a commit to LCA-ActivityBrowser/activity-browser that referenced this issue Jan 12, 2021

Doing absolutely the wrong thing until upstream is fixed (#497)

5da3231

pandas-dev/pandas#28547 (comment)

dietmar mentioned this issue Oct 12, 2022

[ENH] - Add openpyxl to scipy-notebook to support .xlsx Excel files jupyter/docker-stacks#1804

Closed

rhshadrach mentioned this issue Oct 28, 2022

CLN: Remove xlrd < 2.0 code #49376

Merged

5 tasks
