Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate using xlrd engine in favor of openpyxl #28547

Closed
WillAyd opened this issue Sep 20, 2019 · 32 comments · Fixed by #35029
Closed

Deprecate using xlrd engine in favor of openpyxl #28547

WillAyd opened this issue Sep 20, 2019 · 32 comments · Fixed by #35029
Assignees
Labels
Deprecate Functionality to remove in pandas good first issue IO Excel read_excel, to_excel
Milestone

Comments

@WillAyd
Copy link
Member

WillAyd commented Sep 20, 2019

xlrd is unmaintained and the previous maintainer has asked us to move towards openpyxl. xlrd works now, but might have some issues when Python 3.9 or later gets released and changes some elements of the XML parser, as default usage right now throws a PendingDeprecationWarning

Considering that I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former and it does offer some functionality the latter doesn't (namely reading .xls files) but should at the very least start moving towards the latter

@WillAyd WillAyd added Deprecate Functionality to remove in pandas good first issue IO Excel read_excel, to_excel labels Sep 20, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Sep 20, 2019
@arpit1997
Copy link

@WillAyd Should we start with adding openpyxl as an engine in pandas.read_excel.

Happy to contribute a PR for it.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 20, 2019

It is already available just need to make it default over time, so want to raise a FutureWarning when the user doesn’t explicitly provide an engine that it will change to openpxyl in a future releaae

@jorisvandenbossche
Copy link
Member

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

@153957
Copy link

153957 commented Sep 23, 2019

The docstrings already require some updating as they currently indicate 'xlrd' is the only option for 'engine'.

@153957
Copy link

153957 commented Sep 23, 2019

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

It would require installing a different optional package, so a version with deprecation messages/future warnings would be useful.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 23, 2019

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format. From experience even ".xlsx" formats aren't as standardized as you'd hope. The openpyxl reader is pretty new so I guess will see through proper deprecation cycle what differences, if any, arise

@jorisvandenbossche
Copy link
Member

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format.

And to be clear, for those the default would stay xlrd, and this would not be deprecated, right?

So the question about "switching the default" was only for xlsx files.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 23, 2019

Up for debate but for that I think we want to push people towards reading .xlsx files. xlrd is not maintained any more and might break with Python 3.9, so would want to get ahead of that as much as possible

@Hiyorimi
Copy link

Hiyorimi commented Oct 9, 2019

So what is the decision ?

Dump xlrd disregarding .xls support and to replace it with openpyxl ?

@WillAyd
Copy link
Member Author

WillAyd commented Oct 9, 2019

We need to deprecate using xlrd by default. I think it's fine to do for all extensions, including .xls - interesting in a PR?

@GallowayJ
Copy link

Hi @WillAyd I'm interested in working on this. Is the decision to just raise a FutureWarning for all extensions when no engine is explicitly provided by the user?

@WillAyd
Copy link
Member Author

WillAyd commented Oct 17, 2019

@GallowayJ great - that would be much appreciated! Yes I think let's start with that and see how it looks

@GallowayJ
Copy link

Okay thanks! I'll get cracking!

@Kathakali123
Copy link

@GallowayJ hey , are you working on it ?

@GallowayJ
Copy link

@Kathakali123 Yeah

@roberthdevries
Copy link
Contributor

take

@jreback jreback modified the milestones: Contributions Welcome, 1.1 Jul 2, 2020
@simonjayhawkins
Copy link
Member

@TomAugspurger @jreback FYI removing from 1.1 milestone. linked PR isn't milestoned.

@fiendish
Copy link
Contributor

Given that people using pandas are often not in control of the data they receive, would it be possible for pandas-dev to patch xlrd's broken use of getiterator?

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

we don’t maintain xlrd at all

i suppose a monkey patch makes it work from the community would be ok

@fiendish
Copy link
Contributor

fiendish commented Nov 18, 2020

we don’t maintain xlrd at all

Yeah, nobody does :( , so even though this bug is absolutely trivial to fix there's nowhere to submit the patch -> python-excel/xlrd@python-excel:f8371f0...fiendish:e995456

(TBH, ElementTree.iter was introduced in python 2.7, which is the oldest version that xlrd claims to support anyway. I'm not even sure why it bothers looking at getiterator at all)

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

you can try to monkey patch - if it works would be willing to consider a patch

@cjw296
Copy link
Contributor

cjw296 commented Nov 29, 2020

A year down the line, it's time to see this change made. It's disappointing to see it dropped from milestones when it needlessly results in pain for people trying to read modern Excel files.

For .xlsx, xlrd absolutely positively should not be used, and I say that as the main maintainer of xlrd over the last decade plus.

What proportion of users are still reading data from .xls files (as opposed to .xlsx)? While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd. Nevermind the dangers of using the .xls pseudo-standard that have caused some quite high profile problems of late.

@fiendish
Copy link
Contributor

fiendish commented Nov 29, 2020

What proportion of users are still reading data from .xls files (as opposed to .xlsx)?

Sadly some of us don't get to pick and choose what data files we work with. That's definitely not your problem to solve, but it is mine so I have to try to defend keeping xlrd alive here. Even the latest version of Excel for mac calls xls a "Common Format".

While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd.

If you want to transfer ownership of the repository to me so that I can make a two line change via s/getiterator/iter (or the more nostalgic patch I linked earlier), I'm happy to make that change and no other changes just to stop the only available option for reading xls files from getting forced into the bin for a terrible reason (I mean the deprecation of getiterator, not your choice to stop maintaining). It seems reasonable to do on the premise that ElementTree.iter has existed since Python 2.7.

But I don't need anyone to "step up and properly maintain" it. I just need it to not stop being an option entirely. If push comes to shove, if someone (including me) can't monkey patch around xlrd's forced obsolescence inside pandas, I can at least keep using my own patched version of xlrd as long as Pandas doesn't work towards dropping xlrd entirely.

@fzumstein
Copy link

@fiendish you shouldn't need to use a patched version of xlrd when working with the legacy xls format as the issue you fixed on your branch was only an issue when using xlrd with the new xlsx format.

@fiendish
Copy link
Contributor

fiendish commented Dec 6, 2020

@fzumstein In general you may be right. Our case is a little weird. Some of our files actually come into the code without any file extension at all (don't ask, lol, I'm working on addressing that), so it was nice to have one engine that did both by peeking the contents instead of going by file extension. Relying on being able to tell the difference works only until it doesn't. Maybe a pythonic solution is to try one and then fall back to the other...or maybe pandas should rip off the peek code to choose the engine that way, or maybe I should (ugh. isn't the point of libraries so that I don't have to learn how to peek inside the file to tell what it is?)...but xlrd has worked great and fixing it to be 3.9 compliant was so easy. If xlrd hadn't stopped working in 3.9 would this issue have even been opened at all in the first place?

@cjw296
Copy link
Contributor

cjw296 commented Dec 11, 2020

@WillAyd - so, a couple of private comments and @fiendish's comment above have made me realise that xlrd still needs to stick around, BUT, only for .xls files:

https://groups.google.com/g/python-excel/c/IRa8IWq_4zk/m/Af8-hrRnAgAJ

What I'd suggest is making xlrd the default engine for .xls files and openpyxl the default for .xlsx files.
If it helps, I've abstracted out the code in xlrd that figures out what the type of a spreadsheet file is here:

https://xlrd.readthedocs.io/en/latest/api.html#xlrd.inspect_format

@jorisvandenbossche
Copy link
Member

@cjw296 Thanks a lot for that!

What I'd suggest is making xlrd the default engine for .xls files and openpyxl the default for .xlsx files.

I think that's exactly what we ended up doing in #35029. Explicitly specifying engine="xlrd" for anything else than xls files is deprecated, and when reading xlsx files we now default to openpyxl (if it is installed, at least. If only xlrd is installed, we also raise a deprecation warning that people should install openpyxl instead)

@jorisvandenbossche
Copy link
Member

And additional note, this change will shortly be released in pandas 1.2.0 (the RC is out since yesterday)

@cjw296
Copy link
Contributor

cjw296 commented Dec 11, 2020

@jorisvandenbossche - great! Just to be super explicit: if you attempt to open anything other than a .xls file with xlrd 2.0.0+, you'll get an exception.

@jorisvandenbossche
Copy link
Member

Ah, good to know. I think that's certainly fine (it's faster than doing a deprecation first .. but I think many pandas users will already upgrade pandas, without necessarily upgrading xlrd, so our deprecation will still be useful).
But maybe we should then update pandas to check for this (case of xlrd >= 2.0), to raise a more informative error message (otherwise if you have eg only xlrd installed, from the xlrd error message it might seem that pandas does not support a certain format, while it's only xlrd that doesn't support it)

@fiendish
Copy link
Contributor

fiendish commented Dec 11, 2020

@cjw296 Thank you for doing this! The abstracted detector will definitely help too.

@cjw296
Copy link
Contributor

cjw296 commented Dec 12, 2020

@jorisvandenbossche - I'd forgotten how depressing it could be to interact with users of xlrd, so yes, please could you add super simple and explicit instructions, perhaps along the lines of my stack overflow answer.

I mean, I don't know whether to laugh and assume @LinqLover was deliberately trolling me or cry at the bitter irony of the commit they chose to comment on...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas good first issue IO Excel read_excel, to_excel
Projects
None yet