-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate using xlrd
engine in favor of openpyxl
#28547
Comments
@WillAyd Should we start with adding Happy to contribute a PR for it. |
It is already available just need to make it default over time, so want to raise a FutureWarning when the user doesn’t explicitly provide an engine that it will change to openpxyl in a future releaae |
Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases? Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying. |
The docstrings already require some updating as they currently indicate |
It would require installing a different optional package, so a version with deprecation messages/future warnings would be useful. |
Just to add - xlrd is AFAIK the only library that can read the legacy .xls format. From experience even ".xlsx" formats aren't as standardized as you'd hope. The openpyxl reader is pretty new so I guess will see through proper deprecation cycle what differences, if any, arise |
And to be clear, for those the default would stay xlrd, and this would not be deprecated, right? So the question about "switching the default" was only for xlsx files. |
Up for debate but for that I think we want to push people towards reading .xlsx files. xlrd is not maintained any more and might break with Python 3.9, so would want to get ahead of that as much as possible |
So what is the decision ? Dump xlrd disregarding .xls support and to replace it with openpyxl ? |
We need to deprecate using xlrd by default. I think it's fine to do for all extensions, including .xls - interesting in a PR? |
Hi @WillAyd I'm interested in working on this. Is the decision to just raise a FutureWarning for all extensions when no engine is explicitly provided by the user? |
@GallowayJ great - that would be much appreciated! Yes I think let's start with that and see how it looks |
Okay thanks! I'll get cracking! |
@GallowayJ hey , are you working on it ? |
@Kathakali123 Yeah |
take |
@TomAugspurger @jreback FYI removing from 1.1 milestone. linked PR isn't milestoned. |
Given that people using pandas are often not in control of the data they receive, would it be possible for pandas-dev to patch xlrd's broken use of getiterator? |
we don’t maintain xlrd at all i suppose a monkey patch makes it work from the community would be ok |
Yeah, nobody does :( , so even though this bug is absolutely trivial to fix there's nowhere to submit the patch -> python-excel/xlrd@python-excel:f8371f0...fiendish:e995456 (TBH, ElementTree.iter was introduced in python 2.7, which is the oldest version that xlrd claims to support anyway. I'm not even sure why it bothers looking at getiterator at all) |
you can try to monkey patch - if it works would be willing to consider a patch |
A year down the line, it's time to see this change made. It's disappointing to see it dropped from milestones when it needlessly results in pain for people trying to read modern Excel files. For What proportion of users are still reading data from |
Sadly some of us don't get to pick and choose what data files we work with. That's definitely not your problem to solve, but it is mine so I have to try to defend keeping xlrd alive here. Even the latest version of Excel for mac calls xls a "Common Format".
If you want to transfer ownership of the repository to me so that I can make a two line change via s/getiterator/iter (or the more nostalgic patch I linked earlier), I'm happy to make that change and no other changes just to stop the only available option for reading xls files from getting forced into the bin for a terrible reason (I mean the deprecation of getiterator, not your choice to stop maintaining). It seems reasonable to do on the premise that ElementTree.iter has existed since Python 2.7. But I don't need anyone to "step up and properly maintain" it. I just need it to not stop being an option entirely. If push comes to shove, if someone (including me) can't monkey patch around xlrd's forced obsolescence inside pandas, I can at least keep using my own patched version of xlrd as long as Pandas doesn't work towards dropping xlrd entirely. |
@fiendish you shouldn't need to use a patched version of xlrd when working with the legacy |
@fzumstein In general you may be right. Our case is a little weird. Some of our files actually come into the code without any file extension at all (don't ask, lol, I'm working on addressing that), so it was nice to have one engine that did both by peeking the contents instead of going by file extension. Relying on being able to tell the difference works only until it doesn't. Maybe a pythonic solution is to try one and then fall back to the other...or maybe pandas should rip off the peek code to choose the engine that way, or maybe I should (ugh. isn't the point of libraries so that I don't have to learn how to peek inside the file to tell what it is?)...but xlrd has worked great and fixing it to be 3.9 compliant was so easy. If xlrd hadn't stopped working in 3.9 would this issue have even been opened at all in the first place? |
@WillAyd - so, a couple of private comments and @fiendish's comment above have made me realise that xlrd still needs to stick around, BUT, only for https://groups.google.com/g/python-excel/c/IRa8IWq_4zk/m/Af8-hrRnAgAJ What I'd suggest is making https://xlrd.readthedocs.io/en/latest/api.html#xlrd.inspect_format |
@cjw296 Thanks a lot for that!
I think that's exactly what we ended up doing in #35029. Explicitly specifying |
And additional note, this change will shortly be released in pandas 1.2.0 (the RC is out since yesterday) |
@jorisvandenbossche - great! Just to be super explicit: if you attempt to open anything other than a |
Ah, good to know. I think that's certainly fine (it's faster than doing a deprecation first .. but I think many pandas users will already upgrade pandas, without necessarily upgrading xlrd, so our deprecation will still be useful). |
@cjw296 Thank you for doing this! The abstracted detector will definitely help too. |
@jorisvandenbossche - I'd forgotten how depressing it could be to interact with users of xlrd, so yes, please could you add super simple and explicit instructions, perhaps along the lines of my stack overflow answer. I mean, I don't know whether to laugh and assume @LinqLover was deliberately trolling me or cry at the bitter irony of the commit they chose to comment on... |
xlrd is unmaintained and the previous maintainer has asked us to move towards openpyxl. xlrd works now, but might have some issues when Python 3.9 or later gets released and changes some elements of the XML parser, as default usage right now throws a
PendingDeprecationWarning
Considering that I think we need to deprecate using
xlrd
in favor ofopenpyxl
. We might not necessarily need to remove the former and it does offer some functionality the latter doesn't (namely reading.xls
files) but should at the very least start moving towards the latterThe text was updated successfully, but these errors were encountered: