Upgrade to xlrd 2.0.0 + openpyxl #3191
I implemented some tests in the same fashion as the ones for jay. For them to run on CI, openpyxl has to be installed during the appveyor build (here and in the subsequent pip-installs, I guess).
@carrascomj thanks for your contribution! Nice to know we can't use @samukweku If you only adjust the title to
@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test
Oops, I had not seen that issue or the PR associated with it, sorry. As for the mentioned 10x performance decrease, it was caused by an edge case: the xlsx contained a link to another file (see pandas-dev/pandas#35029 (comment)). That said, the code in this PR is far from optimal. I will run some benchmarks and see how far openpyxl can go. However, I agree that a custom parser (as was suggested before) is more in line with the performance expected from datatable.
I tested the performance with master (4 columns, 200000 rows: The changes on 977af71 do not change performance but simplify the code.
if subpath in wb.sheetnames:
    sheetname = subpath
else:
    if "/" in subpath:
Appveyor is failing on Windows because of the conversion to a path with backslashes.
(1) C:\Path\to\file.xlsx\Sheet\A2:B9 fails (also for xls files with the current xlrd-based parser).
(2) C:\Path\to\file.xlsx\Sheet/A2:B9 also fails, since there is no sheet named C:\Path\to\file.xlsx\Sheet after splitting.
(3) C:\Path\to\file.xlsx/Sheet/A2:B9 works.
The intended behavior on Windows is to work with (1), right? I guess both xlsx.py and xls.py should check for backslashes if no (sheetname, range2d) pair was found when a subpath is specified.
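A minimal sketch of such a check, accepting both slash types. The helper name `split_excel_subpath` and its return shape are hypothetical and not part of datatable; it assumes the file name ends in `.xls` or `.xlsx` and that no directory in the path carries one of those extensions.

```python
import re

# Hypothetical helper for illustration only; not datatable's actual code.
# Matches ".xls" or ".xlsx" when followed by a path separator or end of string.
_EXCEL_EXT = re.compile(r"\.xlsx?(?=[\\/]|$)", re.IGNORECASE)


def split_excel_subpath(path):
    """Split "file.xlsx<sep>Sheet<sep>A2:B9" into (filename, parts).

    <sep> may be "/" or "\\". `parts` holds the pieces found after the
    file name, e.g. ["Sheet", "A2:B9"], or [] if there is no subpath.
    """
    match = _EXCEL_EXT.search(path)
    if match is None:
        # No Excel extension found: treat the whole string as the file name.
        return path, []
    filename = path[:match.end()]
    rest = path[match.end():]
    # Split on either separator and drop empty pieces (leading separator).
    parts = [piece for piece in re.split(r"[\\/]", rest) if piece]
    return filename, parts


if __name__ == "__main__":
    for candidate in (
        r"C:\Path\to\file.xlsx\Sheet\A2:B9",
        r"C:\Path\to\file.xlsx\Sheet/A2:B9",
        r"C:\Path\to\file.xlsx/Sheet/A2:B9",
    ):
        print(split_excel_subpath(candidate))
```

With this approach all three spellings above resolve to the same file name and the same (sheet, range) pieces, so user code would not need to change between platforms.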
Windows support was added after the excel reading feature, so, you're right, we probably missed the backslash issue. My feeling is that on Windows we should support both types of slashes. What do you think?
Yes, I guess it would be more consistent for the user if they do not have to change the code for different platforms. I'll give it a try in a day.
This issue has been fixed as of #3220
I'll give it a try in a day.
Do you still want to run some benchmarks for this PR?
Thanks for testing @carrascomj. "2x performance decrease" seems quite significant...
Related to #2632
Description
xlrd dropped support for anything but xls at 2.0.0 (python-excel/xlrd#371). Using the old version may cause security vulnerabilities and potentially incorrect parsing. It is also a problem for people who have installed pandas in their environment (with, more likely, openpyxl and xlrd>2.0).
Implementation
Use openpyxl for xlsx, the recommended alternative.
I would like to know if this is desirable before writing any unit tests.
Thanks!