Add xlsxwriter to improve to_excel performance #701
Codecov Report
@@          Coverage Diff          @@
##           main   #701   +/-   ##
=====================================
  Coverage   94.8%  94.9%
=====================================
  Files         58     58
  Lines       5856   5853    -3
=====================================
- Hits        5557   5555    -2
+ Misses       299    298    -1
I just pip-installed pyam in a clean environment and xlsxwriter was already installed automatically. Not sure if it's necessary to add it as an explicit dependency?
Interesting, it's not a dependency of pandas (https://github.com/pandas-dev/pandas/blob/main/setup.cfg#L33).
You're right, looks like I was confused between two environments. Running some profiling now.
Note that openpyxl is hard-coded in line 195 in 759120f:
    excel_file = pd.ExcelFile(data, engine="openpyxl")
Also, I guess it's possible to remove openpyxl as a dependency?
I just ran some tests and can confirm that
- removing the hard-coded engine works whether or not xlsxwriter is installed
- performance is significantly better if xlsxwriter is installed, with no need to specify it as an engine explicitly (on a 30 MB xlsx file, the resulting file is 50% smaller, the runtime is 45% lower, and CPU usage is 80% lower)
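To illustrate why no explicit engine is needed, here is a minimal sketch of the selection order described above: xlsxwriter if it can be imported, otherwise openpyxl. The function name `default_xlsx_writer_engine` and its `is_installed` parameter are hypothetical and only mimic the behaviour; this is not pandas' own code.

```python
from importlib.util import find_spec


def default_xlsx_writer_engine(is_installed=None):
    """Return the engine expected to be picked for .xlsx output.

    Assumption: mirrors the "xlsxwriter first, then openpyxl" order
    discussed in this thread; it does not call into pandas.
    """
    if is_installed is None:
        # find_spec returns None when a top-level module cannot be found.
        def is_installed(name):
            return find_spec(name) is not None

    for engine in ("xlsxwriter", "openpyxl"):
        if is_installed(engine):
            return engine
    return None  # no writer engine available


print(default_xlsx_writer_engine())
```

Passing a custom `is_installed` callable makes it easy to check both cases without creating two environments.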
Co-authored-by: Daniel Huppmann <dh@dergelbesalon.at>
There is another usage of pd.ExcelWriter where the engine is not explicitly specified (line 2377 in 759120f). Please make the two usages consistent.
Updated the usage of xlsxwriter (https://github.com/IAMconsortium/pyam/blob/main/pyam/utils.py#L135). Is there a test for
It's tested implicitly via line 56 in 759120f.
The line you found was a hacky attempt by me to make xlsx files look "nice" by having useful column widths.
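For readers unfamiliar with the column-width trick, here is a rough sketch of how it could look with the xlsxwriter engine. The helper names `column_widths` and `write_with_widths`, the padding, and the width cap are assumptions for illustration and are not the code in pyam/utils.py; `set_column` is specific to xlsxwriter worksheets, which is why the engine choice matters here.

```python
def column_widths(header, rows, padding=2, max_width=60):
    """Compute one width per column from the longest cell text."""
    widths = [len(str(name)) for name in header]
    for row in rows:
        for i, value in enumerate(row):
            widths[i] = max(widths[i], len(str(value)))
    return [min(width + padding, max_width) for width in widths]


def write_with_widths(df, path, sheet_name="data"):
    """Write `df` to `path` and widen the columns (needs pandas and xlsxwriter)."""
    import pandas as pd  # imported lazily so column_widths stays stdlib-only

    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)
        worksheet = writer.sheets[sheet_name]
        widths = column_widths(df.columns, df.itertuples(index=False))
        for i, width in enumerate(widths):
            # set_column(first_col, last_col, width) is an xlsxwriter call
            worksheet.set_column(i, i, width)
```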
OK, should I check if using xlsxwriter triggered the error?
Let me give this a quick try, I think you have more urgent things on your to-do list...
True, just thought this might be a quick one to get off the list ...
Please confirm that this PR has done the following:
- Tests Added (none needed except for possibly benchmarks)
- Documentation Added
- Description of PR

Added xlsxwriter to the dependencies of pyam. pandas uses xlsxwriter over openpyxl if it's found. This means that if xlsxwriter is found on the system, it is used without any changes required.
According to benchmarks (https://exchangetuts.com/python-fastest-way-to-write-pandas-dataframe-to-excel-on-multiple-sheets-1640154784194443), xlsxwriter is significantly faster than openpyxl.
Should I set up some benchmarks of our own to test it for pyam?
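A benchmark of our own could start from something like the sketch below. The helper names `best_time` and `compare_engines`, the synthetic data, and the row count are all assumptions for illustration; a realistic test would use an actual pyam export rather than a toy DataFrame.

```python
import os
import tempfile
import time


def best_time(func, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_engines(n_rows=10_000, engines=("xlsxwriter", "openpyxl")):
    """Time DataFrame.to_excel per engine (needs pandas and both engines).

    Returns {engine: (seconds, file_size_in_bytes)}.
    """
    import pandas as pd  # imported lazily so best_time stays stdlib-only

    # Synthetic stand-in data, not a real pyam dataset.
    df = pd.DataFrame(
        {"variable": ["Primary Energy"] * n_rows, "value": range(n_rows)}
    )
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for engine in engines:
            path = os.path.join(tmp, f"{engine}.xlsx")
            seconds = best_time(
                lambda: df.to_excel(path, engine=engine, index=False)
            )
            results[engine] = (seconds, os.path.getsize(path))
    return results
```

Calling `compare_engines()` in an environment with both engines installed reports time and file size side by side; CPU usage would need a separate profiler.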