Migrate pandas read_excel engine from xlrd to openpyxl. #325
- Modify FlowCal.excel_ui.read_table() to first try openpyxl engine when reading an Excel file and then xlrd (which is the only package that can read old-style XLS files).
- Expose the pd.read_excel() `engine` parameter to the user from FlowCal.excel_ui.read_table().
- Add new unit test that reads old-style XLS file.
- Modify requirements file to require openpyxl package.
We might also consider using
It definitely can. I'm using The thing I don't like about
Hmm, good to know. Without
I still haven't forgiven them for
I'm not completely convinced we need to keep xlrd. The code to support both seems complex, especially inside I'm ok with dropping xlrd completely even if it means no support for .xls (are people still using that?). The resulting code would be simpler and easier to maintain.
Superficially, I think having One thing to consider, though, is what dropping
OK, let's leave
Dropping
Synopsis
FlowCal relies on `pandas` to read Excel files, and `pandas` is changing its underlying parsing engine to move away from a package that is no longer maintained (`xlrd`). Moreover, `xlrd` recently started throwing `DeprecationWarning`s in some use cases. This pull request establishes a preference in FlowCal for the actively maintained `openpyxl` engine over `xlrd`. `xlrd` is still retained, though, because it can parse old-format Excel files (`.xls`).

Background: `xlrd`, `defusedxml`, `getiterator`, and `DeprecationWarning`
`xlrd` throws a `DeprecationWarning` when it uses the `defusedxml` XML parsing library to read an Excel file in Python 3.8. `defusedxml`, which is installed with Anaconda since v5.3.0, emulates the standard Python XML parsing library (`xml.etree.ElementTree`) and additionally prevents malicious XML operations, making it the preferred parser of `xlrd`. The `DeprecationWarning` occurs because `xlrd` misunderstands `defusedxml`'s incomplete emulation of `xml.etree.ElementTree` and resorts to a function (`getiterator`) that has been deprecated since Python 3.2 and started throwing a `DeprecationWarning` in Python 3.8. This decision occurs despite the availability of the recommended alternative (`iter`) because `xlrd` initially checks for an `ElementTree` class in the parsing library before checking for `iter`, and the `xml.etree.ElementTree` module contains this class whereas the `defusedxml.ElementTree` module does not. When `xlrd` later encounters a `defusedxml`-produced `xml.etree.ElementTree.ElementTree`, it uses `getiterator()` instead of `iter()`, thus resulting in the `DeprecationWarning`.

Proposed solution
To address the `DeprecationWarning` and the impending transition to `openpyxl`, I modified FlowCal to read Excel files first with `openpyxl` and then upon failure with `xlrd`. I also exposed the `engine` parameter so it can alternatively be specified explicitly.

I chose to retain `xlrd` because it's the only package available that can read old-format Excel files (`.xls`). I also added a unit test for this.

NOTE: The `openpyxl` version I selected for the `requirements.txt` file is the version associated with Anaconda v4.3.0, which is the lowest version of Anaconda we support (below it, `matplotlib` violates `requirements.txt`).

Unit test behavior before pull request (9cd83ef)
All unit tests pass with the following builds: Python 3.8, Anaconda 2020.02; Python 3.8, vanilla (no Anaconda, +openpyxl); Python 3.8, vanilla −openpyxl; Python 2.7, Anaconda 2019.10; Python 2.7, vanilla; Python 2.7, vanilla −openpyxl; Python 2.7, Anaconda 4.3.0 (oldest supported).
`python -m unittest test.test_excel_ui` throws a `DeprecationWarning` with Python 3.8, Anaconda 2020.02.

Unit test behavior after pull request (cfcf012)
All unit tests pass with the same builds.
`python -m unittest test.test_excel_ui` no longer throws a `DeprecationWarning` with Python 3.8, Anaconda 2020.02.

Effects on performance
FlowCal reads Excel files ~50% slower with `openpyxl` than with `xlrd` in Python 3.8, Anaconda 2020.02:

Before pull request (9cd83ef) (`xlrd`):

After pull request (cfcf012) (`openpyxl`):

This is a tolerable slowdown in my opinion.
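
Since the slowdown follows from which engine is tried first, the preference order described under "Proposed solution" may be easier to review when spelled out. The following is only a rough sketch and not the code in `FlowCal.excel_ui.read_table()`: the function name `read_excel_with_fallback`, the `engines` tuple, and the injected `reader` callable are all hypothetical, introduced so the example runs without `pandas` installed. Catching a broad `Exception` is also an assumption; the actual change may catch something narrower.

```python
def read_excel_with_fallback(path, reader, engine=None,
                             engines=("openpyxl", "xlrd"), **kwargs):
    """Read an Excel file, trying each engine in order until one succeeds.

    `reader` is any callable with a pandas.read_excel-like signature
    (a hypothetical injection point, so this sketch runs without pandas).
    """
    # An explicitly requested engine bypasses the fallback entirely.
    if engine is not None:
        return reader(path, engine=engine, **kwargs)

    if not engines:
        raise ValueError("at least one engine is required")

    last_error = None
    for candidate in engines:
        try:
            # First engine that parses the file wins.
            return reader(path, engine=candidate, **kwargs)
        except Exception as error:  # deliberately broad in this sketch
            last_error = error

    # Every engine failed: surface the error from the last attempt.
    raise last_error
```

With `pandas` available, this could be called as `read_excel_with_fallback("samples.xlsx", pd.read_excel)`, while passing `engine="xlrd"` would skip the fallback and read an old-format `.xls` file directly.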