Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate pandas read_excel engine from xlrd to openpyxl. #325

Merged
merged 2 commits into from
Jun 15, 2020
Merged

Migrate pandas read_excel engine from xlrd to openpyxl. #325

merged 2 commits into from
Jun 15, 2020

Conversation

JS3xton
Copy link
Contributor

@JS3xton JS3xton commented May 6, 2020

Synopsis

FlowCal relies on pandas to read Excel files, and pandas is changing its underlying parsing engine to move away from a package that is no longer maintained (xlrd). Moreover, xlrd recently started throwing DeprecationWarnings in some use cases. This pull request establishes a preference in FlowCal for the actively maintained openpyxl engine over xlrd. xlrd is still retained, though, because it can parse old-format Excel files (.xls).

Background – xlrd, defusedxml, getiterator, and DeprecationWarning

xlrd throws a DeprecationWarning when it uses the defusedxml XML parsing library to read an Excel file in Python 3.8. defusedxml, which is installed with Anaconda since v5.3.0, emulates the standard Python XML parsing library (xml.etree.ElementTree) and additionally prevents malicious XML operations, making it the preferred parser of xlrd. The DeprecationWarning occurs because xlrd misunderstands defusedxml's incomplete emulation of xml.etree.ElementTree and resorts to a function (getiterator) that has been deprecated since Python 3.2 and started throwing a DeprecationWarning in Python 3.8. This decision occurs despite the availability of the recommended alternative (iter) because xlrd initially checks for an ElementTree class in the parsing library before checking for iter, and the xml.etree.ElementTree module contains this class whereas the defusedxml.ElementTree module does not. When xlrd later encounters a defusedxml-produced xml.etree.ElementTree.ElementTree, it uses getiterator() instead of iter(), thus resulting in the DeprecationWarning.

Proposed solution

To address the DeprecationWarning and the impending transition to openpyxl, I modified FlowCal to read Excel files first with openpyxl and then upon failure with xlrd. I also exposed the engine parameter so it can alternatively be specified explicitly.

I chose to retain xlrd because it's the only package available that can read old-format Excel files (.xls). I also added a unit test for this.

NOTE: The openpyxl version I selected for the requirements.txt file is the version associated with Anaconda v4.3.0, which is the lowest version of Anaconda we support (below it, matplotlib violates requirements.txt).

Unit test behavior before pull request (9cd83ef)

All unit tests pass with the following builds: Python 3.8, Anaconda 2020.02; Python 3.8, vanilla (no Anaconda, +openpyxl); Python 3.8, vanilla −openpyxl; Python 2.7, Anaconda 2019.10; Python 2.7, vanilla; Python 2.7, vanilla −openpyxl; Python 2.7, Anaconda 4.3.0 (oldest supported).

python -m unittest test.test_excel_ui throws a DeprecationWarning with Python 3.8, Anaconda 2020.02.

Unit test behavior after pull request (cfcf012)

All unit tests pass with the same builds.

python -m unittest test.test_excel_ui no longer throws a DeprecationWarning with Python 3.8, Anaconda 2020.02.

Effects on performance

FlowCal reads Excel files ~50% slower with openpyxl than with xlrd in Python 3.8, Anaconda 2020.02:

Before pull request (9cd83ef) (xlrd):

>python -m timeit -n 1000 -s "import FlowCal as fc" "fc.excel_ui.read_table('./test/test_excel_ui.xlsx', sheetname='Instruments', index_col='ID')"
1000 loops, best of 5: 5.01 msec per loop

After pull request (cfcf012) (openpyxl):

>python -m timeit -n 1000 -s "import FlowCal as fc" "fc.excel_ui.read_table('./test/test_excel_ui.xlsx', sheetname='Instruments', index_col='ID')"
1000 loops, best of 5: 7.81 msec per loop

This is a tolerable slow down in my opinion.

-Modify FlowCal.excel_ui.read_table() to first try openpyxl engine
 when reading an Excel file and then xlrd (which is the only package
 that can read old-style XLS files).
-Expose the pd.read_excel() `engine` parameter to the user from
 FlowCal.excel_ui.read_table().
-Add new unit test that reads old-style XLS file.
-Modify requirements file to require openpyxl package.
@JS3xton
Copy link
Contributor Author

JS3xton commented May 6, 2020

We might also consider using openpyxl as the ExcelWriter engine, which would eliminate the need for the XlsxWriter package. I don't know if openpyxl can do everything we're currently doing with XlsxWriter, but this should be explored in a separate issue.

@castillohair
Copy link
Collaborator

We might also consider using openpyxl as the ExcelWriter engine, which would eliminate the need for the XlsxWriter package. I don't know if openpyxl can do everything we're currently doing with XlsxWriter, but this should be explored in a separate issue.

It definitely can. I'm using openpyxl as the only Excel engine in another package of mine.

The thing I don't like about openpyxl is that its API changes quite often. Although pandas is similarly fickle.

@JS3xton
Copy link
Contributor Author

JS3xton commented May 6, 2020

It definitely can. I'm using openpyxl as the only Excel engine in another package of mine.

The thing I don't like about openpyxl is that its API changes quite often.

Hmm, good to know.

Without openpyxl, I think xlrd is going to start actively failing with .xlsx files in Python 3.9 + Anaconda, so I still think incorporating openpyxl and trying it first is the best way to get ahead of this problem.

pandas is similarly fickle.

I still haven't forgiven them for sheetname -> sheet_name 😆 (although I sympathize with the underlying rationale).

@castillohair
Copy link
Collaborator

castillohair commented Jun 13, 2020

I'm not completely convinced we need to keep xldr. The code to support both seems complex, especially inside excel_ui.read_table(), and may prove to be difficult to maintain in the future. I especially dislike how the error messages are being parsed to make decisions.

I'm ok with dropping xlrd completely even if it means no support for .xls (are people still using that?). The resulting code would be simpler and easier to maintain.

@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 14, 2020

Superficially, I think having xlrd as a backup is superior to not having it, but I'm not especially tied to it (especially given that it's no longer maintained).

One thing to consider, though, is what dropping xlrd would do to our minimum package requirements. It looks like our minimum pandas requirement would go from 0.16.1 to 0.25.0, and pandas version 0.25.0 only supports Python 3. For Anaconda, the minimum version we would support out of the box for Python 3 would only be the current one (2020.02). And we would not support any Anaconda versions for Python 2 (the last one, 2019.07, has pandas version 0.24.2). So we'd have to rely on the requirements.txt file to manually update pandas for unsupported Anaconda versions, and I'm not even sure that would work for Python 2. It may therefore also be worth holding on to xlrd for Python backwards compatibility reasons. If we ever drop Python 2, that might be a good time to drop xlrd.


The decision-making on error messages was just me trying to be really precise in my error handling. I imagine most people would just except on the Errors (some of which are pretty general...) and then move on to their exception-handling code. We could fall back to that strategy if we wanted, which would be less vulnerable to changes in the error messages.

@castillohair
Copy link
Collaborator

OK, let's leave xlrd. But if we ever have issues with excel_ui.read_table(), I suggest immediately dropping xlrd. While we're at it, I'm in favor of dropping support for Python 2 as soon as it becomes slightly annoying to maintain it.

@castillohair castillohair merged commit 9af6468 into taborlab:develop Jun 15, 2020
@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 15, 2020

Dropping xlrd or Python 2 at the first sign of trouble seems good to me. A bunch of projects actually already pledged to drop Python 2. It's likely some of our user base are not programmers, though, and may benefit from prolonged support of old tools, so I think we should keep Python 2 around for a little longer if convenient. Dropping it will also likely require some work to purge all the compatibility code (e.g. all the six package references).

@JS3xton JS3xton deleted the excel-engine branch June 15, 2020 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants