Migrate pandas read_excel engine from xlrd to openpyxl. #325

JS3xton · 2020-05-06T04:56:48Z

Synopsis

FlowCal relies on pandas to read Excel files, and pandas is changing its underlying parsing engine to move away from a package that is no longer maintained (xlrd). Moreover, xlrd recently started throwing DeprecationWarnings in some use cases. This pull request establishes a preference in FlowCal for the actively maintained openpyxl engine over xlrd. xlrd is still retained, though, because it can parse old-format Excel files (.xls).

Background – `xlrd`, `defusedxml`, `getiterator`, and `DeprecationWarning`

xlrd throws a DeprecationWarning when it uses the defusedxml XML parsing library to read an Excel file in Python 3.8. defusedxml, which is installed with Anaconda since v5.3.0, emulates the standard Python XML parsing library (xml.etree.ElementTree) and additionally prevents malicious XML operations, making it the preferred parser of xlrd. The DeprecationWarning occurs because xlrd misunderstands defusedxml's incomplete emulation of xml.etree.ElementTree and resorts to a function (getiterator) that has been deprecated since Python 3.2 and started throwing a DeprecationWarning in Python 3.8. This decision occurs despite the availability of the recommended alternative (iter) because xlrd initially checks for an ElementTree class in the parsing library before checking for iter, and the xml.etree.ElementTree module contains this class whereas the defusedxml.ElementTree module does not. When xlrd later encounters a defusedxml-produced xml.etree.ElementTree.ElementTree, it uses getiterator() instead of iter(), thus resulting in the DeprecationWarning.

Proposed solution

To address the DeprecationWarning and the impending transition to openpyxl, I modified FlowCal to read Excel files first with openpyxl and then upon failure with xlrd. I also exposed the engine parameter so it can alternatively be specified explicitly.

I chose to retain xlrd because it's the only package available that can read old-format Excel files (.xls). I also added a unit test for this.

NOTE: The openpyxl version I selected for the requirements.txt file is the version associated with Anaconda v4.3.0, which is the lowest version of Anaconda we support (below it, matplotlib violates requirements.txt).

Unit test behavior before pull request (`9cd83ef`)

All unit tests pass with the following builds: Python 3.8, Anaconda 2020.02; Python 3.8, vanilla (no Anaconda, +openpyxl); Python 3.8, vanilla −openpyxl; Python 2.7, Anaconda 2019.10; Python 2.7, vanilla; Python 2.7, vanilla −openpyxl; Python 2.7, Anaconda 4.3.0 (oldest supported).

python -m unittest test.test_excel_ui throws a DeprecationWarning with Python 3.8, Anaconda 2020.02.

Unit test behavior after pull request (`cfcf012`)

All unit tests pass with the same builds.

python -m unittest test.test_excel_ui no longer throws a DeprecationWarning with Python 3.8, Anaconda 2020.02.

Effects on performance

FlowCal reads Excel files ~50% slower with openpyxl than with xlrd in Python 3.8, Anaconda 2020.02:

Before pull request (9cd83ef) (xlrd):

>python -m timeit -n 1000 -s "import FlowCal as fc" "fc.excel_ui.read_table('./test/test_excel_ui.xlsx', sheetname='Instruments', index_col='ID')"
1000 loops, best of 5: 5.01 msec per loop

After pull request (cfcf012) (openpyxl):

>python -m timeit -n 1000 -s "import FlowCal as fc" "fc.excel_ui.read_table('./test/test_excel_ui.xlsx', sheetname='Instruments', index_col='ID')"
1000 loops, best of 5: 7.81 msec per loop

This is a tolerable slow down in my opinion.

-Modify FlowCal.excel_ui.read_table() to first try openpyxl engine when reading an Excel file and then xlrd (which is the only package that can read old-style XLS files). -Expose the pd.read_excel() `engine` parameter to the user from FlowCal.excel_ui.read_table(). -Add new unit test that reads old-style XLS file. -Modify requirements file to require openpyxl package.

JS3xton · 2020-05-06T17:19:38Z

We might also consider using openpyxl as the ExcelWriter engine, which would eliminate the need for the XlsxWriter package. I don't know if openpyxl can do everything we're currently doing with XlsxWriter, but this should be explored in a separate issue.

castillohair · 2020-05-06T17:29:05Z

We might also consider using openpyxl as the ExcelWriter engine, which would eliminate the need for the XlsxWriter package. I don't know if openpyxl can do everything we're currently doing with XlsxWriter, but this should be explored in a separate issue.

It definitely can. I'm using openpyxl as the only Excel engine in another package of mine.

The thing I don't like about openpyxl is that its API changes quite often. Although pandas is similarly fickle.

JS3xton · 2020-05-06T18:58:28Z

It definitely can. I'm using openpyxl as the only Excel engine in another package of mine.

The thing I don't like about openpyxl is that its API changes quite often.

Hmm, good to know.

Without openpyxl, I think xlrd is going to start actively failing with .xlsx files in Python 3.9 + Anaconda, so I still think incorporating openpyxl and trying it first is the best way to get ahead of this problem.

pandas is similarly fickle.

I still haven't forgiven them for sheetname -> sheet_name 😆 (although I sympathize with the underlying rationale).

castillohair · 2020-06-13T18:55:14Z

I'm not completely convinced we need to keep xldr. The code to support both seems complex, especially inside excel_ui.read_table(), and may prove to be difficult to maintain in the future. I especially dislike how the error messages are being parsed to make decisions.

I'm ok with dropping xlrd completely even if it means no support for .xls (are people still using that?). The resulting code would be simpler and easier to maintain.

JS3xton · 2020-06-14T16:52:18Z

Superficially, I think having xlrd as a backup is superior to not having it, but I'm not especially tied to it (especially given that it's no longer maintained).

One thing to consider, though, is what dropping xlrd would do to our minimum package requirements. It looks like our minimum pandas requirement would go from 0.16.1 to 0.25.0, and pandas version 0.25.0 only supports Python 3. For Anaconda, the minimum version we would support out of the box for Python 3 would only be the current one (2020.02). And we would not support any Anaconda versions for Python 2 (the last one, 2019.07, has pandas version 0.24.2). So we'd have to rely on the requirements.txt file to manually update pandas for unsupported Anaconda versions, and I'm not even sure that would work for Python 2. It may therefore also be worth holding on to xlrd for Python backwards compatibility reasons. If we ever drop Python 2, that might be a good time to drop xlrd.

The decision-making on error messages was just me trying to be really precise in my error handling. I imagine most people would just except on the Errors (some of which are pretty general...) and then move on to their exception-handling code. We could fall back to that strategy if we wanted, which would be less vulnerable to changes in the error messages.

castillohair · 2020-06-15T01:13:49Z

OK, let's leave xlrd. But if we ever have issues with excel_ui.read_table(), I suggest immediately dropping xlrd. While we're at it, I'm in favor of dropping support for Python 2 as soon as it becomes slightly annoying to maintain it.

JS3xton · 2020-06-15T02:25:54Z

Dropping xlrd or Python 2 at the first sign of trouble seems good to me. A bunch of projects actually already pledged to drop Python 2. It's likely some of our user base are not programmers, though, and may benefit from prolonged support of old tools, so I think we should keep Python 2 around for a little longer if convenient. Dropping it will also likely require some work to purge all the compatibility code (e.g. all the six package references).

JS3xton added Excel UI tests labels May 6, 2020

JS3xton requested a review from castillohair May 6, 2020 04:56

Added openpyxl dependency to docs and setup files.

ed91526

This was referenced May 6, 2020

Migrate pandas ExcelWriter engine from XlsxWriter to openpyxl #326

Closed

Change pandas ExcelWriter engine from XlsxWriter to openpyxl #329

Merged

castillohair merged commit 9af6468 into taborlab:develop Jun 15, 2020

JS3xton deleted the excel-engine branch June 15, 2020 02:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate pandas read_excel engine from xlrd to openpyxl. #325

Migrate pandas read_excel engine from xlrd to openpyxl. #325

JS3xton commented May 6, 2020

JS3xton commented May 6, 2020

castillohair commented May 6, 2020

JS3xton commented May 6, 2020

castillohair commented Jun 13, 2020 •

edited

Loading

JS3xton commented Jun 14, 2020

castillohair commented Jun 15, 2020

JS3xton commented Jun 15, 2020

Migrate pandas read_excel engine from xlrd to openpyxl. #325

Migrate pandas read_excel engine from xlrd to openpyxl. #325

Conversation

JS3xton commented May 6, 2020

Synopsis

Background – xlrd, defusedxml, getiterator, and DeprecationWarning

Proposed solution

Unit test behavior before pull request (9cd83ef)

Unit test behavior after pull request (cfcf012)

Effects on performance

JS3xton commented May 6, 2020

castillohair commented May 6, 2020

JS3xton commented May 6, 2020

castillohair commented Jun 13, 2020 • edited Loading

JS3xton commented Jun 14, 2020

castillohair commented Jun 15, 2020

JS3xton commented Jun 15, 2020

Background – `xlrd`, `defusedxml`, `getiterator`, and `DeprecationWarning`

Unit test behavior before pull request (`9cd83ef`)

Unit test behavior after pull request (`cfcf012`)

castillohair commented Jun 13, 2020 •

edited

Loading