Openpyxl engine for reading excel files #25092

tdamsma · 2019-02-02T13:10:08Z

closes Support for openpyxl when reading XLSX (Excel 2010) files #11499
tests added
tests passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Builds upon #26233, so that should be merged first.

This PR adds an almost fully compatible OpenPyXL based engine to read_excel, next to the xlrd engine.

Using OpenPyXL will also allow reading and writing in excel tables referenced by name which I intend to put in a future PR, see also discussion of #24862.

As the current xlrd implementation is pretty tightly coupled to the text-based parsing options, there are many supported keywords in read_excel that I think eventually do not belong there. With this PR I left all of them in and support them as much as possible, even though this:

Makes the code much more complicated
Makes the code slower

The way this is implemented is that the Excel file is read into a DataFrame early, and all the keywords like converters, dtype, convert_float etc. are dealt with by modifying the DataFrame. Because of this, some of the parser code needed to be re-implemented, so this is not ideal. An alternative would be to just not support these keywords at all for the openpyxl engine.
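
The "read first, post-process afterwards" approach can be sketched without pandas. This is only an illustration and not the code in this PR: plain lists stand in for the DataFrame, the helper names `convert_integral_float` and `postprocess_rows` are made up, and the rule that a column with an explicit converter skips the float conversion is an assumption.

```python
from typing import Any, Callable, Dict, List, Optional

Row = List[Any]


def convert_integral_float(value: Any) -> Any:
    """Turn 1.0 into 1; leave 2.5, strings, None, etc. untouched."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def postprocess_rows(
    header: List[str],
    rows: List[Row],
    converters: Optional[Dict[str, Callable[[Any], Any]]] = None,
    convert_float: bool = True,
) -> List[Row]:
    """Apply keyword-style options to cell values that are already typed."""
    converters = converters or {}
    result = []
    for row in rows:
        new_row = []
        for name, value in zip(header, row):
            if name in converters:
                # Assumption: an explicit converter wins over convert_float.
                value = converters[name](value)
            elif convert_float:
                value = convert_integral_float(value)
            new_row.append(value)
        result.append(new_row)
    return result


# Hypothetical cell values as a reader might hand them over (already typed).
header = ["id", "label", "score"]
raw = [[1.0, "a", 2.5], [2.0, "b", 4.0]]
cleaned = postprocess_rows(header, raw, converters={"label": str.upper})
print(cleaned)
```

The point of the sketch is the ordering: the cells arrive typed, so every option has to be expressed as a transformation of values rather than as a text-parsing rule, which is where the extra complexity comes from.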

pep8speaks · 2019-02-02T13:10:13Z

Hello @tdamsma! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-28 14:48:52 UTC

codecov · 2019-02-02T13:42:10Z

Codecov Report

Merging #25092 into master will decrease coverage by 49.49%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25092      +/-   ##
==========================================
- Coverage   92.37%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52420    52420              
==========================================
- Hits        48423    22479   -25944     
- Misses       3997    29941   +25944

Flag	Coverage Δ
#multiple	`?`
#single	`42.88% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
pandas/core/tools/numeric.py	`10.44% <0%> (-89.56%)`	⬇️
... and 123 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bb43726...821fa4d.

codecov · 2019-02-02T13:42:10Z

Codecov Report

Merging #25092 into master will decrease coverage by 50.58%.
The diff coverage is 14.81%.

@@             Coverage Diff             @@
##           master   #25092       +/-   ##
===========================================
- Coverage   91.71%   41.13%   -50.59%     
===========================================
  Files         178      178               
  Lines       50771    50931      +160     
===========================================
- Hits        46567    20949    -25618     
- Misses       4204    29982    +25778

Flag	Coverage Δ
#multiple	`?`
#single	`41.13% <14.81%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/excel/_openpyxl.py	`12.97% <14.37%> (-71.74%)`	⬇️
pandas/io/excel/_base.py	`28.5% <50%> (-63.32%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/plotting/_matplotlib/__init__.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
... and 134 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d47fc0c...614d972.

tdamsma · 2019-02-02T13:43:17Z

@WillAyd This is still a work in progress, but feedback would be welcome. I had to make a few adjustments to the tests to be able to run them for a different engine.
As openpyxl is pretty smart in parsing the Excel files to data, I decided not to use the TextParser or PythonParser, but to re-implement some of the API specifically for the OpenPyXL reader. Not sure if this is really desirable, but as the parsers all seem to be meant to deal with text files, this felt like the best approach. On my local machine 70 of the 78 tests for the openpyxl reader are passing.

jreback · 2019-02-02T13:46:07Z

I think I mentioned this before,
but I would like to split up excel.py first into a subdir, so that each reader / writer can go in slimmer files

would take a precursor PR to do this

tdamsma · 2019-02-02T15:53:59Z

oops, think I missed that. That would certainly help as the file is huge.

tdamsma · 2019-02-07T13:50:44Z

pandas/tests/io/test_excel.py

@@ -119,11 +134,11 @@ def get_exceldf(self, basename, ext, *args, **kwds):
 class ReadingTestsBase(SharedItems):
    # This is based on ExcelWriterBase

-    @pytest.fixture(autouse=True, params=['xlrd', None])
-    def set_engine(self, request):
+    @pytest.fixture(autouse=True)


I removed the use of engine as a parametrized fixture, in favour of setting it as a class attribute of derived, engine specific, classes

This should be reverted and just parametrize on openpyxl as well

Still want to revert this. We prefer parametrization to creating subclasses for testing, and this also had the side effect of reducing coverage on None

Ok, will give it a go

Still not clear on what the point of removing this was - we moved away from subclasses previously so this takes us backwards. This also removes coverage for engine=None - what issue was this causing that required reverting to subclasses?

tdamsma · 2019-02-07T13:52:56Z

pandas/tests/io/test_excel.py

+
+        if (url_table.columns[0] not in local_table.columns
+                and url_table.columns[0] == local_table.columns[0]):
+            pytest.skip('?!? what is going on here?')


I have no idea what is going on here. Both columns are equal, but one is not in a list that contains the other?

WillAyd · 2019-02-11T22:21:12Z

pandas/tests/io/test_excel.py

@@ -119,11 +134,11 @@ def get_exceldf(self, basename, ext, *args, **kwds):
 class ReadingTestsBase(SharedItems):
    # This is based on ExcelWriterBase

-    @pytest.fixture(autouse=True, params=['xlrd', None])
-    def set_engine(self, request):
+    @pytest.fixture(autouse=True)


This should be reverted and just parametrize on openpyxl as well

WillAyd · 2019-02-11T22:22:15Z

pandas/tests/io/test_excel.py

@@ -49,6 +49,21 @@ def ignore_xlrd_time_clock_warning():
        yield


+@contextlib.contextmanager
+def ignore_openpyxl_unknown_extension_warning():


Why do you need this?

Openpyxl does not support all elements that are in xlsx sheets, and raises warnings for when these are found. See also https://bitbucket.org/openpyxl/openpyxl/issues/537/userwarning-unknown-extension-is-not and https://stackoverflow.com/questions/34322231/python-2-7-openpyxl-userwarning.

This code suppresses that warning as the tests are making assertions about which warnings should be raised.
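
A minimal sketch of what such a context manager could look like, using only the standard library. The function name is taken from the diff above, but the body is an assumption: it presumes the openpyxl warning is a `UserWarning` whose message starts with "Unknown extension", and it is not necessarily what the PR contained.

```python
import contextlib
import warnings


@contextlib.contextmanager
def ignore_openpyxl_unknown_extension_warning():
    # catch_warnings restores the previous filter list on exit, so the
    # "ignore" rule only applies inside the with-block.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="Unknown extension",  # matched against the start of the message
            category=UserWarning,
        )
        yield
```

Because only one message pattern is filtered, other warnings raised inside the block still reach assertions such as `tm.assert_produces_warning`.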

Do we still need this? I think from the cleanup PR you had we should have done away with unknown extensions, no?

Unknown extensions refer to Excel internals like Conditional Formatting. That is an Excel extension that openpyxl does not support. Perhaps I can implement a more generic warnings filter as a fixture.

Do you know which files were actually causing this? If conditional formatting is what is causing this I'm hesitant to blanket add this as I would think issues would be limited to a handful of files at most

tdamsma · 2019-02-12T08:35:54Z

@WillAyd I somehow can't comment inline on

This should be reverted and just parametrize on openpyxl as well

but I really don't see how this should work with how the tests are set up now. As I see it, the TestXlrdReader is responsible for running the ReadingTestsBase tests with the 'xlrd' engine. So I added a TestOpenpyxlReader class for running the openpyxl tests. But as the subclasses are responsible for defining the engine, there is no need for the base class to parametrize the engine. In fact, this would cause the xlrd-specific tests to be run with the openpyxl engine?

Of course the writer tests follow a different pattern. This leads to a pretty complicated decorator like

@pytest.mark.parametrize("engine,ext", [
    pytest.param('openpyxl', '.xlsx', marks=pytest.mark.skipif(
        not td.safe_import('openpyxl'), reason='No openpyxl')),
    pytest.param('openpyxl', '.xlsm', marks=pytest.mark.skipif(
        not td.safe_import('openpyxl'), reason='No openpyxl')),
    pytest.param('xlwt', '.xls', marks=pytest.mark.skipif(
        not td.safe_import('xlwt'), reason='No xlwt')),
    pytest.param('xlsxwriter', '.xlsx', marks=pytest.mark.skipif(
        not td.safe_import('xlsxwriter'), reason='No xlsxwriter'))
])

and then every set of engine-specific tests needs to redefine which extension and engine combinations it should run for. I guess, judging from the naming (ReadingTestsBase vs _WriterBase), the sets of read and write tests follow different patterns anyway.

jreback · 2019-06-28T12:40:54Z

doc/source/whatsnew/v0.25.0.rst

@@ -133,6 +133,7 @@ Other Enhancements
 - :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
 - Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`)
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
+- :func:`read_excel` can now use openpyxl to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)


would put openpyxl in double-backquotes, but if this is the only issue, then can later

jreback · 2019-06-28T12:41:31Z

lgtm. @WillAyd

jreback

actually we need to add options for this in config_init.py; we have writers, need to add this to readers (and actually test this). I think this should be in this PR.

tdamsma · 2019-06-28T14:11:21Z

@WillAyd, there are still some situations where we need a context manager for openpyxl tests. And as these are the same places where we also need the xlrd context manager (because this is to suppress engine-related warnings to be able to assert pandas warnings are raised), the tests get really ugly (nested context managers). That is why I proposed to combine the warning suppression for both engines, but as you disagree, do you have an alternative?

jreback · 2019-06-28T14:23:47Z

tests are here for the writers: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/excel/test_writers.py#L224

you can do something similar for the readers

WillAyd · 2019-06-28T14:27:38Z

@tdamsma I'd rather not just blanket catch everything in one context manager because it makes it harder to decouple down the road. Taking a look now

tdamsma · 2019-06-28T14:37:05Z

@jreback,

tests are here for the writers: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/excel/test_writers.py#L224

you can do something similar for the readers

Thanks, probably will have to wait till Monday though

WillAyd · 2019-06-28T14:50:11Z

@tdamsma rather than catch the openpyxl warning I just regenerated the Excel files causing them. Whatever was throwing that is not relevant to the failing test and probably just an artifact of an old file conversion

WillAyd · 2019-06-28T15:04:43Z

@jreback as mentioned in person I want to clean up the testing of option setting even on the writing side (_WriterBase) so would rather tackle the request to do something similar for reading in a follow up PR

WillAyd · 2019-06-28T16:44:31Z

@tdamsma thanks for all your help here!

Themanwithoutaplan · 2019-06-28T16:54:56Z

pandas/io/excel/_openpyxl.py

+    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
+        from openpyxl import load_workbook
+        return load_workbook(filepath_or_buffer,
+                             read_only=True, data_only=True)


FWIW, you almost certainly want keep_links=False in there as well. On files that include data from other workbooks, Excel creates caches of the relevant worksheets and openpyxl preserves them by default. These can be pretty big, are read into memory, and are almost certainly irrelevant for pandas.read_excel(). For an example see openpyxl bug #494.

Good suggestion

for the record:

https://openpyxl.readthedocs.io/en/stable/api/openpyxl.reader.excel.html?highlight=keep_links

https://bitbucket.org/openpyxl/openpyxl/issues/494
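
For illustration, the suggestion amounts to one extra keyword on the call shown in the diff. The standalone function below is a sketch, not the merged method: the name `load_workbook_for_reading` is made up, and it assumes an openpyxl version whose `load_workbook` accepts `keep_links`.

```python
def load_workbook_for_reading(filepath_or_buffer):
    # Imported lazily, as in the diff above, so openpyxl stays optional.
    from openpyxl import load_workbook

    return load_workbook(
        filepath_or_buffer,
        read_only=True,    # stream cells instead of building the full workbook
        data_only=True,    # return cached values rather than formulas
        keep_links=False,  # skip cached copies of externally linked workbooks
    )
```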

tdamsma · 2019-06-28T18:14:58Z

@tdamsma thanks for all your help here!

Thank you for your patience!

cjw296 · 2019-08-12T22:05:03Z

What's the state of this? I see it's merged but is it in a release? Still getting people opening pandas issues against xlrd for xlsx files: https://github.com/python-excel/xlrd/issues/355

jreback · 2019-08-12T22:07:11Z

0.25.0

Maverick494 · 2019-08-12T22:15:55Z

I have 0.25.0 and it is telling me that I need xlrd to read_excel.

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

pip list
pandas 0.25.0

Even if I install it the result is:

File "d:\python\python37\lib\site-packages\pandas\io\excel\_base.py", line 356, in __init__
filepath_or_buffer.seek(0)

UnsupportedOperation: seek

miroli · 2019-09-16T10:15:27Z

@Maverick494 It has yet to show up in the documentation, but it's now possible to specify openpyxl as the engine, i.e. pd.read_excel(<path>, engine='openpyxl'). Works for me with pandas 0.25.1.
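
Spelled out as a small helper, the call could look like the sketch below. The wrapper name `read_xlsx_with_openpyxl` is hypothetical; it assumes pandas 0.25.0 or later with openpyxl installed.

```python
def read_xlsx_with_openpyxl(path, **kwargs):
    # Lazy import keeps this snippet importable without pandas installed.
    import pandas as pd

    # engine="openpyxl" selects the new reader instead of the xlrd default.
    return pd.read_excel(path, engine="openpyxl", **kwargs)
```

Usage would be along the lines of `df = read_xlsx_with_openpyxl("report.xlsx", sheet_name=0)`, where `report.xlsx` is a placeholder file name.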

cjw296 · 2019-09-16T11:41:48Z

Please can you change the default to be openpyxl?

tdamsma added 3 commits February 2, 2019 13:46

prepare testing reading excel files with multiple engines

e29b4c0

add openpyxl tests

e0199a8

implement first version of openpyxl reader

ce4eb01

tdamsma added 2 commits February 2, 2019 14:23

pep8 issues

b25877e

suppress openpyxl warnings

821fa4d

jreback added the IO Excel read_excel, to_excel label Feb 2, 2019

tdamsma added 4 commits February 7, 2019 14:14

add code for all edge cases that are tested for. Unfortunately got pretty messy

4694668

formatting

712f1ef

Merge commit '683c7b55f5195fdf4f524239066cbf6f1301f0e7' into openpyxl-reader

1d49a0e

improve docstring

1473c0e

tdamsma commented Feb 7, 2019

tdamsma added 2 commits February 7, 2019 14:54

also test openpyxl reader for .xlsm files

6e8ffba

explicitly use 64bit floats and ints

d57dfc1

tdamsma mentioned this pull request Feb 10, 2019

Split Excel IO Into Sub-Directory #25153

Merged

tdamsma added 3 commits February 11, 2019 20:46

Merge commit '6359bbc4c9ce6dd05bc8b422641cda74871cde43' into openpyxl-reader

e984f6b

# Conflicts: # pandas/io/excel.py Manually migrated changes to _openpyxl.py

formatting

44f7af2

skip TestOpenpyxlReader when openpyxl is not installed

98d3865

WillAyd requested changes Feb 11, 2019

WillAyd added this to the 0.25.0 milestone Feb 11, 2019

Attempt to generalize _XlrdReader __init__ and move it to _BaseExcelReader

d0188ba

tdamsma added 2 commits February 20, 2019 09:41

Merge commit 'f4568fd76e864d8aee3d23f5a81302262d6e0dcb' into openpyxl-reader

205d52b

register openpyxl writer engine, fix imports

7b550bf

tdamsma and others added 2 commits June 28, 2019 14:30

revert test_reader changes again. Not needed anymore because of using openpyxl in read_only mode

6258e59

more types and whitespace cleanup

00f34b1

jreback reviewed Jun 28, 2019

jreback approved these changes Jun 28, 2019

jreback requested changes Jun 28, 2019

Added config for excel reader. Not sure how to test this

a1fba90

WillAyd added 2 commits June 28, 2019 09:28

whatsnew

88ee325

Merge remote-tracking branch 'upstream/master' into openpyxl-reader

837ce26

Regenerated test1 files

dddc8c5

WillAyd mentioned this pull request Jun 28, 2019

Clean Up Testing of Excel Engine Option Setting for Readers/Writers #27096

Closed

jreback approved these changes Jun 28, 2019

WillAyd approved these changes Jun 28, 2019

WillAyd merged commit 1be0561 into pandas-dev:master Jun 28, 2019

Themanwithoutaplan reviewed Jun 28, 2019

simonjayhawkins mentioned this pull request Jun 29, 2019

TST: Decoupled more xlrd reading tests from openpyxl #27114

Merged

WillAyd mentioned this pull request Jul 7, 2019

Enhancement: XLSB support in read_excel() #8540

Closed
