Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Openpyxl engine for reading excel files #25092

Merged
merged 87 commits into from
Jun 28, 2019

Conversation

tdamsma
Copy link
Contributor

@tdamsma tdamsma commented Feb 2, 2019

Builds upon #26233, so that should be merged first.

This PR adds an almost fully compatible OpenPyXL based engine to excel_read, next to the xlrd engine.

Using OpenPyXL will also allow reading and writing in excel tables referenced by name which I intend to put in a future PR, see also discussion of #24862.

As the current xlrd implementation is pretty tightly coupled to the the text based parsing options, there are many supported keywords in excel_read that I think eventually do not belong there. With this PR I left all of them in and support them as much as possible even though this:

  • Make the code much more complicated
  • Makes the code slower

The way this is implemented is that the excel file is read into a dataframe early, and all the keywords like convert, dtype, convert_float etc are dealt with by modifying the dataframe. Because of this some of the parser code needed to be re-implemented, so this is not ideal. An alternative would be to just not support these keywords at all for the openpyxl engine.

@pep8speaks
Copy link

pep8speaks commented Feb 2, 2019

Hello @tdamsma! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-28 14:48:52 UTC

@codecov
Copy link

codecov bot commented Feb 2, 2019

Codecov Report

Merging #25092 into master will decrease coverage by 49.49%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25092      +/-   ##
==========================================
- Coverage   92.37%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52420    52420              
==========================================
- Hits        48423    22479   -25944     
- Misses       3997    29941   +25944
Flag Coverage Δ
#multiple ?
#single 42.88% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 123 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb43726...821fa4d. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 2, 2019

Codecov Report

Merging #25092 into master will decrease coverage by 50.58%.
The diff coverage is 14.81%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25092       +/-   ##
===========================================
- Coverage   91.71%   41.13%   -50.59%     
===========================================
  Files         178      178               
  Lines       50771    50931      +160     
===========================================
- Hits        46567    20949    -25618     
- Misses       4204    29982    +25778
Flag Coverage Δ
#multiple ?
#single 41.13% <14.81%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/io/excel/_openpyxl.py 12.97% <14.37%> (-71.74%) ⬇️
pandas/io/excel/_base.py 28.5% <50%> (-63.32%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/plotting/_matplotlib/__init__.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
... and 134 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d47fc0c...614d972. Read the comment docs.

@tdamsma
Copy link
Contributor Author

tdamsma commented Feb 2, 2019

@WillAyd This is still a work in progress, but feedback would be welcome. I had to make a few adjustments to the tests to be able to run them for a different engine.
As openpyxl is pretty smart in parsing the excel files to data I decided to not use the TextParser of PythonParser, but (re-implement) some if the api specifically for the OpenPyXL reader. Not sure if this is really desirable, but as the parsers all seem to be meant to deal with text files this felt like the best approach. On my local machine 70 of the 78 tests for openpyxl reader are passing

@jreback
Copy link
Contributor

jreback commented Feb 2, 2019

I think i mentioned this before
but would like to split up the excel.py first to make a subdir then each reader / writer can go in slimmer files

would take a precursor PR to do this

@tdamsma
Copy link
Contributor Author

tdamsma commented Feb 2, 2019

oops, think I missed that. That would certainly help as the file is huge.

@jreback jreback added the IO Excel read_excel, to_excel label Feb 2, 2019
@@ -119,11 +134,11 @@ def get_exceldf(self, basename, ext, *args, **kwds):
class ReadingTestsBase(SharedItems):
# This is based on ExcelWriterBase

@pytest.fixture(autouse=True, params=['xlrd', None])
def set_engine(self, request):
@pytest.fixture(autouse=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the use of engine as a parametrized fixture, in favour of setting it as a class attribute of derived, engine specific, classes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be reverted and just parametrize on openpyxl as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still want to revert this. We prefer parametrization to creating subclasses for testing, and this also had the side effect of reducing coverage on None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will give it a go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not clear on what the point of removing this was - we moved away from subclasses previously so this takes us backwards. This also removes coverage for engine=None - what issue was this causing that required reverting to subclasses?


if (url_table.columns[0] not in local_table.columns
and url_table.columns[0] == local_table.columns[0]):
pytest.skip('?!? what is going on here?')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what is going on here. Both columns are equal, but one is not in a list that contains the other?

pandas/io/excel/_openpyxl.py Outdated Show resolved Hide resolved
pandas/io/excel/_base.py Show resolved Hide resolved
@@ -119,11 +134,11 @@ def get_exceldf(self, basename, ext, *args, **kwds):
class ReadingTestsBase(SharedItems):
# This is based on ExcelWriterBase

@pytest.fixture(autouse=True, params=['xlrd', None])
def set_engine(self, request):
@pytest.fixture(autouse=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be reverted and just parametrize on openpyxl as well

@@ -49,6 +49,21 @@ def ignore_xlrd_time_clock_warning():
yield


@contextlib.contextmanager
def ignore_openpyxl_unknown_extension_warning():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this?

Copy link
Contributor Author

@tdamsma tdamsma Feb 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Openpyxl does not support all elements that are in xlsx sheets, and raises warnings for when these are found. See also https://bitbucket.org/openpyxl/openpyxl/issues/537/userwarning-unknown-extension-is-not and https://stackoverflow.com/questions/34322231/python-2-7-openpyxl-userwarning.

This code suppresses that warning as the tests are making assertions about which warnings should be raised.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this? I think from the cleanup PR you had we should have done away with unknown extensions, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unknown extensions refer to excel internals like Conditional Formatting. That is and excel extension that openpyxl does not support. Perhaps I can implement a more generic warnings filter as a fixture

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know which files were actually causing this? If conditional formatting is what is causing this I'm hesitant to blanket add this as I would think issues would be limited to a handful of files at most

@WillAyd WillAyd added this to the 0.25.0 milestone Feb 11, 2019
@tdamsma
Copy link
Contributor Author

tdamsma commented Feb 12, 2019

@WillAyd I somehow can't comment inline on

This should be reverted and just parametrize on openpyxl as well

but I really see how this should work with how the tests are set-up now. As i see it the TestXlrdReader is responsible for running the ReadingTestsBase tests with the 'xlrd' engine. So I added a TestOpenpyxlReader class for running the openpyxl tests. But as the subclasses are responsible for defining the engine, there is no need for the baseclass to parametrize the engine. In fact this would cause the xlrd specific tests to be run with the openpyxl engine?

Of course the writer tests follow a different pattern This leads to a pretty complicated decorator like

@pytest.mark.parametrize("engine,ext", [
    pytest.param('openpyxl', '.xlsx', marks=pytest.mark.skipif(
        not td.safe_import('openpyxl'), reason='No openpyxl')),
    pytest.param('openpyxl', '.xlsm', marks=pytest.mark.skipif(
        not td.safe_import('openpyxl'), reason='No openpyxl')),
    pytest.param('xlwt', '.xls', marks=pytest.mark.skipif(
        not td.safe_import('xlwt'), reason='No xlwt')),
    pytest.param('xlsxwriter', '.xlsx', marks=pytest.mark.skipif(
        not td.safe_import('xlsxwriter'), reason='No xlsxwriter'))
])

and then every set of engine specific tests need redefine which extension and engine combinations it should run for. I guess judging from the naming (ReadingTestsBase vs _WriterBase) the sets of read and write tests follow different patterns anyway

@@ -133,6 +133,7 @@ Other Enhancements
- :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
- Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`)
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
- :func:`read_excel` can now use openpyxl to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would put openpyxl in double-backquotes, but if this is the only issue, then can later

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

lgtm. @WillAyd

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we need to add options for this in config_init.py; we have writers, need to add this to readers (and actually test this). I think this should be in this PR.

@tdamsma
Copy link
Contributor Author

tdamsma commented Jun 28, 2019

@WillAyd, there are still some situations where we need a context manager for openpyxl tests. And as this is in the same places we also need the xlrd context manager (because this is to suppress engine related warnings to be able to assert pandas warnings are raised), the tests get really ugly (nested context managers). That is why I proposed to combine the warning suppression for both engines, but as you disgaree, do you have a, alternative?

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

tests are here for the writers: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/excel/test_writers.py#L224

you can do something similar for the readers

@WillAyd
Copy link
Member

WillAyd commented Jun 28, 2019

@tdamsma I'd rather not just blanket catch everything in one context manager because it makes it harder to decouple down the road. Taking a look now

@tdamsma
Copy link
Contributor Author

tdamsma commented Jun 28, 2019

@jreback, th

tests are here for the writers: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/excel/test_writers.py#L224

you can do something similar for the readers

Thanks, probably will have to wait till Monday though

@WillAyd
Copy link
Member

WillAyd commented Jun 28, 2019

@tdamsma rather than catch the openpyxl warning I just regenerated the Excel files causing them. Whatever was throwing that is not relevant to the failing test and probably just an artifact of an old file conversion

@WillAyd
Copy link
Member

WillAyd commented Jun 28, 2019

@jreback as mentioned in person I want to clean up the testing of option setting even on the writing side (_WriterBase) so would rather tackle the request to do something similar for reading in a follow up PR

@WillAyd WillAyd merged commit 1be0561 into pandas-dev:master Jun 28, 2019
@WillAyd
Copy link
Member

WillAyd commented Jun 28, 2019

@tdamsma thanks for all your help here!

def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
from openpyxl import load_workbook
return load_workbook(filepath_or_buffer,
read_only=True, data_only=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, you almost certainly want keep_links=False in there as well. On files that include data from other workbooks, Excel creates caches of the relevant worksheets and openpyxl preserves them by default. These can be pretty big and are read into memory and almost certainly irrelevant for Pandas.from_excel(). For an example see openpyxl bug #494.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdamsma
Copy link
Contributor Author

tdamsma commented Jun 28, 2019

@tdamsma thanks for all your help here!

Thank you for your patience!

@cjw296
Copy link
Contributor

cjw296 commented Aug 12, 2019

What's the state of this? I see it's merged but is it in a release? Still getting people opening pandas issues against xlrd for xlsx files: https://github.com/python-excel/xlrd/issues/355

@jreback
Copy link
Contributor

jreback commented Aug 12, 2019

0.25.0

@Maverick494
Copy link

Maverick494 commented Aug 12, 2019

I have 0.25.0 and it is telling me that I need xlrd to read_excel.

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

pip list
pandas 0.25.0

Even if I install it the result is:

File "d:\python\python37\lib\site-packages\pandas\io\excel_base.py", line 356, in init
filepath_or_buffer.seek(0)

UnsupportedOperation: seek

@miroli
Copy link

miroli commented Sep 16, 2019

@Maverick494 It has yet to show up in the documentation, but it's now possible to specify openpyxl as the engine, i.e pd.read_excel(<path>, engine='openpyxl'). Works for me with pandas 0.25.1.

@cjw296
Copy link
Contributor

cjw296 commented Sep 16, 2019

Please can you change the default to be openpyxl?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for openpyxl when reading XLSX (Excel 2010) files
8 participants