-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Openpyxl engine for reading excel files #25092
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25092 +/- ##
==========================================
- Coverage 92.37% 42.88% -49.5%
==========================================
Files 166 166
Lines 52420 52420
==========================================
- Hits 48423 22479 -25944
- Misses 3997 29941 +25944
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25092 +/- ##
===========================================
- Coverage 91.71% 41.13% -50.59%
===========================================
Files 178 178
Lines 50771 50931 +160
===========================================
- Hits 46567 20949 -25618
- Misses 4204 29982 +25778
Continue to review full report at Codecov.
|
@WillAyd This is still a work in progress, but feedback would be welcome. I had to make a few adjustments to the tests to be able to run them for a different engine. |
I think i mentioned this before would take a precursor PR to do this |
oops, think I missed that. That would certainly help as the file is huge. |
pandas/tests/io/test_excel.py
Outdated
@@ -119,11 +134,11 @@ def get_exceldf(self, basename, ext, *args, **kwds): | |||
class ReadingTestsBase(SharedItems): | |||
# This is based on ExcelWriterBase | |||
|
|||
@pytest.fixture(autouse=True, params=['xlrd', None]) | |||
def set_engine(self, request): | |||
@pytest.fixture(autouse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the use of engine as a parametrized fixture, in favour of setting it as a class attribute of derived, engine specific, classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be reverted and just parametrize on openpyxl
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still want to revert this. We prefer parametrization to creating subclasses for testing, and this also had the side effect of reducing coverage on None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will give it a go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not clear on what the point of removing this was - we moved away from subclasses previously so this takes us backwards. This also removes coverage for engine=None
- what issue was this causing that required reverting to subclasses?
pandas/tests/io/test_excel.py
Outdated
|
||
if (url_table.columns[0] not in local_table.columns | ||
and url_table.columns[0] == local_table.columns[0]): | ||
pytest.skip('?!? what is going on here?') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what is going on here. Both columns are equal, but one is not in a list that contains the other?
…-reader # Conflicts: # pandas/io/excel.py Manually migrated changes to _openpyxl.py
pandas/tests/io/test_excel.py
Outdated
@@ -119,11 +134,11 @@ def get_exceldf(self, basename, ext, *args, **kwds): | |||
class ReadingTestsBase(SharedItems): | |||
# This is based on ExcelWriterBase | |||
|
|||
@pytest.fixture(autouse=True, params=['xlrd', None]) | |||
def set_engine(self, request): | |||
@pytest.fixture(autouse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be reverted and just parametrize on openpyxl
as well
pandas/tests/io/test_excel.py
Outdated
@@ -49,6 +49,21 @@ def ignore_xlrd_time_clock_warning(): | |||
yield | |||
|
|||
|
|||
@contextlib.contextmanager | |||
def ignore_openpyxl_unknown_extension_warning(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Openpyxl does not support all elements that are in xlsx sheets, and raises warnings for when these are found. See also https://bitbucket.org/openpyxl/openpyxl/issues/537/userwarning-unknown-extension-is-not and https://stackoverflow.com/questions/34322231/python-2-7-openpyxl-userwarning.
This code suppresses that warning as the tests are making assertions about which warnings should be raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this? I think from the cleanup PR you had we should have done away with unknown extensions, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unknown extensions refer to excel internals like Conditional Formatting. That is and excel extension that openpyxl does not support. Perhaps I can implement a more generic warnings filter as a fixture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know which files were actually causing this? If conditional formatting is what is causing this I'm hesitant to blanket add this as I would think issues would be limited to a handful of files at most
@WillAyd I somehow can't comment inline on
but I really see how this should work with how the tests are set-up now. As i see it the Of course the writer tests follow a different pattern This leads to a pretty complicated decorator like
and then every set of engine specific tests need redefine which extension and engine combinations it should run for. I guess judging from the naming ( |
… openpyxl in read_only mode
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -133,6 +133,7 @@ Other Enhancements | |||
- :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`) | |||
- Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`) | |||
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`) | |||
- :func:`read_excel` can now use openpyxl to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would put openpyxl in double-backquotes, but if this is the only issue, then can later
lgtm. @WillAyd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we need to add options for this in config_init.py; we have writers, need to add this to readers (and actually test this). I think this should be in this PR.
@WillAyd, there are still some situations where we need a context manager for openpyxl tests. And as this is in the same places we also need the xlrd context manager (because this is to suppress engine related warnings to be able to assert pandas warnings are raised), the tests get really ugly (nested context managers). That is why I proposed to combine the warning suppression for both engines, but as you disgaree, do you have a, alternative? |
tests are here for the writers: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/excel/test_writers.py#L224 you can do something similar for the readers |
@tdamsma I'd rather not just blanket catch everything in one context manager because it makes it harder to decouple down the road. Taking a look now |
@jreback, th
Thanks, probably will have to wait till Monday though |
@tdamsma rather than catch the openpyxl warning I just regenerated the Excel files causing them. Whatever was throwing that is not relevant to the failing test and probably just an artifact of an old file conversion |
@jreback as mentioned in person I want to clean up the testing of option setting even on the writing side (_WriterBase) so would rather tackle the request to do something similar for reading in a follow up PR |
@tdamsma thanks for all your help here! |
def load_workbook(self, filepath_or_buffer: FilePathOrBuffer): | ||
from openpyxl import load_workbook | ||
return load_workbook(filepath_or_buffer, | ||
read_only=True, data_only=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, you almost certainly want keep_links=False
in there as well. On files that include data from other workbooks, Excel creates caches of the relevant worksheets and openpyxl preserves them by default. These can be pretty big and are read into memory and almost certainly irrelevant for Pandas.from_excel(). For an example see openpyxl bug #494.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your patience! |
What's the state of this? I see it's merged but is it in a release? Still getting people opening pandas issues against xlrd for xlsx files: https://github.com/python-excel/xlrd/issues/355 |
0.25.0 |
I have 0.25.0 and it is telling me that I need xlrd to read_excel. ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd. pip list Even if I install it the result is: File "d:\python\python37\lib\site-packages\pandas\io\excel_base.py", line 356, in init UnsupportedOperation: seek |
@Maverick494 It has yet to show up in the documentation, but it's now possible to specify openpyxl as the engine, i.e |
Please can you change the default to be openpyxl? |
git diff upstream/master -u -- "*.py" | flake8 --diff
Builds upon #26233, so that should be merged first.
This PR adds an almost fully compatible OpenPyXL based engine to excel_read, next to the xlrd engine.
Using OpenPyXL will also allow reading and writing in excel tables referenced by name which I intend to put in a future PR, see also discussion of #24862.
As the current xlrd implementation is pretty tightly coupled to the the text based parsing options, there are many supported keywords in excel_read that I think eventually do not belong there. With this PR I left all of them in and support them as much as possible even though this:
The way this is implemented is that the excel file is read into a dataframe early, and all the keywords like
convert
,dtype
,convert_float
etc are dealt with by modifying the dataframe. Because of this some of the parser code needed to be re-implemented, so this is not ideal. An alternative would be to just not support these keywords at all for the openpyxl engine.