Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output excel table objects with to_xlsx() #24862

Closed
tdamsma opened this issue Jan 21, 2019 · 16 comments
Closed

Output excel table objects with to_xlsx() #24862

tdamsma opened this issue Jan 21, 2019 · 16 comments
Labels
Enhancement IO Excel read_excel, to_excel

Comments

@tdamsma
Copy link
Contributor

tdamsma commented Jan 21, 2019

Currently pandas can quickly write a dataframe to an excel sheet. However the output is a plain workbook with a bunch of values, and not the much more powerful excel table object. Excel table objects are very useful because they allow referencing columns/cells by header name instead of $A$23, better filtering, sorting, formatting, pivoting, plotting etc. Of course this can be achieved by selecting the cells output by Pandas and use the Format as Table functionality, but why not support this out of the box?

With XlsxWriter this functionality is fully supported , so it should not be too hard to implement. I would be willing to make a PR if there is any interest, though I might need some guidance as I am not familiar with the Pandas code base.

@tdamsma tdamsma changed the title Output formatted excel table with to_xlsx() Output excel table objects with to_xlsx() Jan 21, 2019
@chris-b1
Copy link
Contributor

I think this would be a reasonable enhancement - probably via some kind of parameter to_excel(..., format='cells|table') ?

I'd start around here for the existing Excel logic. You can ping on here with questions or also feel free to put up a WIP PR.

def write(self, writer, sheet_name='Sheet1', startrow=0,

There are probably some annoying corner cases with MultiIndex that don't fit into the Table format - can probably raise on those.

@chris-b1 chris-b1 added the IO Excel read_excel, to_excel label Jan 21, 2019
@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 21, 2019

I am trying to bolt on this functionality as per @chris-b1 suggestion with an extra keyword to the already very heave to_excel(... function. Problem is that I can't just apply some extra formatting somewhere deep down in the writer function, as the table constructor needs to know if the header should be included, and if so what the column names are etc. So there is a bit of a mismatch with the current implementation. Also, a lot of the functionality to merge cells (and perhaps even to apply custom formatting on a per cell basis) would be a bit superfluous. ow about a separate method to_excel_table(... which could have a separate set of options from the to_excel function?

For an idea of the options we could support, here is a snippet for OpenPyXL (full example here:

# add column headings. NB. these must be strings
ws.append(["Fruit", "2011", "2012", "2013", "2014"])
for row in data:
    ws.append(row)

tab = Table(displayName="Table1", ref="A1:E5")

# Add a default style with striped rows and banded columns
style = TableStyleInfo(name="TableStyleMedium9", showFirstColumn=False,
                       showLastColumn=False, showRowStripes=True, showColumnStripes=True)

And from XlsxWriter

worksheet.add_table('B3:F7', {'data': data,
                              'style': 'Table Style Light 11'})

XlsxWriter supports the following keywords:

autofilter
header_row
banded_columns
banded_rows
first_column
last_column
style
total_row
columns
name

@WillAyd
Copy link
Member

WillAyd commented Jan 21, 2019

Without having looked to deeply it feels like toggling this via a keyword argument would still be the better way to go, as a separate function would I think end up duplicating a lot of functionality. With that said, if you see a better way of going about it always open to ideas - probably just best to push a PR and have it reviewed in that case

@WillAyd WillAyd added this to the Contributions Welcome milestone Jan 21, 2019
@jreback
Copy link
Contributor

jreback commented Jan 21, 2019

going to be -1 on a to_excel_table

this certainly should be done via a keyword

@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 22, 2019

Clear, I'll give it a go and see where it leads

tdamsma added a commit to tdamsma/pandas that referenced this issue Jan 24, 2019
@tdamsma tdamsma mentioned this issue Jan 24, 2019
4 tasks
@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 24, 2019

@chris-b1, I put up the WIP PR.

For discussion, which part of the current to_excel api should be supported? This is my suggestion

  • excel_writer ✔️
  • sheet_name ✔️
  • na_rep ✔️
  • float_format ✔️
  • columns ✔️
  • header ✔️
  • index ✔️(Every column needs a name though, if the header is included. I currently implemented defaults similar to what Excel does: Column0, Column1)
  • index_label ✔️
  • startrow ✔️ (Though it I think it is bad practice to put more than one table on a sheet, staring at cell A1, and would not mind discouraging any other use)
  • startcol ✔️
  • engine Only OpenPyXL and XlsxWriter
  • merge_cells Can not be supported
  • encoding ✔️️
  • inf_rep ✔️
  • verbose ✔️
  • freeze_panes Am not sure if the really makes sense, as the header rows is automatically semi frozen (the column names A, B etc are replaced with the header names) when a table is used and the header is scrolled out of the window.

Proposed:

  • as_table I like this better than a format keyword
  • table_name Think this would be desirable to add

And then there are many other other options that could be supported. think it would be better to leave that out though:

  • autofilter
  • banded_columns
  • banded_rows
  • first_column
  • last_column
  • style
  • total_row

tdamsma added a commit to tdamsma/pandas that referenced this issue Jan 27, 2019
@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 30, 2019

@WillAyd, @jreback, @chris-b1 To effectively test the proposed as_table functionality, there needs to be a function the reads an excel table back into a dataframe, see also discussion of #24899. I came up with the following, which can read all tables in a workbook (regardless of which workbook they are in) into separate dataframes.

from openpyxl import load_workbook
from pandas.core.dtypes.common import is_list_like
from pandas.core.frame import DataFrame
def read_excel_tables(io, table_name=None, index_col='auto'):
    """Read an Excel table into a pandas dataframe.
    
    Only supports xlsx files.
    
    Parameters
    ----------
    io : str, file descriptor or pathlib.Path
    table_name : str or None, default None
        Strings are used for table names. Specify None to get all 
        tables in a dict of dataframes. 
    index_col: int, list of int, None or 'auto', Default 'auto'
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``.
        'auto' will determine if there is an index column from the table
        `First Column` option in Excel
    """
    def get_tables(io):
        # unfortunately tables are only parsed in the slower write mode
        wbk = load_workbook(io, data_only=True, read_only=False)
        tables = {}
        for wks in wbk:
            for t in wks._tables:
                tables[t.name] = dict(table=t, wks=wks)
        return tables

    def read_table(table, wks, index_col):
        columns = [col.name for col in table.tableColumns]
        data_rows = wks[table.ref][
            (table.headerRowCount or 0):
            -table.totalsRowCount if table.totalsRowCount is not None else None]
        data = [[cell.value for cell in row] for row in data_rows]
        frame = DataFrame(data, columns=columns, index=None)
        if index_col:
            if index_col == 'auto':
                if table.tableStyleInfo.showFirstColumn:
                    frame.set_index(columns[0])
            elif is_list_like(index_col):
                frame = frame.set_index([columns[i] for i in index_col])
            else:
                frame = frame.set_index(columns[index_col])
        return frame

    tables = get_tables(io)
    if table_name is not None:
        return read_table(**tables[table_name], index_col=index_col)
    else:
        return {k: read_table(**v, index_col=index_col) for k, v in tables.items()}

I think it would make perfect sense to also include this in pandas, however not really sure where to place that code. There is currently only an xlrd implementation for and excel reader, but xlrd does not support reading tables.

So then I have the following questions:

  • Would it make sense to implement a read table functionality first before a write table option so it is easier to test?
  • Should that be a separate PR?
  • Would an OpenPyXL reader need to support the full (and extensive) read_excel api?
  • Should reading an excel table be a keyword to read_excel? As an Excel table referenced by name already implies a sheet name, if there is a header, column names, index etc.

@WillAyd
Copy link
Member

WillAyd commented Jan 31, 2019

  • Would it make sense to implement a read table functionality first before a write table option so it is easier to test?

Makes sense to me!

  • Should that be a separate PR?

Yep

  • Would an OpenPyXL reader need to support the full (and extensive) read_excel api?

I would say yes; I don't think it makes sense to implement an engine for a small subset of functionality as it just makes our API more confusing. Note that there has been a decent amount of refactoring going on to better support community engagement on this. There's also an open issue #11499 for this.

I would think if anything this is the most logical starting point to just get the reader up and working. From there you could add table support (I think better as a keyword argument in read_excel instead of a separate function) and from there could start working on writing to get the entire roundtripping. Again just my $.02 and for sure easier said than done, but I would welcome any PRs towards that effort

  • Should reading an excel table be a keyword to read_excel? As an Excel table referenced by name already implies a sheet name, if there is a header, column names, index etc.

Yea I think a keyword argument makes the most sense as it keeps the API simplest and you could leverage the existing functionality of other applicable parameters

@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 31, 2019

Didn't know that supporting more readers was an ongoing development. Are you still working on that? Also, just out of curiosity, what is the rationale for supporting multiple excel libraries?

@WillAyd
Copy link
Member

WillAyd commented Jan 31, 2019

I haven't actually put any code into it, just be reorganizing things in hopes of better community engagement.

As far as supporting multiple libraries goes the libraries support different file types. Openpyxl I believe is only .xlsx format, xlrd can handle .xls in addition to .xlsx. Off the top of my head I don't think either support .xlsb and not sure about .xlsm, so having multiple engines gives flexibility to seamlessly deal with different file types amongst potential other optimizations available to the end user

@tdamsma
Copy link
Contributor Author

tdamsma commented Feb 1, 2019

I'll give it a go then to make an OpenPyXL reader. I see there are pretty extensive tests for read_excel and I can just add an extra engine param, so hopefully that should be pretty straightforward.

tdamsma added a commit to tdamsma/pandas that referenced this issue Jul 9, 2019
# Conflicts:
#	pandas/core/generic.py
#	pandas/io/excel.py
#	pandas/io/formats/excel.py
#	pandas/tests/io/test_excel.py
@WillAyd
Copy link
Member

WillAyd commented Jul 9, 2019

I'm wondering if just having a table= keyword in to_excel would work here? The advantage of that (at least in the openpyxl world) is that we could let third party libraries take care of table formatting (like via TableStyleInfo) and get rid of the ExcelFormatter stuff we have somewhat half-baked into pandas

cc @Themanwithoutaplan for thoughts

@Themanwithoutaplan
Copy link
Contributor

I'm wondering if just having a table= keyword in to_excel would work here? The advantage of that (at least in the openpyxl world) is that we could let third party libraries take care of table formatting (like via TableStyleInfo) and get rid of the ExcelFormatter stuff we have somewhat half-baked into pandas

cc @Themanwithoutaplan for thoughts
Would it need anything else from openpyxl? In general I find the signatures of these handy-but-kitchen-sink functions already very extensive and, as long as you're not in write-only mode, the workbook can be changed subsequently by the user, which I think is preferable.

@tdamsma
Copy link
Contributor Author

tdamsma commented Jul 9, 2019

I just merged the master to get a feel again for where I left off. Seems to work, but still a lot of loose ends to be discussed. @WillAyd, perhaps you could reopen #24899? Or maybe keep the discussion here for the moment. In the meantime I'll try to see if I can get a table keyword in the read_excel function. Should reading and writing tables be the same pr?

@tdamsma
Copy link
Contributor Author

tdamsma commented Oct 19, 2021

I bit the bullet and created a dedicated package to read and write properly formatted excel tables with pandas: https://github.com/VanOord/pandas-xlsx-tables

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Copy link
Member

Appears there hasn't been much activity or community support for this feature in a while so closing. Happy to reopen if there's renewed support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants