Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PrettyTables.jl as an alternative backend for display in DataFrames.jl #2337

Closed
bkamins opened this issue Jul 28, 2020 · 38 comments
Closed

Comments

@bkamins
Copy link
Member

bkamins commented Jul 28, 2020

It would be nice to add PrettyTables.jl as an alternative display backend for DataFrames.jl to the one we currently have (probably not to replace it in the short term, as it would be too hard to ensure a smooth transition).

If we want to do it the steps would be the following:

  1. Test package load time impact
  2. choose appropriate defaults for printing
  3. prepare backend switching mechanics (in particular mapping kwargs where appropriate)
  4. ensure all types that currently use custom printing (datata frame, DataFrameRow, GroupedDataFrame, DataFrameRows, DataFrameColumns are correctly handled)
  5. decide if we re-export PrettyTables.jl from DataFrames.jl

CC @ronisbr

@bkamins bkamins added this to the 1.x milestone Jul 28, 2020
@ronisbr
Copy link
Member

ronisbr commented Aug 1, 2020

Just to report the progress, I am trying to, at least, print the common structures just like DataFrames is currently printing. Right now, I am having a problem with undefined references when accessing DataFrames using Tables.row. I can easily solve this if I assume that the table is a DataFrame. However, I want to make something more general that works for any Tables.jl. I asked for help here:

https://discourse.julialang.org/t/help-to-check-if-an-element-in-a-tables-jl-is-defined/44110

After that, I think I just need to add option to adjust the vertical cropping. AFAIK, this would be the last feature I really need to implement. The rest will be only configurations and tweaks using existing interface (I hope!).

@bkamins
Copy link
Member Author

bkamins commented Aug 1, 2020

I have commented on the issue there and for completeness I repost here:

@quinnj - I think for Tables.row result we should add a requirement that isassigned method should be defined for it.

Also in Tables.jl the Base.show(io::IO, x::T) where {T <: RorC} method has show(io, NamedTuple(x)) line which throws an error on #undef.

@pdeffebach
Copy link
Contributor

I don't really see why data frames needs to implement printing at all. Unlike joins etc. where the storage of the columns matters, i think all of this could be relegated to a Tables.jl-type package.

Would be great to excise this huge chunk of code!

@bkamins
Copy link
Member Author

bkamins commented Aug 2, 2020

This is the ultimate objective and this package is PrettyTables.jl. The only problem is that currently DataFrames.jl has some functionality that PrettyTables.jl has to catch up with (but it seems to be very close towards this goal).

Then we will have a choice:

  1. allow switching of backends (safe approach)
  2. drop legacy printing (radical approach)

I would prefer to do 2, but there is a risk that user's code depends on how printing is done currently and it would be very breaking.

So probably the way we will approach it will be in three stages:

  1. add PrettyTables.jl as a secondary backend (experimental)
  2. switch to make PrettyTables.jl a default backend
  3. remove printing code from DataFrames.jl to LegacyDataFramesShow.jl or something like this and allow to optionally load this backend

Anyway - we will have time to discuss this when @ronisbr tells us that he is ready for the stress tests of PrettyTables.jl against:

  show([io::IO,] df::AbstractDataFrame;
       allrows::Bool = !get(io, :limit, false),
       allcols::Bool = !get(io, :limit, false),
       allgroups::Bool = !get(io, :limit, false),
       splitcols::Bool = get(io, :limit, false),
       rowlabel::Symbol = :Row,
       summary::Bool = true,
       eltypes::Bool = true)

@ronisbr
Copy link
Member

ronisbr commented Aug 3, 2020

So probably the way we will approach it will be in three stages:

  1. add PrettyTables.jl as a secondary backend (experimental)
  2. switch to make PrettyTables.jl a default backend
  3. remove printing code from DataFrames.jl to LegacyDataFramesShow.jl or something like this and allow to optionally load this backend

If I can provide an opinion, this is the perfect approach for me! I have been studying the printing system of DataFrames and it is very complex. There are a lot of cases that we can break things easily in the first attempt to move to PrettyTables.jl :)

Anyway - we will have time to discuss this when @ronisbr tells us that he is ready for the stress tests of PrettyTables.jl against:

I am getting there! I found that problem with #undef and some minor things. I just need to reach a point that we need no other feature inside PrettyTables.jl, but a correct configuration for it.

@pdeffebach
Copy link
Contributor

Thanks for the update to make prettytables performant with tables! Thats a lot of progress right there as well.

@bkamins
Copy link
Member Author

bkamins commented Aug 3, 2020

@ronisbr - thank you for all your work on this! I will open an issue for #undef in Tables.jl

@ronisbr
Copy link
Member

ronisbr commented Aug 30, 2020

Ok! Now we can start to do some tests and I need help to tweak everything. I propose to start with the Text backend before trying to make HTML and LaTeX work. Using master branch of PrettyTables.jl and this code snippet:

function h1_f(data,i,j)
    try
        return ismissing(data[i,j]) ||
            data[i,j] == nothing ||
            typeof(data[i,j]) <: Union{AbstractDataFrame, GroupedDataFrame,
                                       DataFrames.DataFrameRow,
                                       DataFrames.DataFrameRows,
                                       DataFrames.DataFrameColumns}
    catch e
        if isa(e, UndefRefError)
            return true
        else
            rethrow(e)
        end
    end
end

h1 = Highlighter(h1_f, crayon"dark_gray")

function f1(v,i,j)
    if typeof(v) <: Union{AbstractDataFrame, GroupedDataFrame,
                          DataFrames.DataFrameRow, DataFrames.DataFrameRows,
                          DataFrames.DataFrameColumns}
        str = sprint(print, v, context = :compact => true)
        str = split(str, '\n')[1]
        return str
    elseif ismissing(v)
        return "missing"
    elseif v == nothing
        return ""
    else
        return v
    end
end

function print_dataframe(df)
    sch = Tables.schema(df)

    names = reshape( [sch.names...], (1,:) )
    types = DataFrames.compacttype.(reshape( [sch.types...], (1,:) ))

    pretty_table(df,
                 vcat(names,types),
                 alignment = :l,
                 formatters = (f1,),
                 highlighters = (h1,),
                 maximum_columns_width = 30,
                 row_number_alignment = :l,
                 show_row_number = true,
                 tf = dataframe)
end

I am managed to get a lot of printing like the default output of DataFrames (I am already limiting the column width size to 30). For example:

julia> df = DataFrame(A=Int64.(1:9), B = Vector{Any}(undef, 9));

julia> df.B[1:8] = [df, # DataFrame
                        df[1,:], # DataFrameRow
                        view(df,1:1, :), # SubDataFrame
                        eachrow(df), # DataFrameColumns
                        eachcol(df), # DataFrameRows
                        groupby(df, :A),missing,nothing] # GroupedDataFrame;

Captura de Tela 2020-08-30 às 15 27 02

julia> df = DataFrame(A = Int64[1:4;], B = ["x\"", "∀ε>0: x+ε>x", "z\$", "A\nC"],
                         C = Float32[1.0, 2.0, 3.0, 4.0]);

Captura de Tela 2020-08-30 às 15 28 05

Hence, I think this is a good baseline to start. Now, we need to tweak the edge cases and, for this, I need help because I do not know what they are :D

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2020

OK - I will try to list the cases.

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2020

I think apart from what you covered this would be good to check:

DataFrame(x=[pi,
big(pi),
Rational(float(pi)),
Rational(big(pi)),
:pi,
Symbol(1),
Symbol(""),
Symbol("a\bx"),
"a\bx",
"̂", # note this is " then \hat then <TAB> then "
String([0x7c,0xf8, 0x82, 0x86, 0xe1, 0xf4, 0xc4, 0xa6, 0xd0, 0x40]),
'😄'
])

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2020

  • please test Markdown values (this is not merged yet but as you know will be added soon)

@ronisbr
Copy link
Member

ronisbr commented Aug 30, 2020

Good! For Markdown, it will not work now. We need to wait @NicholasWMRitchie update his PR because some conflicts.

@ronisbr
Copy link
Member

ronisbr commented Aug 30, 2020

@bkamins for your last example, I am getting this with DataFrames:

Captura de Tela 2020-08-30 às 19 13 07

Is this expected?

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2020

Yes, and this is incorrect - that is why I have added this example 😄 - this will be fixed when we add a limit of 32 characters to a column width (pending as you know as we are waiting for Markdown PR to finalize).

If you comment out Rational(big(pi)), then it will show correctly (but please also check how the original example renders in PrettyTables.jl).

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2020

If you had a wider terminal you would see:

12×1 DataFrame
│ Row │ x                                                                                                                                                            │
│     │ Any                                                                                                                                                          │
├─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ π                                                                                                                                                            │
│ 2   │ 3.14159                                                                                                                                                      │
│ 3   │ 884279719003555//281474976710656                                                                                                                             │
│ 4   │ 45471447111470790535029367847216232831674172166049053744846518889742361808273//14474011154664524427946373126085988481658748083205070504932198000989141204992 │
│ 5   │ pi                                                                                                                                                           │
│ 6   │ 1                                                                                                                                                            │
│ 7   │                                                                                                                                                              │
│ 8   │ a\bx                                                                                                                                                         │
│ 9   │ a\bx                                                                                                                                                         │
│ 10  │ ̂                                                                                                                                                             │
│ 11  │ |\xf8\x82\x86\xe1\xf4Ħ\xd0@                                                                                                                                  │
│ 12  │ '😄'                                                                                                                                                         │

@ronisbr
Copy link
Member

ronisbr commented Aug 30, 2020

Ah, thanks for the info! This is what I get from the code above using PrettyTables.jl.

Captura de Tela 2020-08-30 às 19 25 02

@ronisbr
Copy link
Member

ronisbr commented Aug 30, 2020

hum, I see that the Chars are printed with quotes in DataFrames. This is a different behavior form print, is it the way you want?

@bkamins
Copy link
Member Author

bkamins commented Aug 31, 2020

I think the reason is that DataFrames.jl used show not print as a reference:

julia> show('a')
'a'
julia> show("a")
"a"

and the only thing that was changed is removing " marks for strings and : in front for Symbols.
But In general - I think that this is minor and we can differ here.

Also - just to make sure nothing is broken you might want to check df = DataFrame(x = ["v1", r"1"]).

Can you also show how you display the different element types in the header (and is there a possibility to opt-out of it)?

@ronisbr
Copy link
Member

ronisbr commented Aug 31, 2020

I think the reason is that DataFrames.jl used show not print as a reference:

julia> show('a')
'a'
julia> show("a")
"a"

and the only thing that was changed is removing " marks for strings and : in front for Symbols.
But In general - I think that this is minor and we can differ here.

OK!

Also - just to make sure nothing is broken you might want to check df = DataFrame(x = ["v1", r"1"]).

I think it is working fine:

Captura de Tela 2020-08-31 às 16 08 40

Can you also show how you display the different element types in the header (and is there a possibility to opt-out of it)?

I am using exactly the same function of DataFrames (DataFrames.compacttype). It is very easy to add a keyword to hide the subheaders:

function print_dataframe(df; eltypes::Bool = true)
    sch = Tables.schema(df)

    names = reshape( [sch.names...], (1,:) )
    types = DataFrames.compacttype.(reshape( [sch.types...], (1,:) ))

    pretty_table(df,
                 vcat(names,types),
                 alignment = :l,
                 formatters = (f1,),
                 highlighters = (h1,),
                 maximum_columns_width = 30,
                 nosubheader = !eltypes,
                 row_number_alignment = :l,
                 show_row_number = true,
                 tf = dataframe)
end

Captura de Tela 2020-08-31 às 16 12 20

Now I will check the filters.

@bkamins
Copy link
Member Author

bkamins commented Aug 31, 2020

OK - I meant v"1" not "v1" sorry for mistyping (but I assume it is OK also). Thank you!

@ronisbr
Copy link
Member

ronisbr commented Aug 31, 2020

OK - I meant v"1" not "v1" sorry for mistyping (but I assume it is OK also). Thank you!

NP! In this case, there is a little difference, because print(v"1") is 1.0.0 instead of v"1.0.0".

Captura de Tela 2020-08-31 às 17 33 26

@bkamins
Copy link
Member Author

bkamins commented Sep 1, 2020

Just for my reference - is there some reason why you use print not show? (not saying which should be preferred - I just want to understand)

@bkamins
Copy link
Member Author

bkamins commented Sep 1, 2020

@ronisbr - I think we can use PrettyTables.jl for text/plain for now. LaTeX and HTML are a less of priority as I think what we have now for them in DataFrames.jl is pretty OK.

@ronisbr
Copy link
Member

ronisbr commented Sep 2, 2020

@ronisbr - I think we can use PrettyTables.jl for text/plain for now. LaTeX and HTML are a less of priority as I think what we have now for them in DataFrames.jl is pretty OK.

Good!

I am doing some improvements in Markdown, since it will be available in DataFrames soon. One major accomplished I had today was the capability to render Markdown using the Text backend inside a cell. Thus, we can do things like this:

julia> a = md"""
       # Markdown

       This is a **Markdown** example.

       !!! note

           This is a note.

       This is a URL: [PrettyTables.jl](https://github.com/ronisbr/PrettyTables.jl)
       """;

julia> b = md"""
       This _**is a bold text that will wrap inside a cell**_
       """;

julia> data = [1 a; 2 b];

julia> pretty_table(data, linebreaks = true, hlines = :all, columns_width = [-1,40])

Captura de Tela 2020-09-01 às 22 49 36

I just need to test those things because the logic was somewhat complicated.

@bkamins
Copy link
Member Author

bkamins commented Sep 2, 2020

Fantastic - Markdown (simplified) is already available on DataFrames.jl master, but we do not have to be 100% consistent here with the display, as your approach seems better.

@ronisbr
Copy link
Member

ronisbr commented Sep 9, 2020

Just for my reference - is there some reason why you use print not show? (not saying which should be preferred - I just want to understand)

Sorry! I completely missed this question. Well, when I was creating PrettyTables, it seems better to use print output than show, mainly because of strings since it does not add quotes.

I think we are ready to do some tests! Is it possible to verify the previous code against master of PrettyTables.jl to see if we find rough edges? I would like to correct those before tagging a new version.

@bkamins
Copy link
Member Author

bkamins commented Sep 10, 2020

OK - I will check it. Just please let me know how I should preferably change settings.
Now, I was thinking how to integrate PrettyTables.jl into DataFrames.jl and the general idea is the following:

  1. have a constant that identifies the backend; have functions like setdisplay_traditional and setdisplay_prettytables that switch backends; currently only for text/plain (I am not saying not to support other backends but let us concentrate on a single one )
  2. now have a constant that holds PrettyTables.jl settings with reasonable defaults and then when you call setdisplay_prettytables you can pass arguments to change these defasetdisplay_prettytablesult settings to something else. These are global settings. Also PrettyTables.jl could be conditionally loaded only if setdisplay_prettytables is called (to reduce DataFrames.jl startup time).
  3. finally when the user calls show with some kwargs then they should be translated to PrettyTables.jl settings for this single call (of course if there is a reasonable mapping)
  4. we should add a separate section in the manual describing traditional and PrettyTables.jl settings for DataFrames.jl display

If you agree with this plant then the simplest would be if you did a simple PR to DataFrames.jl with what I write implemented (of course without spending too much time on polishing it - e.g. no documentation updates are needed, or not all kwargs of show need to be mapped) then in this PR add PrettyTables#master as a dependency and all will just work. Then I will be able to test everything the way you imagine things should be used. Of course if this is too complex - just please let me know with what settings I should call PrettyTables.jl.

In particular - one of the major things that we wanted to change in DataFrames.jl and should be easy in PrettyTables.jl by default is to remove the column separator lines in display as they occupy to much horizontal space. This would be a huge win.

@ronisbr
Copy link
Member

ronisbr commented Sep 10, 2020

OK - I will check it. Just please let me know how I should preferably change settings.
Now, I was thinking how to integrate PrettyTables.jl into DataFrames.jl and the general idea is the following:

  1. have a constant that identifies the backend; have functions like setdisplay_traditional and setdisplay_prettytables that switch backends; currently only for text/plain (I am not saying not to support other backends but let us concentrate on a single one )

Ok!

  1. now have a constant that holds PrettyTables.jl settings with reasonable defaults and then when you call setdisplay_prettytables you can pass arguments to change these defasetdisplay_prettytablesult settings to something else. These are global settings. Also PrettyTables.jl could be conditionally loaded only if setdisplay_prettytables is called (to reduce DataFrames.jl startup time).

Ok!

  1. finally when the user calls show with some kwargs then they should be translated to PrettyTables.jl settings for this single call (of course if there is a reasonable mapping)

Ok!

  1. we should add a separate section in the manual describing traditional and PrettyTables.jl settings for DataFrames.jl display

Ok, but I will put this on hold. When everything is decided and good, then I write this section of the manual.

If you agree with this plant then the simplest would be if you did a simple PR to DataFrames.jl with what I write implemented (of course without spending too much time on polishing it - e.g. no documentation updates are needed, or not all kwargs of show need to be mapped) then in this PR add PrettyTables#master as a dependency and all will just work. Then I will be able to test everything the way you imagine things should be used. Of course if this is too complex - just please let me know with what settings I should call PrettyTables.jl.

Nice! I will create this PR with those functionalities you described. Indeed, it will be easier to test.

In particular - one of the major things that we wanted to change in DataFrames.jl and should be easy in PrettyTables.jl by default is to remove the column separator lines in display as they occupy to much horizontal space. This would be a huge win.

You mean, something like this:

Captura de Tela 2020-09-10 às 09 55 42

or like this:

Captura de Tela 2020-09-10 às 09 56 06

@bkamins
Copy link
Member Author

bkamins commented Sep 10, 2020

When everything is decided and good, then I write this section of the manual.

That is my point - to have something minimal now, so that we can test it and make it production ready later

Regarding output I imagined vlines = :none as the default. @nalimilan - do you agree?

Also in DataFrames.jl we show row numbers by default to the left of the data in a data frame - this is something to be discussed if we want it or not (there were mixed opinions on this). I am OK to drop it.

@ronisbr
Copy link
Member

ronisbr commented Sep 10, 2020

When everything is decided and good, then I write this section of the manual.

That is my point - to have something minimal now, so that we can test it and make it production ready later

Good!

Regarding output I imagined vlines = :none as the default. @nalimilan - do you agree?

Also in DataFrames.jl we show row numbers by default to the left of the data in a data frame - this is something to be discussed if we want it or not (there were mixed opinions on this). I am OK to drop it.

I, personally, prefer to display the row numbers. It can be switched by show_row_number = true. Maybe, in this case, we would want to add a vertical line between this column and the table data:

Captura de Tela 2020-09-10 às 10 24 08

@pdeffebach
Copy link
Contributor

+1 on keeping row numbers. It helps when you are sharing a screen with someone or explaining a table.

@bkamins
Copy link
Member Author

bkamins commented Sep 10, 2020

Maybe, in this case, we would want to add a vertical line between this column and the table data:

Yes - it would be good to have a vline there

@bkamins
Copy link
Member Author

bkamins commented Sep 10, 2020

@ronisbr - there is one crucial thing we have just discussed with @nalimilan. Actually if the tests go well, we think that dropping the legacy output that DataFrames.jl uses now could be even considered.

The crucial thing is that as DataFrames.jl is a dependency in dozens of packages then PrettyTables.jl will become such a dependency. In general of course it is good as this is what I guess you want 😄, but it also means that the moment we make this dependency we will pin some version of PrettyTables.jl in [compat] and if you later release breaking versions of PrettyTables.jl then this will have to be separately discussed if this breaking upgrade will be allowed in DataFrames.jl.

In short the question is - how close do you feel you are to 1.0 release of PrettyTables.jl with some guarantees of not having breaking changes often.

@ronisbr
Copy link
Member

ronisbr commented Sep 10, 2020

@ronisbr - there is one crucial thing we have just discussed with @nalimilan. Actually if the tests go well, we think that dropping the legacy output that DataFrames.jl uses now could be even considered.

The crucial thing is that as DataFrames.jl is a dependency in dozens of packages then PrettyTables.jl will become such a dependency. In general of course it is good as this is what I guess you want 😄, but it also means that the moment we make this dependency we will pin some version of PrettyTables.jl in [compat] and if you later release breaking versions of PrettyTables.jl then this will have to be separately discussed if this breaking upgrade will be allowed in DataFrames.jl.

In short the question is - how close do you feel you are to 1.0 release of PrettyTables.jl with some guarantees of not having breaking changes often.

This is actually one thing I was about to tell you. For the text backend, I think I am very, very happy with the API right now. I just need to add the last features to make it fully compatible with the option in DataFrames. That's why I need this test from DataFrames, to see if I missed something.

If you don't mind to make the transition in parts, text backend first, and after HTML and LaTeX, then I think I can tag v1.0 as soon as we know that everything we need is implemented. Inside PrettyTables, I will mark LaTeX and HTML backends as beta with the goal to stabilize them in v2.0.

In the past, I took sometime and released a very breaking version to make all the API more or less uniform. However, if you think that there is some kind of improvement I need to do before 1.0, please, let me know!

@bkamins
Copy link
Member Author

bkamins commented Sep 10, 2020

This is what we have hoped 😄. So let us run the tests now based on your PR and we will synchronize the releases of PrettyTables.jl and DataFrames.jl (and keep HTML/LaTeX for later)

@ronisbr
Copy link
Member

ronisbr commented Sep 10, 2020

Perfect! I will submit this PR very soon.

@bkamins
Copy link
Member Author

bkamins commented Sep 15, 2020

@ronisbr - I have marked all open issues in DataFrames.jl that relate to display topics with "display" label so that you can easily filter out what has been discussed in the past.

@bkamins
Copy link
Member Author

bkamins commented Nov 8, 2020

Done!

@bkamins bkamins closed this as completed Nov 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants