-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make DataFrame.to_string output full content by default #28052
Make DataFrame.to_string output full content by default #28052
Conversation
…ot sure if I'll keep it.
starting to feel a bit uncomfortable about it. The max_colwidth is an important feature for legibility in the vast majority of contexts - and one expects the display config setting to work. It is only when invoked at the highest level as to_string() that it should be unlimited. So even though this is a temp commit, I'm about to unwind it I think and try an approach at the top level only.:
… have a quick override at the very top level, and everything else behaves based on that one override.#
…width is 0 instead of a large number. So I set it to a large number (like the html diff) to preserve the justification behavior.
pandas/core/frame.py
Outdated
@@ -707,11 +708,14 @@ def to_string( | |||
max_cols=None, | |||
show_dimensions=False, | |||
decimal=".", | |||
max_colwidth=9999999, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to note - one of the commenters on the issue asks: "With Pandas 0.25.0, setting display.max_colwidth to a large number stops the truncation but when trying to left justify columns with df.to_string(justify='left'), that same display setting somehow pads columns on the left so they are not left aligned. Is there any present way to prevent truncation and get left justified string columns when output to a terminal? I know a pull request is in process but I would like to do this now. Thanks."
I can accomplish this in my testing by setting max_colwidth=0
, which switches the padding to left. It is weird, though, that passing justify="left"
does not justify it correctly. Seems like maybe a separate bug or one that I could look further into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this cause, but I don't think we'll want a workaround like this.
Can you post the output with max_colwidth=0
, and how it compares to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows how the max_colwidth=0 strangely forces a left justification:
>>> print(df.to_string(max_colwidth=0))
A B
0 NaN NaN
1 -1.0000 foo
2 -2.1234 foooo
3 3.0000 fooooo
4 4.0000 bar
Whereas the behavior in this PR preserves the right justification that is the current default:
>>> print(df.to_string(max_colwidth=99999))
A B
0 NaN NaN
1 -1.0000 foo
2 -2.1234 foooo
3 3.0000 fooooo
4 4.0000 bar
And interestingly, passing justify='left'
doesn't have an effect:
>>> print(df.to_string(justify='left'))
A B
0 NaN NaN
1 -1.0000 foo
2 -2.1234 foooo
3 3.0000 fooooo
4 4.0000 bar
This is because of these lines in _make_fixed_width:
def _make_fixed_width(
...
max_len = max(adj.len(x) for x in strings)
...
def just(x):
if conf_max is not None:
if (conf_max > 3) & (adj.len(x) > max_len):
x = x[: max_len - 3] + "..."
return x
strings = [just(x) for x in strings]
result = adj.justify(strings, max_len, mode=justify)
return result
It checks if conf_max > 3
to apply the dot truncation ... so if it's <= 3 then that isn't called. So it's not just 0 but any of 0, 1, 2, or 3 that causes the justification to line up.
I can spend some more time to better understand why this is happening. I agree that we should not rely on some incidentals of the underlying implementation to determine whether to justify the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I figured out the issue. There are two lines where format_array
is called and the justify
parameter is not passed all the way through -- so in some places, the justification is being overridden.
Note that the bug where justification doesn't happen if conf_max < 3 already appears - so I think it can probably be pulled out as a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch all that. I re-read the docs and I see that I misinterpreted the justify
param, as it only refers to the column headers, not the content. In that regard it is behaving correctly. So I think I'll leave the justification question out of this PR.
…the expected values though because the fixed width will be harder to read if the lines are split, and they kind of have to be long to test the truncation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I think we'll need to deprecate the current behavior, rather than just changing the default. Users will need to be explicit about the max_colwidth
for now I think.
pandas/core/frame.py
Outdated
@@ -707,11 +708,14 @@ def to_string( | |||
max_cols=None, | |||
show_dimensions=False, | |||
decimal=".", | |||
max_colwidth=9999999, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this cause, but I don't think we'll want a workaround like this.
Can you post the output with max_colwidth=0
, and how it compares to this?
To make sure I understand, what should happen when the user calls
Sounds like you're suggesting that we continue the current behavior, but with a deprecation warning? Should we only sound the deprecation warning if the data frame contains columns that would have otherwise been truncated? Seems like in most cases, the truncation won't be a difference in behavior and I would hate to make vanilla |
To update: I chose to punt the justify question as I think that's a separate issue that pre-existed and I may not understand the use case that well anyway. I changed the behavior of max_colwidth so that it will use I have not yet added a deprecation warning - happy to do so if you think desired behavior is that the |
That needs to be decided. It seems that in #28052 we decided that to_html not outputting the entire dataframe was a bug. Presumably we would say the same about this, but I'm not sure. |
we should probably maintain consistency. although long html would probably wrap within a cell and that does not apply to to_string. |
I agree- the original issue came from someone calling vanilla to_string and being surprised:
I think no truncation by default is the most intuitive approach, and matches to_html behavior. Someone who is surprised can find the parameter to truncate pretty easily in the docs, and it’s unlikely to be a surprise the same way that truncation might be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's call using the options.display
value a bug then.
This will need a release note in doc/source/whatsnew/v1.0.0.rst
under bug fixes.
…dn't realize that this also allowed None until checking the docs, but it does so it's the perfect validator for our new parameter.
Thanks for the feedback! I added the whatsnew notice, swapped to the correct validator, and did some more manual testing to ensure it worked as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good overall. A few small requests.
Ready for review, thanks! |
Co-Authored-By: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@lshepard merge conflict in the release notes. Could you merge master & repush? |
Merge branch 'master' into issue9784-to-string-truncate-long-strings
Merge branch 'issue9784-to-string-truncate-long-strings' of github.com:lshepard/pandas into issue9784-to-string-truncate-long-strings
Huh, failed with “worker 'gw0' crashed while running 'pandas/tests/test_sorting.py::TestSafeSort::test_labels_out_of_bound[-1]'”. I don’t think that’s related but I’ll take a look. |
Could be a random failure. I'd start by repushing an empty commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks for all the detailed comments on the review! |
@lshepard can you fix the merge conflict on the whatsnew? @TomAugspurger mind taking a look at this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too. Just need to fix the merge conflict.
Nice PR - thanks a lot @lshepard |
Excited to try this out and see it resolved. I hope proper left justification can be resolved soon too. Thanks. |
I modeled this off of #24841. Some alternatives I considered:
Here's an example on a real dataset showing long columns preserved in a text file produced by to_string():
Additional manual testing:
to_string
that doesn't:black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff