Umlauts in ISO-8859-1 not properly displayed #568

Closed
cloudyster opened this issue May 24, 2019 · 9 comments
Labels
documentation, help wanted (Extra attention is needed), question (Further information is requested)

Comments

@cloudyster

when working with files in ISO-8859-1 encoding umlauts are not properly displayed

> file i.xml 
i.xml: ISO-8859 text

> unalias cat
> cat i.xml
Hüsker Dü

> bat --plain --color never  i.xml
H�sker D�

Cheers,
Marc

@eth-p
Collaborator

eth-p commented May 24, 2019

I'm not able to reproduce the issue, unfortunately. I think GitHub or my browser is normalizing the encoding from your snippet above. Would it be possible to provide a small test file?

@cloudyster
Author

Sure, here you are. I had to change the file suffix to .txt because GitHub did not allow .xml.
i.xml.txt

@sharkdp
Owner

sharkdp commented May 28, 2019

Thank you for reporting this. bat only supports UTF-8 and UTF-16. I am currently not planning to add detection ("guessing") for other encodings. You probably know that you can use external tools like iconv to change the encoding:

iconv -f ISO-8859-1 i.xml | bat -lxml
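Where iconv is not available, the same transcoding step can be sketched in Python. This is only an illustration of what the conversion does, not something bat provides; the helper name `latin1_to_utf8` is made up for this example, and the sample bytes are the ones from the test file in this thread.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper, not part of bat)."""
    # Every byte value maps to a character in ISO-8859-1, so this decode cannot fail.
    text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


sample = b"H\xfcsker D\xfc\n"  # the content of i.xml: 0xFC is "ü" in ISO-8859-1
converted = latin1_to_utf8(sample)
print(converted)  # each 0xFC byte becomes the two-byte UTF-8 sequence 0xC3 0xBC
```

The converted bytes are valid UTF-8, which is what bat expects on its input.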

I'm a bit surprised that cat seems to work for you. It doesn't work for me and shows the same "invalid UTF-8" character as bat:

▶ bat --plain --color never i.xml 
H�sker D�

▶ \cat i.xml 
H�sker D�

Is your terminal emulator configured to read ISO-8859? If I pipe the output of bat and cat to hexyl, we can see that the output is identical:

▶ bat --plain --color never i.xml | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 48 fc 73 6b 65 72 20 44 ┊ fc 0a                   │H×sker D┊×_      │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

▶ \cat i.xml | hexyl                     
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 48 fc 73 6b 65 72 20 44 ┊ fc 0a                   │H×sker D┊×_      │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
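To make the hexyl dump concrete, here is a small Python sketch (an illustration only, with a made-up helper name `is_valid_utf8`) that takes the ten bytes shown above and checks them against both encodings. The byte 0xFC can never start a UTF-8 sequence, which is why a UTF-8 terminal cannot render it.

```python
# The ten bytes from the hexyl output above.
data = bytes.fromhex("48 fc 73 6b 65 72 20 44 fc 0a")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 (hypothetical helper for illustration)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


valid = is_valid_utf8(data)
as_latin1 = data.decode("iso-8859-1")

print(valid)            # False: 0xFC is not a legal UTF-8 byte
print(as_latin1, end="")  # the same bytes read as ISO-8859-1 give the umlauts
```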

@sharkdp added the "help wanted" and "question" labels May 28, 2019
@cloudyster
Author

Hello David,
thanks for investigating. This seems to be a terminal/environment/encoding issue where bat behaves fine.
Just for the record: cat works as described above in gnome-terminal with UTF-8 encoding; when switching to ISO-8859-1, umlauts are displayed.
I'm running urxvt (rxvt-unicode-256color) where bat/cat behave as in the initial report.
Cheers,
Marc

@sharkdp
Owner

sharkdp commented May 29, 2019

I'm running urxvt (rxvt-unicode-256color) where bat/cat behave as in the initial report.

Right, I can confirm that. I think the issue is the following: in interactive mode (when the output is not piped somewhere else, like in my example above), bat assumes UTF-8 (or UTF-16) input and will actually output � for each invalid byte, whereas cat passes the raw bytes through unchanged.

Arguably, printing � is a feature and not a bug 😄. Printing arbitrary bytes to the console could potentially mess up your terminal emulator. For a similar reason, bat does not print out binary files (in interactive mode).

In this particular example (where the urxvt terminal emulator somehow manages to interpret this byte sequence correctly) this is obviously not optimal. However, in the general case, I think the current behavior of bat is okay.
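The visible effect can be reproduced with a lossy UTF-8 decode. The sketch below is Python and only mimics the behavior described here; bat itself is written in Rust, and this is not its code.

```python
raw = b"H\xfcsker D\xfc\n"  # the ISO-8859-1 bytes from i.xml

# A lossy decode substitutes U+FFFD (the replacement character) for every
# byte that is not valid UTF-8, instead of passing the raw byte through.
shown = raw.decode("utf-8", errors="replace")

print(shown, end="")
```

Each 0xFC byte is replaced by one U+FFFD character, and the rest of the line is left untouched.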

What do you think?

@sharkdp sharkdp reopened this May 29, 2019
@cloudyster
Author

Hello David,

I agree that the current implementation is ok as far as I can judge the problem. Having a section in Troubleshooting might be a good idea.

I've just had a look at \cat --help - there is this -v/--show-nonprinting option which
somehow corresponds to bat's -A, --show-all, but I like the idea/implementation in cat better.
By the way: -v seems to be missing for the goal

Be a drop-in replacement for (POSIX) cat

Cheers,
Marc

@sharkdp
Owner

sharkdp commented May 31, 2019

I've just had a look at \cat --help - there is this -v/--show-nonprinting option which
somehow corresponds to bat's -A, --show-all

Well, bat's -A/--show-all option corresponds to cat's -A/--show-all option 😄. The -v/--show-nonprinting option of cat does just a part of -A/--show-all.
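For comparison, the ^ and M- notation that cat -v uses can be approximated in a few lines of Python. This is a sketch based on the notation as GNU cat documents it, not a copy of its source; the function name `show_nonprinting` is made up for the example.

```python
def show_nonprinting(raw: bytes) -> str:
    """Approximate `cat -v`: ^ and M- notation, TAB and LF left untouched."""
    out = []
    for b in raw:
        if b in (0x09, 0x0A):      # -v alone does not mark TAB or newline
            out.append(chr(b))
        elif b < 0x20:             # control characters become ^@ .. ^_
            out.append("^" + chr(b + 64))
        elif b < 0x7F:             # printable ASCII is passed through
            out.append(chr(b))
        elif b == 0x7F:            # DEL
            out.append("^?")
        else:                      # high bit set: "M-" plus the 7-bit form
            low = b - 0x80
            if low < 0x20:
                out.append("M-^" + chr(low + 64))
            elif low < 0x7F:
                out.append("M-" + chr(low))
            else:
                out.append("M-^?")
    return "".join(out)


rendered = show_nonprinting(b"H\xfcsker D\xfc\n")
print(rendered, end="")
```

With this notation the byte 0xFC is shown as M-| because 0xFC minus 0x80 is 0x7C, the ASCII code of the vertical bar.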

I like the idea/implementation in cat better.

Could you go into more detail? I spent quite some time designing bat's --show-all feature and would be grateful for feedback.


By the way: -v seems to be missing for the goal

Be a drop-in replacement for (POSIX) cat

Not really (see #134). -v/--show-nonprinting is not part of the POSIX standard.

@sharkdp
Owner

sharkdp commented Aug 15, 2019

@cloudyster Any further feedback on this?

I see the following action points so far:

  • Add a new option to bat to show the file as is.
  • Add a section about encodings to the Troubleshooting section. This should mention iconv and the newly created option for bat from the previous point.
  • Handle non-UTF-8 input in bat -A mode. Instead of printing �, we could print \xFC. This is also tracked in "Add option to display file even if bat thinks it is binary" (#623).
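The last point could look roughly like the following. This is a Python sketch of the idea only, not bat's implementation (bat is written in Rust); the error-handler name `hex_escape_upper` is made up for this example.

```python
import codecs


def hex_escape(err: UnicodeError):
    """Decode error handler: render each undecodable byte as \\xNN (uppercase hex)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which decoding resumes.
    return "".join(f"\\x{b:02X}" for b in bad), err.end


# "hex_escape_upper" is a hypothetical handler name chosen for this sketch.
codecs.register_error("hex_escape_upper", hex_escape)

raw = b"H\xfcsker D\xfc\n"
escaped = raw.decode("utf-8", errors="hex_escape_upper")
print(escaped, end="")
```

Python's built-in "backslashreplace" handler does the same thing with lowercase hex digits (\xfc); the custom handler is only there to match the \xFC spelling used above.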

@sharkdp
Owner

sharkdp commented Oct 11, 2019

Closing this without resolving the first point. I'm not convinced we need this.
