Retext on Windows can not save with error "'Charmap' codec can't encode character ..." #599

Closed
szpeter80 opened this issue Nov 1, 2022 · 2 comments

szpeter80 commented Nov 1, 2022

The exact error message is: "'Charmap' codec can't encode character xxxx in position yyy: character maps to <undefined>".

How to reproduce:

  1. Install ReText on Windows via 'pip install retext'.
  2. Open a new document, enter 7-bit clean text, and save it as 'test.md'.
  3. Enter a national character in the Markdown source, then try to save.
  4. The error message pops up, and the output file is truncated to 0 bytes (previous work is lost).

As far as I was able to track this down, the issue is specific to Windows code page handling. The copy of ReText installed into the venv exhibits this behaviour whether it is launched from a cmd shell or by double-clicking retext.exe. The trigger was a single character, for example \u0151 (ő), which is not representable in the code page used by cmd (chcp reports 437; Windows was installed as English).

The issue (step 4) is not observed if the Windows single-byte code page can represent the accented characters that are entered. In this case, on a different machine, the install locale was "Hungarian" and the resulting code page is CP852 (weird, since this is an old MS-DOS code page; Windows used to use cp12xx back then). That code page has a code reserved for "ő", so ReText saves the document successfully (although in CP852 encoding).
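The difference between the two machines can be illustrated outside ReText with Python's built-in codecs. This is only a sketch of the encoding behaviour described above, not ReText's save code; the helper name `try_encode` is made up for the example.

```python
# Illustrative only: why saving fails under CP437 but succeeds under CP852.
text = "ő"  # U+0151, the character from the report


def try_encode(s, codec):
    """Return the encoded bytes, or the error message if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return str(exc)


cp437_result = try_encode(text, "cp437")  # an error message: CP437 has no slot for U+0151
cp852_result = try_encode(text, "cp852")  # a single byte: CP852 reserves a code for it
utf8_result = try_encode(text, "utf-8")   # two bytes: UTF-8 can encode any character

for name, result in [("cp437", cp437_result), ("cp852", cp852_result), ("utf-8", utf8_result)]:
    print(name, "->", result)
```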

Workaround 1: install ReText globally (no venv). For some reason, this behaviour is not observed with a global install. The issue I see with this is that the global packages directory is specific to the app (Python, which installs as an app from the Microsoft Store) and contains the installed Python version in its name, so it might get deleted when the package updates, or might be left behind as leftover (and possibly broken) junk.

Workaround 2: start ReText from a batch file and issue the chcp 65001 command before invoking retext.exe. 65001 is the code page identifier for UTF-8, and this seems to solve the unrepresentable character issue. Beware: if the Markdown source was created earlier, it might be in an ANSI (1-byte) encoding and needs to be checked and converted to UTF-8 (e.g. via Notepad++).
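For the conversion step, a small script can do the same job as a manual re-save in an editor. The following is a minimal sketch; the function name `convert_to_utf8` and the default source encoding `cp1252` are assumptions for illustration, and the real source encoding has to be checked for each file.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8 in place and return the decoded text.

    source_encoding is an assumption: a wrong guess usually produces
    garbled text rather than an error, so inspect the result.
    """
    p = Path(path)
    raw = p.read_bytes()                 # read bytes so nothing is decoded implicitly
    text = raw.decode(source_encoding)   # decode with the assumed legacy encoding
    p.write_bytes(text.encode("utf-8"))  # write back as UTF-8, line endings unchanged
    return text
```

Keeping a backup copy of the file before converting is advisable, because the conversion overwrites the original.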

Member

mitya57 commented Dec 18, 2022

I think we should always use UTF-8 by default. 1-byte regional encodings are so outdated in 2022, and UTF-8 is the default on Linux anyway.

It shouldn't be a problem for existing documents. ReText uses chardet, so existing documents will be opened/saved with whatever encoding they have, provided that it's detected correctly.

What do you think?

Also, I will fix truncating the file to 0 bytes when the current encoding does not support some characters.
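One way to avoid the truncation is to encode the text before the file is opened for writing, so that an encoding failure leaves the existing file untouched. The sketch below illustrates that idea under stated assumptions; it is not the actual ReText change, and `save_text` is a hypothetical helper name.

```python
def save_text(path, text, encoding="utf-8"):
    """Write text to path without truncating the file when encoding fails."""
    # Encode first: if this raises UnicodeEncodeError, the file has not
    # been opened yet, so its previous contents are still intact.
    data = text.encode(encoding)
    # Only now open the file, which truncates it, and write the bytes.
    with open(path, "wb") as f:
        f.write(data)
```

Writing in binary mode means no newline translation takes place, so the caller decides which line endings end up in the file.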

@szpeter80
Author

As far as I can tell, the Windows installer of Python tries to take care of the "chcp 65001" by including it in its wrapper script, but for some reason it is not effective every time. It's not ReText's job to fix a Windows/Python compatibility problem.

If you can fix the file truncation problem, that would prevent users from unknowingly shooting themselves in the foot. Thanks!

mitya57 added a commit that referenced this issue May 28, 2023
Otherwise, if encoding the text fails, the file becomes truncated.

Fixes #599.

(cherry picked from commit e30c785)