Retext on Windows can not save with error "'Charmap' codec can't encode character ..." #599

szpeter80 · 2022-11-01T17:52:19Z

The exact error message is: "'Charmap' codec can't encode character xxxx in position yyy: character maps to <undefined>".

How to reproduce?

1. Install Retext on Windows via 'pip install retext'.
2. Open a new document, enter 7-bit clean text, and save it as 'test.md'.
3. Enter a national character in the markdown source, then try to save.
4. The error message pops up, and the output file is truncated to 0 bytes (previous work is lost).

As far as I was able to track this down, the issue is specific to Windows' code page handling. The copy of Retext installed into the venv exhibits this behaviour whether it is launched from a cmd shell or by double-clicking retext.exe. The trigger was a single character, for example \u0151 (ő), which is not representable in the code page used by cmd (chcp reports 437; Windows was installed as English).

The issue (step 4) is not observed if the Windows single-byte code page can represent the accented characters that are entered. On a different machine, where the install locale was "Hungarian", the resulting code page is CP852 (weird, since this is an old MS-DOS code page and Windows used to use cp12xx back then). That code page has a code assigned to "ő", so Retext saves the document successfully (although in CP852 encoding).
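The difference between the two machines can be reproduced outside Retext with Python's built-in codecs. This is only a sketch of the encoding behaviour described above; the helper name `can_encode` is made up for illustration, and it does not show how Retext itself picks the encoding.

```python
# Check which code pages can represent "ő" (U+0151).
# `can_encode` is a hypothetical helper, not part of Retext.
text = "\u0151"

def can_encode(s, codec):
    """Return True if every character of `s` exists in `codec`."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

results = {
    codec: can_encode(text, codec)
    for codec in ("cp437", "cp852", "cp1250", "utf-8")
}

# Capture the message Python produces for the failing code page.
try:
    text.encode("cp437")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)

print(results)
print(message)
```

CP437 has no slot for "ő", so encoding fails with the "character maps to <undefined>" message, while CP852, CP1250 and UTF-8 can all encode it.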

Workaround 1: install Retext globally (no venv). For some reason, this behaviour is not observed with a global install. The problem I see with this is that the install directory for global packages is specific to the Python app (Python installs as an app from the Microsoft Store) and its name contains the installed Python version, so it might get deleted when the package updates or be left behind as (possibly broken) junk.

Workaround 2: start Retext from a batch file and run the chcp 65001 command before invoking retext.exe. 65001 is the code page number for UTF-8, and this seems to solve the unrepresentable character issue. Beware: if the markdown source was created earlier, it might be in an ANSI (single-byte) encoding and needs to be checked and converted to UTF-8 (e.g. via Notepad++).
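As an alternative to Notepad++, the conversion mentioned in workaround 2 could be scripted. The sketch below assumes you already know the file's current encoding (CP852 in the demo, matching the Hungarian machine above); `convert_to_utf8` is a hypothetical helper name, and the demo file is a throwaway temporary file.

```python
import os
import tempfile
from pathlib import Path

def convert_to_utf8(path, source_encoding):
    """Re-encode the file at `path` from `source_encoding` to UTF-8 in place."""
    raw = Path(path).read_bytes()
    # Decode first: if this raises, the file has not been modified yet.
    text = raw.decode(source_encoding)
    Path(path).write_bytes(text.encode("utf-8"))

# Demo on a temporary file that was "saved" in CP852.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "test.md")
    Path(demo).write_bytes("sz\u0151l\u0151".encode("cp852"))
    convert_to_utf8(demo, "cp852")
    converted = Path(demo).read_bytes().decode("utf-8")

print(converted)
```

Single-byte code pages accept almost any byte sequence, so a wrong guess for `source_encoding` tends to produce garbled text rather than an error; keep a backup and check the result by eye.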


mitya57 · 2022-12-18T19:21:05Z

I think we should always use UTF-8 by default. 1-byte regional encodings are so outdated in 2022, and UTF-8 is the default on Linux anyway.

It shouldn't be a problem for existing documents. ReText uses chardet, so existing documents will be opened/saved with whatever encoding they have, provided that it's detected correctly.

What do you think?

Also, I will fix truncating the file to 0 bytes when the current encoding does not support some characters.

szpeter80 · 2022-12-23T18:19:51Z

As far as I can tell, the Windows installer of Python tries to take care of the "chcp 65001" by including it in its wrapper script; for some reason it is just not effective all the time. It's not Retext's job to fix a Windows/Python compatibility problem.

If you can fix the file truncation problem, that would prevent users from unknowingly shooting themselves in the foot. Thanks!

mitya57 closed this as completed in e30c785 May 21, 2023

mitya57 added a commit that referenced this issue May 28, 2023

tab: Detect encoding errors before opening file for writing

0c89025

Otherwise, if encoding the text fails, the file becomes truncated. Fixes #599. (cherry picked from commit e30c785)
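The idea named in the commit title can be sketched roughly as follows. This is not the code from the commit; it is a minimal illustration that assumes the save routine receives the full text and the target encoding, and `save_text` is a hypothetical name.

```python
import os
import tempfile

def save_text(path, text, encoding):
    """Encode first, then open the file, so a failed encode leaves it untouched."""
    # Raises UnicodeEncodeError before the file is opened (and truncated).
    data = text.encode(encoding)
    with open(path, "wb") as handle:
        handle.write(data)

# Demo: a failed save must not destroy the previously saved content.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "test.md")
    save_text(target, "plain text", "cp437")
    try:
        save_text(target, "national character: \u0151", "cp437")
        failed = False
    except UnicodeEncodeError:
        failed = True
    with open(target, "rb") as handle:
        surviving = handle.read()

print(failed, surviving)
```

Opening the file in write mode is what truncates it, so doing the encoding beforehand means the error surfaces while the old content is still on disk. Writing bytes also skips the newline translation that text mode performs on Windows, which a real implementation may need to handle separately.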
