Encoding problems when using JNA Windows terminal implementation #133

stephan-gh · 2017-06-13T11:24:07Z

I have a very simple command-line app that reads from the console using a LineReader (basically just the example code from https://github.com/jline/jline3/wiki/Using-line-readers):

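import org.jline.reader.EndOfFileException;
import org.jline.reader.LineReader;
import org.jline.reader.LineReaderBuilder;
import org.jline.reader.UserInterruptException;
import org.jline.terminal.Terminal;
import org.jline.terminal.TerminalBuilder;

public class ReaderTest {

    public static void main(String[] args) throws Exception {
        Terminal terminal = TerminalBuilder.terminal();
        LineReader reader = LineReaderBuilder.builder()
                .terminal(terminal)
                .build();

        // Echo every line that is read back to the terminal
        while (true) {
            String line;
            try {
                line = reader.readLine("> ");
                terminal.writer().println(line);
            } catch (UserInterruptException e) {
                // Ctrl+C: ignore and keep reading
            } catch (EndOfFileException e) {
                // Ctrl+D / end of input: exit
                return;
            }
        }
    }

}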

I then set up a standard English Windows 10 system and ran the JAR with the JNA implementation. As long as I use ASCII characters, everything is fine. However, when I switch the keyboard layout to, e.g., Russian, all Cyrillic characters are displayed as ?.

As far as I can see, the JNA implementation reads characters from the console using Kernel32.ReadConsoleInput. Everything is still correct at this point: JLine gets the Unicode char from the Windows API, and when I debug the code, the string builder in readConsoleInput contains the correct Cyrillic character.

However, readConsoleInput then encodes all the characters again using the system's default encoding. On my English system, this is windows-1252, which doesn't support Cyrillic characters. Consequently, each character is already encoded incorrectly at this point (as byte 63, ?).

It seems like the actual InputStreamReader will later decode them using the current code page of the console. However, if all characters are actually read as Unicode, it might be better to keep them that way (or encode them as UTF-8) and instruct the reader to decode using the same encoding. I can imagine that there might still be problems with the output of these characters (since that depends on the code page), but currently even the input fails no matter which code page I select in the console.
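To illustrate the loss described above, here is a minimal sketch using only the JDK. It is not JLine code; the class and method names are made up for this example, and it assumes the windows-1252 charset is available in the running JVM. It encodes a single Cyrillic character once with windows-1252 and once with UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of JLine.
class EncodingLossDemo {

    /** Encodes the text with the given charset and returns the bytes as unsigned values. */
    static int[] encode(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        int[] unsigned = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            unsigned[i] = bytes[i] & 0xFF;
        }
        return unsigned;
    }

    public static void main(String[] args) {
        String yeru = "\u044B"; // Cyrillic small letter yeru, written as an escape to avoid source-encoding issues
        // Assumption: windows-1252 is supported by this JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        // Unmappable characters are replaced by the charset's replacement byte.
        System.out.println(Arrays.toString(encode(yeru, cp1252)));                 // [63], i.e. '?'
        // UTF-8 can represent the character, using two bytes.
        System.out.println(Arrays.toString(encode(yeru, StandardCharsets.UTF_8))); // [209, 139]
    }
}
```

Once the character has been turned into byte 63, no choice of decoder on the reading side can recover it, which matches the observation that the console code page makes no difference.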

stephan-gh · 2017-06-13T11:51:29Z

It seems like changing the system encoding using -Dfile.encoding=UTF-8 allows the characters to be read correctly, but output currently fails, even if I set the code page to 65001 (UTF-8).

stephan-gh · 2017-06-13T12:15:02Z

The UTF-8 output seems to break due to the AnsiOutputStream. With the code page set to 65001 (UTF-8), the following code works correctly:

try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(FileDescriptor.out))) {
    out.write("ыыыыыы");
}

When I add the AnsiOutputStream it fails and prints ��:

try (OutputStreamWriter out = new OutputStreamWriter(new AnsiOutputStream(new FileOutputStream(FileDescriptor.out)))) {
    out.write("ыыыыыы");
}

These characters are encoded as 2 bytes each, so I think this happens because FilterOutputStream splits them up: FileOutputStream.write(int) is called once per byte instead of FileOutputStream.writeBytes(...) being called for the whole sequence.

It works again if I add a BufferedOutputStream after the AnsiOutputStream (probably because it buffers them again and writes them combined):

try (OutputStreamWriter out = new OutputStreamWriter(new AnsiOutputStream(new BufferedOutputStream(new FileOutputStream(FileDescriptor.out))))) {
    out.write("ыыыыыы");
}
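The splitting behaviour can be observed without a console. The following is a rough sketch using only the JDK: a plain FilterOutputStream stands in for AnsiOutputStream (an assumption, based on the statement above that it is a FilterOutputStream), and RecordingStream is a hypothetical sink that records the size of each write call it receives.

```java
import java.io.BufferedOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of JLine or Jansi.
class WriteSplitDemo {

    /** Sink that records how many bytes each write call delivers. */
    static class RecordingStream extends OutputStream {
        final List<Integer> writeSizes = new ArrayList<>();

        @Override
        public void write(int b) {
            writeSizes.add(1);
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writeSizes.add(len);
        }
    }

    /** Writes data through a FilterOutputStream, optionally with a buffer in front of the sink. */
    static List<Integer> writeThroughFilter(byte[] data, boolean buffered) throws IOException {
        RecordingStream sink = new RecordingStream();
        OutputStream target = buffered ? new BufferedOutputStream(sink) : sink;
        // FilterOutputStream stands in for AnsiOutputStream here.
        try (OutputStream out = new FilterOutputStream(target)) {
            out.write(data, 0, data.length);
        }
        return sink.writeSizes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\u044B\u044B".getBytes(StandardCharsets.UTF_8); // two characters, four bytes

        System.out.println(writeThroughFilter(data, false)); // [1, 1, 1, 1]
        System.out.println(writeThroughFilter(data, true));  // [4]
    }
}
```

Without the buffer, the sink sees four single-byte writes, so each 2-byte character reaches it in two separate calls. With the buffer, the bytes arrive together in one call when the stream is flushed.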

stephan-gh · 2017-06-13T12:33:48Z

To summarize, my suggested fix for this issue would be to do the following, either by default or behind some kind of "UTF-8 option":

- Don't encode the input characters using the system charset but rather using UTF-8 (or UTF-16), and use the same charset for the input reader.
- Automatically set the output code page to 65001 (UTF-8) using SetConsoleOutputCP and configure the output stream to use UTF-8 (I'm not sure if this part could cause problems in some environments).
- Add a BufferedOutputStream behind the WindowsAnsiOutputStream so the UTF-8 characters are displayed properly in the console (or find another way to write their bytes together).

With the output code page manually set to 65001, -Dfile.encoding=UTF-8 set, and the BufferedOutputStream behind the WindowsAnsiOutputStream, input of the Cyrillic characters worked fine with the current version of JLine.
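The point about using the same charset on both sides of the input path can be sketched with plain JDK streams. This is only an illustration, not how JLine wires its terminal; the class and method names are invented for this example, and a ByteArrayInputStream stands in for the console input pipe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JLine.
class InputCharsetSketch {

    /** Encodes the typed characters with one charset and decodes the bytes with another. */
    static String encodeThenDecode(String typed, Charset encodeWith, Charset decodeWith) throws IOException {
        byte[] encoded = typed.getBytes(encodeWith);
        StringBuilder decoded = new StringBuilder();
        // The byte array stands in for the stream the terminal hands to its reader.
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(encoded), decodeWith)) {
            int c;
            while ((c = reader.read()) != -1) {
                decoded.append((char) c);
            }
        }
        return decoded.toString();
    }

    public static void main(String[] args) throws IOException {
        String typed = "\u044B\u044B"; // two Cyrillic characters
        Charset utf8 = StandardCharsets.UTF_8;

        // Same charset on both sides: the text survives.
        System.out.println(typed.equals(encodeThenDecode(typed, utf8, utf8)));                        // true
        // Mismatched charsets: the bytes are misinterpreted.
        System.out.println(typed.equals(encodeThenDecode(typed, utf8, StandardCharsets.ISO_8859_1))); // false
    }
}
```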

I could be entirely wrong about this issue, so let me know what you think.

gnodet · 2017-06-15T09:49:38Z

Could you check the above patch? If it works for you, I'll merge it to master.

stephan-gh · 2017-06-15T10:30:44Z

@gnodet Thanks! Your fix solves the problem with the input encoding. The characters are now read correctly, but they are not displayed correctly in the console due to output encoding problems:

- JLine's check to get the charset from the current code page doesn't work for code page 65001 (UTF-8) because, as far as I can see, it's not registered as cp65001 or ms65001 in Java (it's just UTF-8). JLine falls back to the system's default encoding, which doesn't work for the Cyrillic characters.
- The other two points I mentioned in my summary above still apply. It would be nice if JLine could set the console output code page automatically; otherwise, users will have to type chcp 65001 before starting the application.
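A lookup that handles the 65001 case could look roughly like the sketch below. This is not JLine's actual code: the class name, method name, and the order in which the "ms" and "cp" prefixes are tried are assumptions made for illustration. Only the special case for 65001 and the fallback to the default charset follow from the description above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JLine.
class CodePageCharsets {

    /** Windows code page identifier for UTF-8. */
    static final int CP_UTF8 = 65001;

    /** Maps a Windows console code page to a Java charset, falling back to the JVM default. */
    static Charset forCodePage(int codePage) {
        // Code page 65001 is handled explicitly instead of relying on a "cp65001" or "ms65001" name.
        if (codePage == CP_UTF8) {
            return StandardCharsets.UTF_8;
        }
        // Assumed lookup order: try the "ms" prefix first, then "cp".
        for (String prefix : new String[] {"ms", "cp"}) {
            String name = prefix + codePage;
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println(forCodePage(65001)); // UTF-8
        System.out.println(forCodePage(1252));  // windows-1252, assuming the JVM knows the cp1252 alias
    }
}
```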

stephan-gh · 2017-06-16T10:21:26Z

@gnodet Thanks for the additional changes. I managed to get it working and have added two small comments to your commit.

stephan-gh · 2017-06-16T11:54:26Z

@gnodet Latest commit is working correctly now, thanks!

stephan-gh mentioned this issue Jun 13, 2017

Console cyrillic input support? PaperMC/Paper#736

Closed

gnodet added a commit that referenced this issue Jun 15, 2017

Encoding problems when using JNA Windows terminal implementation #133

8b2de8c

gnodet added a commit that referenced this issue Jun 16, 2017

Fix console output code page and add a BufferedWriter, #133

cc688bd

gnodet added a commit to gnodet/jline3 that referenced this issue Jun 16, 2017

Fix things for jline#133

ae265f7

gnodet added a commit to gnodet/jline3 that referenced this issue Jun 16, 2017

Use the buffered output stream at the correct location, jline#133

de2f031

gnodet closed this as completed in b120987 Jun 16, 2017

gnodet self-assigned this Jun 16, 2017

gnodet added the bug label Jun 16, 2017

gnodet added this to the 3.4.0 milestone Jun 16, 2017

gnodet mentioned this issue Sep 6, 2017

shouldn't hardcode the codepage as 65001 for windows terminal #164

Closed

gnodet added a commit that referenced this issue Sep 7, 2017

Attempt to fix both #133 and #164 ...

a947144

gnodet added a commit that referenced this issue Sep 7, 2017

Attempt to fix both #133 and #164 ...

7d33254

stephan-gh mentioned this issue Sep 13, 2017

Use WriteConsoleW to write to Windows console #168

Merged
