Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Platform win32] Fix crash when pipe encoding is set to None #4584

Merged
merged 6 commits into from
Sep 3, 2024

Conversation

siegria
Copy link
Contributor

@siegria siegria commented Aug 8, 2024

If stdout or stderr argument is of type io.stringIO, the function crash because stringIO has it's encoding property set to None.
This issue has been introduced by this commit in version 4.7.0

Contributor Checklist:

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt (and read the README.rst)
  • [x ] I have updated the appropriate documentation

As None is not a valid encoding value, fallback to 'utf-8'. This case happen if stdout or stderr is of type io.stringIO.
@bdbaddog
Copy link
Contributor

bdbaddog commented Aug 8, 2024

Can you add a test? And a blurb to RELEASE.txt and CHANGES.txt ?

@bdbaddog
Copy link
Contributor

bdbaddog commented Aug 8, 2024

What's causing stdout.encoding to be None?

@mwichmann
Copy link
Collaborator

What's causing stdout.encoding to be None?

That's just the way it is (default) - from Python, not from us.

@siegria
Copy link
Contributor Author

siegria commented Aug 9, 2024

Can you add a test? And a blurb to RELEASE.txt and CHANGES.txt ?

I updated RELEASE.txt and CHANGES.txt.
For the test is it not working yet and I am going in holiday for 3 weeks .

@jcbrill
Copy link
Contributor

jcbrill commented Aug 12, 2024

The proposed changes in this PR are similar to an issue that arose in the Godot project. The Godot issue was based on an implementation proposed in a SCons PR that was closed and subsequently implemented in Godot.

For reference:

  1. SCons: silence_msvc implementation issues godotengine/godot#91883 (comment)
  2. SCons: Fix silence_msvc implementation errors godotengine/godot#91890 (comment)

Excerpt from the Godot issue [1] proposed implementation which may apply here as well:

  1. The binary read of the temporary stdout file followed by a decoded write to the stderr stream will likely contain extra newline characters for each line.

    The binary read of the temporary stdout file contains CRLF (i.e., \r\n) line endings. The CRLF sequence remains after decoding. This effectively causes an extra new line for each line of content written.

  2. There may be unnecessary unicode replacement characters introduced in the written stderr content due to the implemented encoding.

    The encoding selected is that of python's stdout stream. However, the temporary stdout file is populated by the output of the msvc compiler and linker commands. SCons uses encoding "oem" in some cases when processing msvc command output.

While the above applied specifically to msvc compiler and linker invocations, the general win32 redirected output is from the invoked program's stdout/stderr redirected to temporary files. The console encoding likely should be used for decoding the temporary output file reads.

It might be worth testing the following:

    if stdout is not None and not stdoutRedirected:
        try:
            with open(tmpFileStdoutName, "rb") as tmpFileStdout:
                output = tmpFileStdout.read()
-               stdout.write(output.decode(stdout.encoding, "replace"))
+               stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
            os.remove(tmpFileStdoutName)
        except OSError:
            pass
    if stderr is not None and not stderrRedirected:
        try:
            with open(tmpFileStderrName, "rb") as tmpFileStderr:
                errors = tmpFileStderr.read()
-               stderr.write(errors.decode(stderr.encoding, "replace"))
+               stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
            os.remove(tmpFileStderrName)
        except OSError:
            pass

It also might be worth checking if multi-line writes to the redirected temporary files actually contain extra newline characters following a decode without the explicit replacement (e.g. .replace("\r\n", "\n")).

I know in some of the tests done when writing the Godot isssue, there were more unicode replacements when using a native Windows language other than English (e.g., German) prior to changing to "oem".

As always, I could be really wrong.

@jcbrill
Copy link
Contributor

jcbrill commented Aug 12, 2024

The following example illustrates the CRLF issue when decoding binary reads of the console output.

Note: any differences in the actual decoding method ("utf-8", "oem", etc., ...), if any, will be addressed in a future example.

Source file:

#include <stdio.h>
int main(int argc, char* argv) {
    printf("hello");
    return 0;
}

SConstruct:

import SCons
import sys
import io
import atexit

mystdout = StringIO()
mystderr = StringIO()

def dump_stringio():
    mystderr.seek(0)
    mystdout.seek(0)
    print()
    print("---------STDERR----------")
    print(mystderr.read())
    print("---------STDOUT----------")
    print(mystdout.read())
    print("-------------------------")

atexit.register(dump_stringio)

def piped_spawn(sh, escape, cmd, args, env):
    from SCons.Platform.win32 import piped_spawn as scons_piped_spawn
    rval = scons_piped_spawn(sh, escape, cmd, args, env, mystdout, mystderr)
    return rval

class EnvironmentFactory:

    program = 'hello'
    program_dir = './src'
    program_files = ['hello.c']

    env_list = []
 
    @classmethod
    def make_program(cls, **kwargs):
        build_n = len(cls.env_list) + 1
        build = '_build{:03d}'.format(build_n)
        print('Build:', build, kwargs, file=sys.stdout)
        VariantDir(build, cls.program_dir, duplicate=0)
        env=Environment(tools=['msvc', 'mslink'], **kwargs)
        env["SPAWN"] = piped_spawn
        build += '/'
        env.Program(build + cls.program, [build + filename for filename in cls.program_files])
        cls.env_list.append(env)
        return env

for kwargs in [
    {'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /fakelink'},
]:
    EnvironmentFactory.make_program(**kwargs)

PR decoding:

  • stdout.write(output.decode(stdout.encoding if stdout.encoding is not None else 'utf-8', "replace"))
  • stderr.write(errors.decode(stderr.encoding if stderr.encoding is not None else 'utf-8', "replace"))
scons: Reading SConscript files ...
Build: _build001 {'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /fakelink'}
scons: done reading SConscript files.
scons: Building targets ...
cl /Fo_build001\hello.obj /c src\hello.c /nologo /showIncludes /fakecl
link /nologo /fakelink /OUT:_build001\hello.exe _build001\hello.obj
scons: done building targets.

---------STDERR----------
cl : Command line warning D9002 : ignoring unknown option '/fakecl'


---------STDOUT----------
hello.c

Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdio.h

Note: including file:  C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt.h

Note: including file:   C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vcruntime.h

Note: including file:    C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\sal.h

Note: including file:     C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\concurrencysal.h

Note: including file:    C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vadefs.h

Note: including file:  C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_wstdio.h

Note: including file:   C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_stdio_config.h

LINK : warning LNK4044: unrecognized option '/fakelink'; ignored


-------------------------

Suggested decoding:

  • stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
  • stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
scons: Reading SConscript files ...
Build: _build001 {'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /fakelink'}
scons: done reading SConscript files.
scons: Building targets ...
cl /Fo_build001\hello.obj /c src\hello.c /nologo /showIncludes /fakecl
link /nologo /fakelink /OUT:_build001\hello.exe _build001\hello.obj
scons: done building targets.

---------STDERR----------
cl : Command line warning D9002 : ignoring unknown option '/fakecl'

---------STDOUT----------
hello.c
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdio.h
Note: including file:  C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt.h
Note: including file:   C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vcruntime.h
Note: including file:    C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\sal.h
Note: including file:     C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\concurrencysal.h
Note: including file:    C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vadefs.h
Note: including file:  C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_wstdio.h
Note: including file:   C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_stdio_config.h
LINK : warning LNK4044: unrecognized option '/fakelink'; ignored

-------------------------

@jcbrill
Copy link
Contributor

jcbrill commented Aug 12, 2024

The following example illustrates a decoding issue when using "utf-8" compared to "oem".

When the language is English there are no differences. When the language is German there are differences.

SConstruct changes:

force_encode_decode = False

def dump_stringio():
    mystderr.seek(0)
    mystdout.seek(0)
    
    errors = mystderr.read()
    output = mystdout.read()
    if force_encode_decode:
        errors = errors.encode(sys.stdout.encoding, errors="replace").decode(sys.stdout.encoding)
        output = output.encode(sys.stdout.encoding, errors="replace").decode(sys.stdout.encoding)
        
    print()
    print("---------STDERR----------")
    print(errors)
    print("---------STDOUT----------")
    print(output)
    print("-------------------------")

atexit.register(dump_stringio)

...

for kwargs in [
    {'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /SUBSYSTEM:CONSOLE,4.0 /fakelink'},
]:
    EnvironmentFactory.make_program(**kwargs)

PR decoding:

  • English

    1. decode="utf-8", force_encode_decode=False
      LINK : warning LNK4010: invalid subsystem version number 4.0; default subsystem version assumed
      
  • German:

    1. decode="utf-8", force_encode_decode=False (unicode error printing)

      UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 1047: character maps to <undefined>
      
    2. decode="utf-8", force_encode_decode=True (replacement character encode/decode)

      LINK : warning LNK4010: Ung?ltige Versionsnummer 4.0; Standardversion des Subsystems wird angenommen.
      
    3. decode="oem", force_encode_decode=False

      LINK : warning LNK4010: Ungültige Versionsnummer 4.0; Standardversion des Subsystems wird angenommen.
      

Suggested decoding:

  • English:

    • decode="oem", force_encode_decode=False
      LINK : warning LNK4010: invalid subsystem version number 4.0; default subsystem version assumed
      
  • German:

    • decode="oem", force_encode_decode=False`
      LINK : warning LNK4010: Ungültige Versionsnummer 4.0; Standardversion des Subsystems wird angenommen.
      

@mwichmann
Copy link
Collaborator

Right... and MSCommon uses oem, as we were advised by someone a while back, once we got to Python versions where that was consistently supported (3.6+). So is there a problem with this?

@jcbrill
Copy link
Contributor

jcbrill commented Aug 12, 2024

Right... and MSCommon uses oem, as we were advised by someone a while back, once we got to Python versions where that was consistently supported (3.6+). So is there a problem with this?

The first issue is that when passing a StringIO object (e.g., obj = StringIO()) for stdout/stderr in the piped spawn is that the obj.encoding is None.

I believe that decode should just use "oem". The temporary file contents are populated by redirecting the output from the invoked process. I believe that the console encoding should be used rather than the python stream encoding.

Instead of using the stream encoding as in this PR:
stdout.write(output.decode(stdout.encoding if stdout.encoding is not None else 'utf-8', "replace")) and
stderr.write(errors.decode(stderr.encoding if stderr.encoding is not None else 'utf-8', "replace"))

Use the "oem" encoding (like MSCommon):
stdout.write(output.decode("oem", "replace")) and
stderr.write(errors.decode("oem", "replace"))

If the temporary file reads remain in binary, then the explicit CR LF sequence needs to be replaced with a NL to avoid the "extra" blank lines:
stdout.write(output.decode("oem", "replace").replace("\r\n", "\n")) and
stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))

The changes for this PR are:

    if stdout is not None and not stdoutRedirected:
        try:
            with open(tmpFileStdoutName, "rb") as tmpFileStdout:
                output = tmpFileStdout.read()
-               stdout.write(output.decode(stdout.encoding, "replace"))
+               stdout.write(output.decode(stdout.encoding if stdout.encoding is not None else 'utf-8', "replace"))
            os.remove(tmpFileStdoutName)
        except OSError:
            pass

    if stderr is not None and not stderrRedirected:
        try:
            with open(tmpFileStderrName, "rb") as tmpFileStderr:
                errors = tmpFileStderr.read()
-               stderr.write(errors.decode(stderr.encoding, "replace"))
+               stderr.write(errors.decode(stderr.encoding if stderr.encoding is not None else 'utf-8', "replace"))
            os.remove(tmpFileStderrName)
        except OSError:
            pass

I believe that this may be more robust:

    if stdout is not None and not stdoutRedirected:
        try:
            with open(tmpFileStdoutName, "rb") as tmpFileStdout:
                output = tmpFileStdout.read()
-               stdout.write(output.decode(stdout.encoding, "replace"))
+               stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
            os.remove(tmpFileStdoutName)
        except OSError:
            pass

    if stderr is not None and not stderrRedirected:
        try:
            with open(tmpFileStderrName, "rb") as tmpFileStderr:
                errors = tmpFileStderr.read()
-               stderr.write(errors.decode(stderr.encoding, "replace"))
+               stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
            os.remove(tmpFileStderrName)
        except OSError:
            pass

CHANGES.txt Outdated Show resolved Hide resolved
@bdbaddog
Copy link
Contributor

I refactored your fix, I'm not a big fan of x if xyz else abc form.
This way we're clearly calling out the invalid None value.

@jcbrill
Copy link
Contributor

jcbrill commented Aug 16, 2024

The refactored code introduced bugs.

The stream encoding attribute may not be writable:

    # Sanitize encoding. None is not a valid encoding.
    # Since we're handling a redirected shell command use
    # the shells default encoding.
    if stdout.encoding is None:
        stdout.encoding = 'oem'  <==== StringIO encoding not writable
    if stderr.encoding is None:
        stderr.encoding = 'oem'  <==== StringIO encoding not writable

Writing to the encoding attribute of a StringIO object is not allowed:

    stdout.encoding = 'oem'
AttributeError: attribute 'encoding' of '_io._TextIOBase' objects is not writable

The original implementation didn't attempt to assign the encoding but rather picked the encoding to pass to the decode call. If one really wants to use the stdout.encoding and stderr.encoding values if they are not None, then temporaries values should be used.

For example:

...
    # Since we're handling a redirected shell command use
    # the shells default encoding.
    if stdout.encoding is None:
        stdout_encoding = 'oem'
    else:
        stdout_encoding = stdout.encoding
    if stderr.encoding is None:
        stderr_encoding = 'oem'
    else:
        stderr_encoding = stderr.encoding
...
                stdout.write(output.decode(stdout_encoding, "replace").replace("\r\n", "\n"))
...
                stderr.write(errors.decode(stderr_encoding, "replace").replace("\r\n", "\n"))

But...

I'm not sure we want to use stdout.encoding and stderr.encoding even if they are not None to decode the temporary files which are the result of redirected output from a (likely) external command.

It may be a good idea to always use "oem" rather than what may be set for the python stream encoding. The temporary file is likely being written by an external process and not necessarily the current python.

Using the python stream encoding versus an external process console encoding seems like an apples versus oranges type of issue.

SCons\Tool\MSCommon\common simply uses "oem" to decode subprocess (not temporary file) output:

    # Ongoing problems getting non-corrupted text led to this
    # changing to "oem" from "mbcs" - the scripts run presumably
    # attached to a console, so some particular rules apply.
    OEM = "oem"
    if cp.stderr:
        # TODO: find something better to do with stderr;
        # this at least prevents errors from getting swallowed.
        sys.stderr.write(cp.stderr.decode(OEM))
    if cp.returncode != 0:
        raise OSError(cp.stderr.decode(OEM))

    return cp.stdout.decode(OEM)

I suggest removing stdout.encoding and stderr.encoding entirely and always using "oem" for decoding the temporary file content:

- # Sanitize encoding. None is not a valid encoding.
-     # Since we're handling a redirected shell command use
-     # the shells default encoding.
-     if stdout.encoding is None:
-         stdout.encoding = 'oem'
-     if stderr.encoding is None:
-         stderr.encoding = 'oem'
...
-               stdout.write(output.decode(stdout.encoding, "replace").replace("\r\n", "\n"))
+               stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
...
-               stderr.write(errors.decode(stderr.encoding, "replace"))
+               stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))

@mwichmann Any thoughts?

@bdbaddog
Copy link
Contributor

bdbaddog commented Sep 2, 2024

@jcbrill as usual your suggested fix and analysis are well thought out!
If I'm understanding correctly though, there's no good way to test this unless we fired up a VM where the system default language was german?

@jcbrill
Copy link
Contributor

jcbrill commented Sep 2, 2024

If I'm understanding correctly though, there's no good way to test this unless we fired up a VM where the system default language was german?

Exactly.

That is how the output fragments above were produced. A windows 11 VMWare virtual machine with german as the default system language and necessary language packs installed.

German was the first non-English system language that I tried when trying to find any encoding issues with the proposed spawn changes for the godot project and the SCons PR that was eventually rejected due to being implemented in/around the spawn code for win32.

Producing compiler warnings and error messages illustrated the extra newline issue.

The details are a little fuzzy now, but there may have also been reliance on either an english word or a phrase order that changed with the language for the original godot-like proposed changes.

Shown above, one of the german warning messages contains the ü character which was replaced with ? when decoding.

@mwichmann
Copy link
Collaborator

Amusingly (maybe), the last time we had a go-round with Windows and non-ascii characters, I set up the Swedish language support, as it's a native language for me, where German is only a learned language (and largely unused for many many years). But it turned out Swedish is not a language the compiler suite produces i18n messages for, so I had to switch that test setup to German (I no longer have that VM)

@bdbaddog bdbaddog merged commit b6c11d1 into SCons:master Sep 3, 2024
7 of 8 checks passed
@mwichmann mwichmann added this to the NextRelease milestone Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants