PEReader - Big Endian issues when getting a string #44805
EDIT: I do not see anything obviously wrong
I do not see how this can be a problem. @nealef Would you be interested in proposing and testing a fix for this?
There was a report a while back where somebody was encountering this issue, and the root cause was that the input buffer changed in the middle of the conversion. Is the machine experiencing this issue a big-endian machine? Do you have a sample payload we can investigate?
@jkotas You are correct. The string is valid and doesn't require any endian-specific fix. However, it does lead to the exception shown above. I have narrowed it down to two problems:
A Couple of Test Cases

Converting Bytes to UTF8

I created a simple test case to see what the behaviour of converting bytes to UTF8 is for little endian and big endian:
For Little Endian I get:
For Big Endian:
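The C# test program and both of its console outputs were lost during extraction. For reference, a minimal Python sketch of the equivalent byte-to-UTF-8 conversion (using a hypothetical "Copyright ©" byte sequence, where © is the two-byte UTF-8 sequence 0xC2 0xA9) shows what the decoded string should be on either architecture:

```python
# Hypothetical byte sequence: "Copyright ©" encoded as UTF-8.
# The last two bytes, 0xC2 0xA9, are the multibyte encoding of ©.
data = bytes([0x43, 0x6F, 0x70, 0x79, 0x72, 0x69, 0x67, 0x68, 0x74,
              0x20, 0xC2, 0xA9])

# Decode the raw bytes as UTF-8. Byte order within a UTF-8 sequence
# is fixed by the encoding itself and independent of machine endianness.
text = data.decode("utf-8")
print(text)  # Copyright ©
```

A correct decoder produces the same string on both little- and big-endian machines, since UTF-8 has no endianness.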
Now this program didn't lead to an exception, but the output is certainly incorrect. There seems to be an attempt at a BOM sequence in the big-endian output. I am not sure it is required; the bytes could be inserted as is.

Using BlobReader ReadUTF8

Next I tried creating a test case that uses BlobReader, with the same byte sequence as in the exception case, to see if I could get it to generate the exception:
On Little Endian:
On Big Endian I get the exception.
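The exception on big endian is consistent with the decoder seeing byte-swapped multibyte sequences. A quick Python illustration (not the .NET code path) of why a swapped UTF-8 pair is invalid:

```python
good = "©".encode("utf-8")   # b'\xc2\xa9' — a valid two-byte UTF-8 sequence
swapped = good[::-1]         # b'\xa9\xc2' — the same bytes in reversed order

print(good.decode("utf-8"))  # ©

try:
    # 0xA9 is a continuation byte (10xxxxxx) with no lead byte before it,
    # so strict UTF-8 decoding must reject the swapped sequence.
    swapped.decode("utf-8")
except UnicodeDecodeError:
    print("swapped bytes are not valid UTF-8")
```

This is why byte-swapped input surfaces as a decoding exception rather than as a wrong-but-decodable string.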
Same Test Case Using a Shorter Sequence

If I reduce the byte sequence to just the first 4 bytes (+ null byte) I also get the exception.

Same Byte Sequence with System.Text.UTF8Encoding.GetString

I also get the exception if I use the first test case with that 4-byte sequence. Here the complete backtrace is:
BlobReader.ReadUTF8 with First Byte Sequence

With the byte sequence from the first test case I don't get the exception, but I do get the same garbled output:
That output is identical to the first example. I can now compare the …
I pinged the build infra crew offline to see if we can get a big-endian machine for testing. Ideally we'd have a CI leg for this. We nominally have very good unit test coverage over this code base.
I can get you an account on one if you'd like (Ubuntu 20.04). I'm curious as to why what looks like a BOM appears in the output when it is not correct there, but in the code it is. Using the Mono 6.8 libraries, the output of the 1st case is correct on big endian:
And with the byte sequence of the 2nd case:
No sign of a BOM sequence. Mono doesn't have a BlobReader, so there's no way to run the 2nd test case with either sequence of bytes.
I haven't had time to comb through the corelib implementation yet, but here's my suspicion: there's a discrepancy in behavior between the UTF-8 validator (which is responsible for counting how many UTF-16 chars would result from a conversion) and the transcoder itself. Thank you for providing the hello world sample. I'll make time tomorrow to sit down and step through the code with pen + paper. I also have a request out to our build lab team to see if we can get our full unit test battery for this API running on a big-endian machine. IMO we have comprehensive unit test coverage here, so getting the battery running on such a machine might flush out any other latent bugs.
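The suspected contract — a counting/validation pass that must agree with the transcoding pass — can be illustrated abstractly. This is a Python sketch of the idea, not the corelib implementation:

```python
def utf16_unit_count(data: bytes) -> int:
    # Phase 1 (validator): how many UTF-16 code units the input should yield.
    # Here we cheat by doing a full round-trip; the real code counts directly.
    return len(data.decode("utf-8").encode("utf-16-le")) // 2

def transcode(data: bytes) -> str:
    # Phase 2 (transcoder): the actual conversion.
    return data.decode("utf-8")

# The two phases must agree for every input: the destination buffer is
# sized from phase 1, so any disagreement with phase 2 corrupts or throws.
sample = "Copyright ©".encode("utf-8")
assert utf16_unit_count(sample) == len(transcode(sample)) == 11
```

If the validator counts bytes in native word order while the transcoder walks them sequentially (or vice versa), the two phases disagree exactly on big-endian machines, matching the symptoms reported here.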
I can send you the entire --trace output that will show all the methods called as well as their parameters and returned values.
The trace may be found at http://148.100.42.157/encoding.trc
Have you had a chance to look at this? If not, given the time of year and the recent release of .NET 5, I can appreciate this is a low priority. Using what you described above, what methods should I look at to see where the discrepancy lies?
This patch appears to fix the problem but I've not tested it with a little endian build yet.
Revised patch set that addresses a couple of other endian issues. The result of these is the following test status:
Fixed by #47981.
When using `FileVersionInfo.GetVersionInfo`, the PEReader uses the standard code path for creating a string from an array of bytes, which takes it to `System.Text.ASCIIUtility:GetIndexOfFirstNonAsciiByte_Default`. That routine assumes the byte ordering is in the native order. However, in PE objects such as, for example, `Microsoft.Build.dll`, the strings are in little-endian order (for example, the © is 0xa9c2 in the PE object but is 0xc2a9 when used on big-endian systems). This leads to `System.Text.ASCIIUtility:WidenAsciiToUtf16` returning a non-zero result and then, eventually, to the exception.

The PE decoding takes place in `BlobReader`, and most of the decoding is endian-aware. This does not seem to be the case for `ReadUTF8`. It would seem that `ReadUTF8` may need to scan the UTF-8 and reorder the multibyte characters before converting to a string.