Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove use of codecvt in PAL #44098

Closed
ThadHouse opened this issue Oct 31, 2020 · 8 comments · Fixed by #44466
Closed

Remove use of codecvt in PAL #44098

ThadHouse opened this issue Oct 31, 2020 · 8 comments · Fixed by #44466
Labels
area-Host untriaged New issue has not been triaged by the area owner

Comments

@ThadHouse
Copy link
Contributor

codecvt was deprecated as of C++ 17, and likely should be removed for future compatibility.

In Windows, its included only in pal.windows.cpp, however it doesn't actually seem to be used anywhere.

In Unix, its included only in pal.linux.cpp, and is used to convert a utf8 string to utf16 string, which there for sure is another way to do this.

I noticed this, because this was added in April, and now causes the armel non tizen build to fail, as it does not have access to codecvt.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 31, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Oct 31, 2020

Tagging subscribers to this area: @vitek-karas, @agocke
See info in area-owners.md if you want to be subscribed.

@am11
Copy link
Member

am11 commented Nov 2, 2020

There is only one usage of codecvt in host for bundle_probe. On Unix, we convert utf8 to utf16 in runtime before invoking the bundle probe; then convert it back to utf8 in host. On Windows, we use utf16 end-to-end for this scenario.

I think there are two approaches to solve this:

  1. flip bundle_probe contract from utf16 to utf8 parameter, and use a well-known MultiByteToWideChar win32 conversion function in pal.windows.cpp. The runtime will invoke bundle_probe with utf8 argument char*, instead of char16_t*.
  2. add a build-time fallback using c16rtomb in pal.unix.cpp based on cmake introspection results: HAVE_CODECVT, HAVE_CUCHAR.

I tried #2, but it does not convert all characters on Linux; I think it relies on the locales installed on the build machine, unlike codecvt which is self-sufficient, when available. The reason for cmake introspection was that the header <cuchar> (which provides c16rtomb) is not available on macOS.

Therefore, while #1 might be riskier; as host might no longer be backwards compatible for some custom scenario which is using bundle_probe, but more reliable solution.. not sure what other risks it might entail.

@janvorli
Copy link
Member

janvorli commented Nov 3, 2020

Since it is used just to convert between utf-8 and utf-16, we could possibly use the code we have in the coreclr PAL - the UnicodeToUTF8 and UTF8ToUnicode methods.

@janvorli janvorli closed this as completed Nov 3, 2020
@janvorli janvorli reopened this Nov 3, 2020
@janvorli
Copy link
Member

janvorli commented Nov 3, 2020

However, it would not be nice to share as the code lives in different locations and it is heavily used by the coreclr PAL (via the MultiByteToWideChar and WideCharToMultiByte).
The option #1 sounds reasonable to me.

@vitek-karas
Copy link
Member

I'm hesitant about #1 - it's changing the interface between runtime and hostpolicy. While we can make the change (the two components are shipped and versioned together), we typically try to avoid making breaking changes to the interface - just to be safe.

I'm no expert on Linux so a "stupid" question: Linux doesn't have a UTF16->UTF8 conversion function available?

If there's no better option #1 would work..

@janvorli
Copy link
Member

janvorli commented Nov 3, 2020

Hmm, I didn't know the UTF16 is a public contract. Too bad we didn't catch this when that was created, as UTF16 is kind of a Windowsism that we were trying to avoid on Unix (that's why the coreclr hosting APIs all use UTF-8).
As for conversion function on Linux, there is an iconv library available for Linux that can do all kinds of charset conversions. But I'd prefer not to add a dependency on a library if we don't have to.

@vitek-karas
Copy link
Member

I was just trying to push back to make sure we don't make the interface change lightly. I do agree that it should have been UTF8 from the beginning (my bad as well).

I think it should be OK to change the interface (as I said, those two components are shipped together and AFAIK there's no "extension" point - as in nobody implements its own hostpolicy, nor I've heard about anybody using the bundle probe when directly hosting coreclr.dll)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 19, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Host untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants