UTF_16LE.encode does not encode string to UTF-16 LE correctly? #31

Closed
itn3000 opened this issue Apr 20, 2018 · 6 comments

@itn3000

itn3000 commented Apr 20, 2018

Environment

rustc --version output:

rustc 1.27.0-nightly (0b72d48f8 2018-04-10)

and my encoding_rs version is 0.7.2.

Steps to reproduce

Run the following program:

extern crate encoding_rs;

use encoding_rs::UTF_16LE;

fn main() {
    let s = "aa";
    let (bytes, enc, unmappable) = UTF_16LE.encode(s);
    let (dec, enc, unmappable) = UTF_16LE.decode(&bytes);
    for i in dec.chars() {
        println!("{}", i as i32)
    }
    println!("{}", dec);
}

Expected

The program outputs the following text:

97
0
97
0
aa

Actual

The program outputs the following text (24929 = 0x6161):

24929
慡
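
The number 24929 can be reproduced by hand. The following is a minimal sketch using only the standard library, assuming the encode call emitted the two bytes 0x61 0x61 for "aa" rather than four UTF-16LE bytes. The helper name `decode_one_utf16le_unit` is hypothetical and exists only for illustration.

```rust
/// Hypothetical helper (not part of encoding_rs): interprets two bytes as a
/// single little-endian UTF-16 code unit and returns that unit together with
/// the character it denotes, or `None` if the unit is a lone surrogate.
fn decode_one_utf16le_unit(bytes: [u8; 2]) -> (u16, Option<char>) {
    // Low byte first: [0x61, 0x61] becomes 0x6161.
    let unit = u16::from_le_bytes(bytes);
    (unit, char::from_u32(u32::from(unit)))
}
```

Feeding it `[0x61, 0x61]` yields the code unit 0x6161 (24929 in decimal), which is a single CJK character instead of two `a` characters.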
@hsivonen
Owner

This is expected documented behavior.

The lack of UTF-16LE/BE encoders is mentioned in the rustdocs (also at the top of the docs for Encoding). The spec shows no encoders for them.

See also this tweet and the follow-up tweet.

@hsivonen
Owner

let (bytes, enc, unmappable) = UTF_16LE.encode(s);

Note that the phenomenon you discovered is precisely why the enc part is in the return tuple.

@itn3000
Author

itn3000 commented Apr 20, 2018

Thank you for the response.
I got it.

@udoprog

udoprog commented Nov 11, 2019

Got bitten by this, thinking that encoding_rs would actually abide by the configuration it was given. In my opinion, this calls for some kind of error rather than silently* ignoring the requested encoding.

*: when using different APIs like encode_from_utf8_without_replacement.

Note that I'm using encoding_rs in a non-browser context. So I can't take the position that other encodings just shouldn't be used. I do however respect that as the stance of this project!

@hsivonen
Owner

Yes, the behavior arising from the "output encoding" concept in the Encoding Standard is probably surprising (but documented!) when applied to situations other than URL parsing and HTML form submission. Still, as documented, while this crate has non-Web secondary uses, the design follows the Encoding Standard to the point that secondary things like mappings from Windows code page numbers are outside this crate.

As noted earlier, there seems to be a need in the Rust ecosystem for a crate that serves non-Web uses which, for legacy reasons, must generate UTF-16LE or UTF-16BE output to conform to a format that requires one of them (so that using UTF-8 doesn't work). However, this crate is not that crate.

Notably, such functionality is unlikely to need to fit into the Encoding objects from this crate. Situations that accept the run-time dynamism of some &'static Encoding usually mean you can choose to use encoding_rs::UTF_8. Situations where only either UTF-16LE or UTF-16BE is required tend to be statically known, so it would be OK to have a UTF-16LE/BE encoder crate that does not integrate with encoding_rs::Encoding.
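
To illustrate how small such a standalone encoder can be, here is a minimal sketch using only the standard library. The function names `encode_utf16le` and `encode_utf16be` are hypothetical and not part of encoding_rs, and the sketch assumes that no byte order mark is wanted.

```rust
/// Hypothetical helper (not part of encoding_rs): encodes `s` as UTF-16LE
/// bytes without a byte order mark.
fn encode_utf16le(s: &str) -> Vec<u8> {
    // `str::encode_utf16` yields UTF-16 code units, including surrogate
    // pairs for characters above U+FFFF; each unit is written low byte first.
    s.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Hypothetical helper (not part of encoding_rs): same as above, but each
/// code unit is written high byte first (UTF-16BE).
fn encode_utf16be(s: &str) -> Vec<u8> {
    s.encode_utf16()
        .flat_map(|unit| unit.to_be_bytes())
        .collect()
}
```

Because the byte order is chosen by calling one function or the other, nothing here needs a run-time `&'static Encoding` value.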

@udoprog

udoprog commented Nov 11, 2019

Right. The issue I raised is that certain APIs don't communicate output encoding AFAIU, like encode_from_utf8_without_replacement. And it would be less of a footgun if they did somehow.

Besides that, std and byteorder combined have everything you need to actually perform the encoding:

(Taken from my project)

fn encode_utf16<B>(buf: &mut Vec<u8>, s: &str)
where
    B: byteorder::ByteOrder,
{
    for c in s.encode_utf16() {
        buf.extend(std::iter::repeat(0x0).take(2));
        let s = buf.len() - 2;
        B::write_u16(&mut buf[s..], c);
    }
}

3 participants