Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support encoding UTF-16 #6549

Closed
wants to merge 1 commit into from
Closed

Conversation

kirawi
Copy link
Member

@kirawi kirawi commented Apr 2, 2023

Resolves #6542

Comment on lines +396 to +404
if encoding == encoding::UTF_16BE {
for ch in rope.chunks().flat_map(|chunk| chunk.encode_utf16()) {
writer.write_u16(ch).await?;
}
} else if encoding == encoding::UTF_16LE {
for ch in rope.chunks().flat_map(|chunk| chunk.encode_utf16()) {
writer.write_u16_le(ch).await?;
}
} else {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the only actual changes. Everything else is because of indentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a lot of duplication and not fully clear on why UTF-16 would need a custom workaround?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because encoding-rs doesn't support encoding utf-16. It only supports decoding utf-16

@kekePower
Copy link

Hi.

I tried this patch with the file supplied in the bug report and saw no changes after opening and saving.

% file utf16-6.txt
utf16-6.txt: ISO-8859 text

@Alexis-Lapierre
Copy link
Contributor

Alexis-Lapierre commented Apr 2, 2023

When to_writer is called on utf16.txt, it get an Encoding { windows-1252 } as argument, so the code in the pull request is not executed.

Due to this, the behaviour of editing utf16.txt is the same that on master.

Edit: This value is from from_reader when using chardetng::EncodingDetector, and the encoding detector guess windows-1252.

@kirawi
Copy link
Member Author

kirawi commented Apr 2, 2023

So it looks like chardetng doesn't support detecting UTF-16 as it expects it to be handled by the BOM (which #6497 should handle).

@Alexis-Lapierre
Copy link
Contributor

Alexis-Lapierre commented Apr 2, 2023

So it looks like chardetng doesn't support detecting UTF-16 as it expects it to be handled by the BOM (which #6497 should handle).

To detect BOM status on #6497 I used encoding_rs for_bom to search if bom is present. We could adapt what I did to something like this in from_reader:

let encoding = encoding
    .or_else(|| {
        encoding::Encoding::for_bom(&buf)
            .map(|(encoding, _)| encoding)
        })
    .unwrap_or_else(|| {
        let mut encoding_detector = chardetng::EncodingDetector::new();
        encoding_detector.feed(&buf, is_empty);
        encoding_detector.guess(None, true)
});

encoding_rs for_bom can detect UTF8, UTF16BE and UTF16LE for us, and we fall back on chardetng if no BOM is found.

@kirawi kirawi added E-easy Call for participation: Experience needed to fix: Easy / not much A-helix-term Area: Helix term improvements S-waiting-on-review Status: Awaiting review from a maintainer. C-bug Category: This is a bug labels Apr 2, 2023
}
} else if encoding == encoding::UTF_16LE {
for ch in rope.chunks().flat_map(|chunk| chunk.encode_utf16()) {
writer.write_u16_le(ch).await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling the writer directly is pretty ineffient and will cause a ton of syscalls. You want to instead use the buffering code below.

I think a better approach would be to essentially create an.

enum Encoder {
 Utf16Be
 Utf16Le,
 encodingRs(..)
}

and then implement utf16 encoding with the same interface the encoding_rs uses so we don't duplicate the buffering logic here

@pascalkuthe
Copy link
Member

@kirawi #6497 has an implementation for encoding utf-16 that incorporates the review comments from here and has some tests so we could close this PR in favor of that

@kirawi kirawi closed this Apr 26, 2023
@kirawi kirawi deleted the utf-16-encoding branch April 26, 2023 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-helix-term Area: Helix term improvements C-bug Category: This is a bug E-easy Call for participation: Experience needed to fix: Easy / not much S-waiting-on-review Status: Awaiting review from a maintainer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF16: reading file is fine, but writing file is garbage
5 participants