Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for different encodings #2385

Merged
merged 2 commits into from
May 17, 2022
Merged

Conversation

Raudius
Copy link
Contributor

@Raudius Raudius commented May 11, 2022

Summary

Previously we would just assume files would use UTF-8. Now we make sure to convert files to UTF-8 when we open them. Without configuration just a small set of encodings are supported:

 'UTF-8', 'GB2312', 'GBK', 'BIG-5', 'SJIS-win', 'EUC-JP', 'Windows-1252', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'

More encodings can be configured via the PHP config value: mbstring.detect_order

@Raudius Raudius requested review from a team, vinicius73 and max-nextcloud and removed request for a team May 11, 2022 14:12
@Raudius Raudius force-pushed the enh/support_encodings branch 6 times, most recently from de3fb09 to 4a725a3 Compare May 12, 2022 15:30
@vinicius73
Copy link
Member

As we are using mbstring functions, it is important to define it in composer.json
Although it is already required by server.

https://github.com/nextcloud/server/blob/9e1d4d17ec792d8abcec2f9262472f5bb7a43acf/composer.json#L21-L29

--- a/composer.json
+++ b/composer.json
@@ -17,7 +17,8 @@
         }
     ],
     "require": {
-      "php": "^7.3|^8.0"
+      "php": "^7.3|^8.0",
+      "ext-mbstring": "*"
     },
     "scripts": {

What do you think?

@Raudius Raudius force-pushed the enh/support_encodings branch 2 times, most recently from 207a4da to 2919658 Compare May 13, 2022 10:37
@juliusknorr
Copy link
Member

@Raudius Could you do a rebase to update the branch rather then merge? That would make the commit history linear and a bit cleaner :)

@juliusknorr
Copy link
Member

Found two more reportings on other encoding issues when doing a bit of triage:
#1633
#824

@Raudius Raudius force-pushed the enh/support_encodings branch 5 times, most recently from 5299d57 to ff8f19a Compare May 16, 2022 16:45
Comment on lines 47 to 58
$detected_encoding = $this->encodingService->detectEncoding(file_get_contents($file));
if ($this->isSupportedEncoding($encoding)) {
$this->assertNotNull($detected_encoding);
}

$original_order = mb_detect_order();
$this->assertNotFalse(mb_detect_order($encoding));

$detected_encoding = $this->encodingService->detectEncoding(file_get_contents($file));
$this->assertEquals($encoding, $detected_encoding);

mb_detect_order($original_order);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick on the code style 😉 I guess we currently don't run php-cs-fixer on the tests directory.

Suggested change
$detected_encoding = $this->encodingService->detectEncoding(file_get_contents($file));
if ($this->isSupportedEncoding($encoding)) {
$this->assertNotNull($detected_encoding);
}
$original_order = mb_detect_order();
$this->assertNotFalse(mb_detect_order($encoding));
$detected_encoding = $this->encodingService->detectEncoding(file_get_contents($file));
$this->assertEquals($encoding, $detected_encoding);
mb_detect_order($original_order);
$detectedEncoding = $this->encodingService->detectEncoding(file_get_contents($file));
if ($this->isSupportedEncoding($encoding)) {
$this->assertNotNull($detectedEncoding);
}
$originalOrder = mb_detect_order();
$this->assertNotFalse(mb_detect_order($encoding));
$detectedEncoding = $this->encodingService->detectEncoding(file_get_contents($file));
$this->assertEquals($encoding, $detectedEncoding);
mb_detect_order($originalOrder);

Copy link
Member

@juliusknorr juliusknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor code style comment left on the tests, otherwise LGTM

@Raudius Raudius force-pushed the enh/support_encodings branch 2 times, most recently from cdd653a to 86a2c38 Compare May 17, 2022 06:54
Signed-off-by: Raul <raul@nextcloud.com>
Signed-off-by: Raul <raul@nextcloud.com>
@juliusknorr juliusknorr merged commit c72b9e1 into master May 17, 2022
@delete-merged-branch delete-merged-branch bot deleted the enh/support_encodings branch May 17, 2022 07:51
@juliusknorr
Copy link
Member

/backport to stable24

@juliusknorr
Copy link
Member

/backport to stable23

@juliusknorr
Copy link
Member

/backport to stable22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported successfully backported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't show Chinese word corrlecty, Maybe it's because GB2312 and GBk encoding is not supported
4 participants