UTF-8 issue when decode back #37

Open
industral opened this issue Apr 11, 2023 · 1 comment
@industral

I'm using the official example from the README, but with UTF-8 curly quotes in the input string:

Encoded this string looks like:  [
   1212, 318,   281, 1672,
    564, 250, 34086,  594,
    447, 250,   284, 1949,
  21004, 503,   319,    0
]
We can look at each token and what it represents
{ token: 1212, string: 'This' }
{ token: 318, string: ' is' }
{ token: 281, string: ' an' }
{ token: 1672, string: ' example' }
{ token: 564, string: ' �' }
{ token: 250, string: '�' }
{ token: 34086, string: 'sent' }
{ token: 594, string: 'ence' }
{ token: 447, string: '�' }
{ token: 250, string: '�' }
{ token: 284, string: ' to' }
{ token: 1949, string: ' try' }
{ token: 21004, string: ' encoding' }
{ token: 503, string: ' out' }
{ token: 319, string: ' on' }
{ token: 0, string: '!' }
We can decode it back into:
 This is an example “sentence“ to try encoding out on!

As you can see, decoding token by token does not produce the correct characters, but decoding the entire array at once works fine.

@niieani

niieani commented Apr 15, 2023

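That's because the tokenizer works on UTF-8 bytes, and a multi-byte character (a curly quote, an emoji) can be split across several tokens. You need enough tokens to make up the full character before it can be decoded.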
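To illustrate with the curly quote from the issue: in UTF-8, “ (U+201C) is the three bytes e2 80 9c. The sketch below assumes, for illustration only, that the first two bytes land in one token and the last byte in the next; it uses just the standard TextEncoder/TextDecoder globals and not this library's API.

```javascript
// U+201C (left curly quote) encodes to three bytes in UTF-8: e2 80 9c.
const bytes = new TextEncoder().encode('“');

// Assumed split for illustration: two bytes in one token, one in the next.
const firstPart = bytes.slice(0, 2);
const secondPart = bytes.slice(2);

// A fresh decoder per chunk mimics decoding each token in isolation.
const decodeChunk = (chunk) => new TextDecoder('utf-8').decode(chunk);

// Each incomplete piece becomes U+FFFD (the replacement character).
const separately = decodeChunk(firstPart) + decodeChunk(secondPart);

// Decoding all bytes together recovers the original character.
const together = decodeChunk(bytes);

console.log(JSON.stringify(separately), JSON.stringify(together));
```

Decoded separately, each piece is an invalid byte sequence on its own, which is why the per-token listing shows replacement characters while the full decode is correct.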
If you'd like to decode token-by-token, use the decodeGenerator or decodeAsyncGenerator (if you have tokens flowing in asynchronously) from my version in #38.

See my example here to get a better understanding of the problem.
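As a rough sketch of the idea behind incremental decoding (this is not the decodeGenerator from #38, whose implementation is not shown here): buffer incomplete byte sequences and only emit text once a whole character has arrived. The helper name `decodeChunks` and the per-token byte chunks below are hypothetical; the only real API used is the standard TextDecoder with its `stream` option.

```javascript
// Hypothetical helper: yields text as soon as whole characters are
// available from a sequence of per-token byte chunks.
function* decodeChunks(chunks) {
  const decoder = new TextDecoder('utf-8');
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence buffered
    // inside the decoder instead of emitting U+FFFD for it.
    const text = decoder.decode(chunk, { stream: true });
    if (text !== '') yield text;
  }
  // Flush the decoder; emits U+FFFD only if input ended mid-character.
  const rest = decoder.decode();
  if (rest !== '') yield rest;
}

const enc = new TextEncoder();
const quote = enc.encode('“'); // e2 80 9c

// Assumed token boundaries, chosen to split the quote across two chunks.
const chunks = [
  enc.encode(' example'),
  Uint8Array.of(0x20, quote[0], quote[1]), // space plus first two bytes
  Uint8Array.of(quote[2]),                 // final byte of the quote
  enc.encode('sent'),
];

const pieces = [...decodeChunks(chunks)];
console.log(pieces);
```

With this approach a chunk may yield less text than its bytes suggest (or nothing at all), and the remainder appears with a later chunk, so no replacement characters are produced for valid input.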
