UTF8 FormC encoding for outcomes #89

Closed
NicolasDorier opened this issue Sep 16, 2020 · 4 comments

NicolasDorier commented Sep 16, 2020

I think this one is relatively important.

UTF-8 encoding is not "canonical": there are several ways to encode the same characters. So when one wants to hash a string, the string should first be normalized, for example with Normalization Form Compatibility Decomposition (NFKD), and then encoded as UTF-8.

This is really important: if the oracle and the users do not encode the outcome in exactly the same way, they will hash different bytes and the attestation will be meaningless.

There are several normalization forms. I picked NFKD mainly because it is the one used by BIP39, which needs to deal with the same issue.
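To illustrate the concern, here is a minimal sketch in Python, assuming the outcome is hashed with SHA-256 over its UTF-8 bytes. The helper name `outcome_hash` is hypothetical and not part of any specification; the normalization form is passed explicitly because the thread had not settled on one at this point.

```python
import hashlib
import unicodedata


def outcome_hash(outcome: str, form: str) -> str:
    """Hypothetical helper: normalize, encode as UTF-8, then SHA-256."""
    normalized = unicodedata.normalize(form, outcome)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The same visible word, written with two different code point sequences.
composed = "\u00e9l\u00e9phant"      # precomposed U+00E9
decomposed = "e\u0301le\u0301phant"  # "e" followed by combining acute U+0301

# Without normalization the UTF-8 bytes differ, so the hashes would differ too.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# With a shared normalization form, both spellings hash to the same value.
hashes_equal = outcome_hash(composed, "NFKD") == outcome_hash(decomposed, "NFKD")

print(raw_equal, hashes_equal)
```

Any of the four forms would make the two spellings agree here; what matters for the attestation is that the oracle and the users apply the same one.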

Contributor

LLFourn commented Sep 16, 2020

This is an important point to bring up. What if neither party does any normalization at all?
The oracle provides an announcement signature over some (TBD) description of all the outcomes. So it makes sense to encode the character sequences in the outcome in the same way as they were in the announcement data rather than normalizing them.

Perhaps we can recommend rejecting the oracle's announcement message if the strings in the descriptor are not normalized.
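A check of this kind could be sketched as follows, assuming the announcement exposes its outcomes as a list of strings. The function name `announcement_is_acceptable` and the list-of-strings shape are assumptions made for illustration; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def announcement_is_acceptable(outcomes, form):
    """Hypothetical check: accept only if every outcome is already normalized."""
    return all(unicodedata.is_normalized(form, s) for s in outcomes)


# Precomposed characters are already in NFC, so this list passes.
ok = announcement_is_acceptable(["\u00e9l\u00e9phant", "yes"], "NFC")

# A decomposed spelling is not in NFC, so this list would be rejected.
rejected = not announcement_is_acceptable(["e\u0301le\u0301phant"], "NFC")

print(ok, rejected)
```

With this approach the user never rewrites the oracle's strings; the bytes that were signed in the announcement are the bytes that get hashed.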

This Stack Overflow question about PHP makes it seem like NFC or NFD is the least radical: https://stackoverflow.com/questions/7931204/what-is-normalized-utf-8-all-about

Christewart added this to the v0.1 milestone Sep 16, 2020
Author

NicolasDorier commented Sep 17, 2020

éléphant
Without normalization: c3a96cc3a97068616e74
éléphant
FormD: 65cc816c65cc817068616e74
éléphant
FormC: c3a96cc3a97068616e74
éléphant
FormKD: 65cc816c65cc817068616e74
éléphant
FormKC: c3a96cc3a97068616e74
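The hex strings above can be reproduced with a short Python sketch using the standard `unicodedata` module. The input is written with escape sequences so that copy/paste cannot silently change its form; the variable names are illustrative only.

```python
import unicodedata

word = "\u00e9l\u00e9phant"  # "éléphant" with precomposed U+00E9

# UTF-8 bytes of the word under each of the four normalization forms.
hex_by_form = {
    form: unicodedata.normalize(form, word).encode("utf-8").hex()
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, hex_bytes in hex_by_form.items():
    print(form, hex_bytes)
```

For this word the composed forms (NFC, NFKC) agree with each other, and so do the decomposed forms (NFD, NFKD), matching the values listed above.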

It seems that copy/paste does not preserve the KD/D forms.
Forms KC and C at least seem to match what is typed on a keyboard. As you pointed out, NFKD does not seem like a good choice, but neither does D.

I don't really understand the compatibility part (the K).
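As a small illustration of what the K changes, the sketch below compares NFC and NFKC on two characters that have compatibility mappings: the "fi" ligature (U+FB01) and superscript two (U+00B2). These sample characters are chosen for illustration and do not come from the thread.

```python
import unicodedata

ligature = "\ufb01"   # single-character "fi" ligature
squared = "x\u00b2"   # "x" followed by superscript two

# Canonical forms (NFC/NFD) leave compatibility characters untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfc_squared = unicodedata.normalize("NFC", squared)

# Compatibility forms (NFKC/NFKD) replace them with plain equivalents,
# which loses the original distinction.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_squared = unicodedata.normalize("NFKC", squared)

print(nfc_ligature == ligature, nfkc_ligature)
print(nfc_squared == squared, nfkc_squared)
```

In other words, the K forms fold more strings together than C or D do, so two outcomes that look different on screen could end up with the same bytes.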

I see a good reason for using C: it matches what people copy/paste. So if an oracle forgets to normalize, there should be no consequences in most cases.

However, I also see a good reason for using D: because it does not match what people copy/paste, a failure to normalize will be immediately apparent.

Author

@LLFourn I changed my code to use FormC. My impression is that it does not really matter which form we use, as long as we all pick the same one.

NicolasDorier changed the title from "UTF8 NKFD encoding for outcomes" to "UTF8 FormC encoding for outcomes" Oct 1, 2020
nkohen closed this as completed Dec 4, 2020