UTF8 FormC encoding for outcomes #89

NicolasDorier · 2020-09-16T02:22:57Z

I think this one is relatively important.

UTF8 encoding is not "canonical": there are several ways to encode the same characters. So when one wants to hash a string, one should use UTF-8 with Normalization Form Compatibility Decomposition (NFKD).

This is really important: if the oracle and the users don't encode the outcome in exactly the same way, the attestation will be meaningless.

There are several normalization forms. I picked NFKD mainly because it is the one used by BIP39, which needs to deal with the same issue.
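To make the problem concrete, here is a minimal Python sketch. It assumes the outcome string is hashed with SHA-256 over its UTF-8 bytes (the hash function is an assumption, it is not stated in this thread). The strings are built from explicit code points so that copy/paste cannot change their form.

```python
import hashlib
import unicodedata

# The same visible text, written as two different code point sequences.
composed = "\u00e9l\u00e9phant"      # "é" as a single code point (U+00E9)
decomposed = "e\u0301le\u0301phant"  # "e" followed by combining acute (U+0301)

# Both render as "éléphant", but they are not equal as strings...
assert composed != decomposed
# ...and only become equal once brought to a common normalization form.
assert unicodedata.normalize("NFKD", composed) == unicodedata.normalize("NFKD", decomposed)

# Hashing the raw UTF-8 bytes (SHA-256 is assumed here) gives different digests.
digest_composed = hashlib.sha256(composed.encode("utf-8")).hexdigest()
digest_decomposed = hashlib.sha256(decomposed.encode("utf-8")).hexdigest()
print(digest_composed == digest_decomposed)  # False
```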

LLFourn · 2020-09-16T08:35:11Z

This is an important point to bring up. What if both parties simply don't normalize at all?
The oracle provides an announcement signature over some (TBD) description of all the outcomes. So it makes sense to encode the character sequences in the outcome in the same way as they were in the announcement data rather than normalizing them.

Perhaps we can recommend rejecting the oracle's announcement message if the strings in the descriptor are not normalized.
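A rough sketch of such a check in Python (3.8 or later, for `unicodedata.is_normalized`). The function name `all_outcomes_normalized` and the default form are hypothetical. The announcement format itself is still TBD, so the outcomes are simply taken as a list of strings.

```python
import unicodedata

def all_outcomes_normalized(outcomes, form="NFC"):
    """Return True only if every outcome string is already in `form`.

    `form` is one of "NFC", "NFD", "NFKC", "NFKD"; the default is an
    assumption for illustration, not something fixed by the spec.
    """
    return all(unicodedata.is_normalized(form, outcome) for outcome in outcomes)

# A client could refuse an announcement whose outcomes fail this check.
print(all_outcomes_normalized(["\u00e9l\u00e9phant"]))      # True
print(all_outcomes_normalized(["e\u0301le\u0301phant"]))    # False
```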

This Stack Overflow question (about PHP) makes it seem like NFC or NFD is the least radical: https://stackoverflow.com/questions/7931204/what-is-normalized-utf-8-all-about

NicolasDorier · 2020-09-17T03:40:31Z

éléphant
Without normalization: c3a96cc3a97068616e74
éléphant
FormD: 65cc816c65cc817068616e74
éléphant
FormC: c3a96cc3a97068616e74
éléphant
FormKD: 65cc816c65cc817068616e74
éléphant
FormKC: c3a96cc3a97068616e74
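The byte strings above can be reproduced with a short Python sketch. It assumes the .NET names FormC, FormD, FormKC and FormKD correspond to NFC, NFD, NFKC and NFKD, and it builds the input from explicit code points so the result does not depend on how this page was copied.

```python
import unicodedata

text = "\u00e9l\u00e9phant"  # "éléphant" with precomposed é (U+00E9)

# UTF-8 bytes of the text under each normalization form, as lowercase hex.
hex_by_form = {
    form: unicodedata.normalize(form, text).encode("utf-8").hex()
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, hex_bytes in hex_by_form.items():
    print(form, hex_bytes)
# NFC c3a96cc3a97068616e74
# NFD 65cc816c65cc817068616e74
# NFKC c3a96cc3a97068616e74
# NFKD 65cc816c65cc817068616e74
```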

It seems that copy/paste does not preserve the KD/D forms.
Forms KC and C at least seem to match what is typed on a keyboard... NFKD does not seem a good choice, as you pointed out, but neither does D.

I don't really understand the compatibility part (the K).
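For what it's worth, the difference can be seen on characters that have a "compatibility" equivalent, such as ligatures or superscripts. The sketch below is only an illustration in Python: the canonical forms (C/D) leave such characters alone, while the K forms replace them with their plain equivalents, which changes the bytes that get hashed.

```python
import unicodedata

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI, renders as a single "fi" glyph
superscript = "x\u00b2"  # "x" followed by SUPERSCRIPT TWO

# Canonical composition (NFC) keeps compatibility characters as they are.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfc_superscript = unicodedata.normalize("NFC", superscript)

# Compatibility composition (NFKC) folds them into plain characters.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_ligature == ligature)  # True
print(nfkc_ligature)             # fi
print(nfkc_superscript)          # x2
```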

I see a good reason for using C: it matches what people copy/paste. So if an oracle forgets to normalize, there should be no consequence in most cases.

However, I also see a good reason for using D: because it does not match what people copy/paste, a failure to normalize will be immediately apparent.

NicolasDorier · 2020-09-17T13:24:47Z

@LLFourn I changed my code to use FormC. My impression is that it does not really matter which form we pick, as long as we all pick the same one.
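As a sketch of what "use FormC" could look like outside of .NET, here is a Python equivalent. The helper name `outcome_digest` is hypothetical and SHA-256 is an assumption (this thread does not say which hash is applied to the outcome). The point is only that both parties normalize to NFC before encoding to UTF-8.

```python
import hashlib
import unicodedata

def outcome_digest(outcome: str) -> bytes:
    """Normalize to NFC, encode as UTF-8, then hash (SHA-256 assumed)."""
    normalized = unicodedata.normalize("NFC", outcome)
    return hashlib.sha256(normalized.encode("utf-8")).digest()

# Composed and decomposed inputs now produce the same digest.
same = outcome_digest("\u00e9l\u00e9phant") == outcome_digest("e\u0301le\u0301phant")
print(same)  # True
```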

NicolasDorier · 2020-10-01T12:38:18Z

Test vectors

https://github.com/dgarage/NDLC/blob/d816c0c517611b336f09ceaa43d400ecb5ab909b/NDLC.Tests/Data/normalization_tests.json

Christewart added this to the v0.1 milestone Sep 16, 2020

Christewart added the oracle label Sep 16, 2020

Christewart mentioned this issue Sep 28, 2020

Oracle Address proposal #99

Closed

NicolasDorier changed the title ~~UTF8 NKFD encoding for outcomes~~ UTF8 FormC encoding for outcomes Oct 1, 2020

nkohen closed this as completed Dec 4, 2020
