Canonical form #5

Kubuxu · 2016-09-26T07:38:12Z

Base 128 varints, aka Protobuf Varints, in theory can be expanded to more than one byte even if there is no need for it.
Example:
0000 0001 is one
and
1000 0000, 0000 0001 is also a one.

We should decide what to do with them, either allow them to be stored and used or just accept them but canonize them on a fly or just disallow them to prevent hash ambiguity.

The text was updated successfully, but these errors were encountered:

Kubuxu · 2016-09-26T07:39:46Z

In my opinion it might be worth considering making them invalid and thus preventing their usage. As those varints won't be user input, it means that they can be created only by software and accepting those can lead to software that produces them even if they are discouraged.

jbenet · 2016-11-12T20:23:40Z

protobuf varints are in wide use for cryptographic applications. so if the claim is that protobuf varints are inconsistent/unstable, then we should find out what other crypto systems using protobufs do.
I believe all protobuf implementations default to the most compressed varint representation. meaning superfluous zeros are not created, though they can be read.
There seem to be two claims:
- (a) "protobuf varint implementations read varints which may have superfluous leading zeros, which change the hash of an "otherwise-equal" value"
- (b) "protobuf varint implementations write varints which may have superfluous leading zeros" (i believe this is false).
(b) would definitely be a problem, but i think it's false.
I'm not sure (a) is a problem.
if we say (a) is a problem, im not sure what we would do with protobufs users have and want to extract exactly the same.

I do know that:

o1 = <user-object>
d1 = protobuf.EncodeWithSuperfluousZeros(o1)
o2 = protobuf.Encode(d1)
d2 = protobuf.Decode(o2)

bytes.Equal(d2, d2) # may be false !!!

and this is definitely wonky.

but i haven't thought through alternatives to know that similar or worse wonkiness isn't to be found on the other end. have you? what makes you absolutely certain that no edgecase wonkiness like this is at the other side?

Kubuxu · 2016-11-12T20:57:28Z

I haven't found any software currently that does the case (b) but it doesn't include a chance that someone creates some software. I am unsure about implications of something like that apart from infinity many addresses to exactly the same object, possible increased storage size if ever some software produces such address in link, as de-duplication of linking objects won't work.

Kubuxu · 2016-11-12T21:03:18Z

In the multiformats/multicodec#16 @diasdavid wanted to define SHA1 as: 0b00010001 and udp as 0b10000000 00010001 which decode to the same value but would mean different things. This was IMO invalid use of varints and thus my concerns raised.

The issue itself isn't really about protobufs but varints inside multiformats suite.

ivan386 · 2016-12-18T02:02:55Z

I changed varint in golang/protobuf so now may not be the same values.

New varint with unique values
In the old code, they have the same value (0)
In the new code, they have:
varint [0] = 0
varint [128 0] = 128
varint [128 128 0] = 16512

func EncodeVarint(x uint64) []byte {
	var buf [maxVarintBytes]byte
	var n int
	for n = 0; x > 127; n++ {
		buf[n] = 0x80 | uint8(x&0x7F)
		x >>= 7
		x -= 1 // it added
	}
	buf[n] = uint8(x)
	n++
	return buf[0:n]
}

func DecodeVarint(buf []byte) (x uint64, n int) {
	for shift := uint(0); shift < 64; shift += 7 {
		if n >= len(buf) {
			return 0, 0
		}
		b := uint64(buf[n])
		n++
		x += (b & 0xFF) << shift // is changed
		if (b & 0x80) == 0 {
			return x, n
		}
	}

	// The number is too large to represent in a 64-bit value.
	return 0, 0
}

jmitchell · 2017-03-06T00:53:22Z

If different unsigned-varints can possibly decode to the same natural number, the specification should address how unsigned-varints are compared for equality. Without that, developers may be tempted to compare at the byte-level, and there could be multiple (and possibly conflicting) identifiers for the same thing.

A canonical encoding would make byte-level comparisons acceptable. To clarify what I (and I think @Kubuxu) mean by canonical encoding, given the functions

encode :: Nat -> [Byte]
decode :: [Byte] -> Nat

for all Nats n it should be true that

decode (encode n) == n

However, some bytes may be invalid and not map to a Nat, e.g. 1 << 15, so it wouldn't be the case that encode (decode bs) == bs for all [Byte]s bs.

(a) "protobuf varint implementations read varints which may have superfluous leading zeros, which change the hash of an "otherwise-equal" value"
...
if we say (a) is a problem, im not sure what we would do with protobufs users have and want to extract exactly the same.

Parsers tend to be flexible in what they consume, even accepting undesirable encodings. This strategy can be beneficial in situations where the intent of the encoded information won't be misconstrued. However, in this case where unsigned-varints are intended to signify identity it's especially important to have a cohesive explicit strategy in the spec.

If undesirable unsigned-varint byte arrays (e.g. 1 << 15) are accepted, their interpretation should be specified. Additionally, if we go this route it may be worth talking about supporting two separate multiformat hashing input strategies:

raw bytes (I'm fairly sure this is what's used now)
parsed AST, so inconsistencies like multiple distinct unsigned-varints decoding to the same identifier go away

Stebalien · 2021-04-21T21:49:48Z

This has been fixed.

Kubuxu mentioned this issue Nov 9, 2016

The unified multicodecs theory multiformats/multicodec#16

Merged

2 tasks

jmitchell mentioned this issue Mar 6, 2017

Is 0 supported? How? #10

Closed

Stebalien closed this as completed Apr 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Canonical form #5

Canonical form #5

Kubuxu commented Sep 26, 2016

Kubuxu commented Sep 26, 2016

jbenet commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

ivan386 commented Dec 18, 2016 •

edited

Loading

jmitchell commented Mar 6, 2017 •

edited

Loading

Stebalien commented Apr 21, 2021

Canonical form #5

Canonical form #5

Comments

Kubuxu commented Sep 26, 2016

Kubuxu commented Sep 26, 2016

jbenet commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

ivan386 commented Dec 18, 2016 • edited Loading

jmitchell commented Mar 6, 2017 • edited Loading

Stebalien commented Apr 21, 2021

ivan386 commented Dec 18, 2016 •

edited

Loading

jmitchell commented Mar 6, 2017 •

edited

Loading