Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canonical form #5

Closed
Kubuxu opened this issue Sep 26, 2016 · 7 comments
Closed

Canonical form #5

Kubuxu opened this issue Sep 26, 2016 · 7 comments

Comments

@Kubuxu
Copy link
Member

Kubuxu commented Sep 26, 2016

Base 128 varints, aka Protobuf Varints, in theory can be expanded to more than one byte even if there is no need for it.
Example:
0000 0001 is one
and
1000 0000, 0000 0001 is also a one.

We should decide what to do with them, either allow them to be stored and used or just accept them but canonize them on a fly or just disallow them to prevent hash ambiguity.

@Kubuxu
Copy link
Member Author

Kubuxu commented Sep 26, 2016

In my opinion it might be worth considering making them invalid and thus preventing their usage. As those varints won't be user input, it means that they can be created only by software and accepting those can lead to software that produces them even if they are discouraged.

@jbenet
Copy link
Member

jbenet commented Nov 12, 2016

  • protobuf varints are in wide use for cryptographic applications. so if the claim is that protobuf varints are inconsistent/unstable, then we should find out what other crypto systems using protobufs do.
  • I believe all protobuf implementations default to the most compressed varint representation. meaning superfluous zeros are not created, though they can be read.
  • There seem to be two claims:
    • (a) "protobuf varint implementations read varints which may have superfluous leading zeros, which change the hash of an "otherwise-equal" value"
    • (b) "protobuf varint implementations write varints which may have superfluous leading zeros" (i believe this is false).
  • (b) would definitely be a problem, but i think it's false.
  • I'm not sure (a) is a problem.
  • if we say (a) is a problem, im not sure what we would do with protobufs users have and want to extract exactly the same.

I do know that:

o1 = <user-object>
d1 = protobuf.EncodeWithSuperfluousZeros(o1)
o2 = protobuf.Encode(d1)
d2 = protobuf.Decode(o2)

bytes.Equal(d2, d2) # may be false !!!

and this is definitely wonky.

but i haven't thought through alternatives to know that similar or worse wonkiness isn't to be found on the other end. have you? what makes you absolutely certain that no edgecase wonkiness like this is at the other side?

@Kubuxu
Copy link
Member Author

Kubuxu commented Nov 12, 2016

I haven't found any software currently that does the case (b) but it doesn't include a chance that someone creates some software. I am unsure about implications of something like that apart from infinity many addresses to exactly the same object, possible increased storage size if ever some software produces such address in link, as de-duplication of linking objects won't work.

@Kubuxu
Copy link
Member Author

Kubuxu commented Nov 12, 2016

In the multiformats/multicodec#16 @diasdavid wanted to define SHA1 as: 0b00010001 and udp as 0b10000000 00010001 which decode to the same value but would mean different things. This was IMO invalid use of varints and thus my concerns raised.

The issue itself isn't really about protobufs but varints inside multiformats suite.

@ivan386
Copy link

ivan386 commented Dec 18, 2016

I changed varint in golang/protobuf so now may not be the same values.

New varint with unique values
In the old code, they have the same value (0)
In the new code, they have:

varint [0] = 0
varint [128 0] = 128
varint [128 128 0] = 16512
func EncodeVarint(x uint64) []byte {
	var buf [maxVarintBytes]byte
	var n int
	for n = 0; x > 127; n++ {
		buf[n] = 0x80 | uint8(x&0x7F)
		x >>= 7
		x -= 1 // it added
	}
	buf[n] = uint8(x)
	n++
	return buf[0:n]
}

func DecodeVarint(buf []byte) (x uint64, n int) {
	for shift := uint(0); shift < 64; shift += 7 {
		if n >= len(buf) {
			return 0, 0
		}
		b := uint64(buf[n])
		n++
		x += (b & 0xFF) << shift // is changed
		if (b & 0x80) == 0 {
			return x, n
		}
	}

	// The number is too large to represent in a 64-bit value.
	return 0, 0
}

@jmitchell
Copy link

jmitchell commented Mar 6, 2017

If different unsigned-varints can possibly decode to the same natural number, the specification should address how unsigned-varints are compared for equality. Without that, developers may be tempted to compare at the byte-level, and there could be multiple (and possibly conflicting) identifiers for the same thing.

A canonical encoding would make byte-level comparisons acceptable. To clarify what I (and I think @Kubuxu) mean by canonical encoding, given the functions

encode :: Nat -> [Byte]
decode :: [Byte] -> Nat

for all Nats n it should be true that

decode (encode n) == n

However, some bytes may be invalid and not map to a Nat, e.g. 1 << 15, so it wouldn't be the case that encode (decode bs) == bs for all [Byte]s bs.

(a) "protobuf varint implementations read varints which may have superfluous leading zeros, which change the hash of an "otherwise-equal" value"
...
if we say (a) is a problem, im not sure what we would do with protobufs users have and want to extract exactly the same.

Parsers tend to be flexible in what they consume, even accepting undesirable encodings. This strategy can be beneficial in situations where the intent of the encoded information won't be misconstrued. However, in this case where unsigned-varints are intended to signify identity it's especially important to have a cohesive explicit strategy in the spec.

If undesirable unsigned-varint byte arrays (e.g. 1 << 15) are accepted, their interpretation should be specified. Additionally, if we go this route it may be worth talking about supporting two separate multiformat hashing input strategies:

  1. raw bytes (I'm fairly sure this is what's used now)
  2. parsed AST, so inconsistencies like multiple distinct unsigned-varints decoding to the same identifier go away

@Stebalien
Copy link
Member

This has been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants