Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Produce a complete spec for proto2 and proto3 #6188

Closed
bufdev opened this issue May 29, 2019 · 9 comments
Closed

Produce a complete spec for proto2 and proto3 #6188

bufdev opened this issue May 29, 2019 · 9 comments
Assignees
Labels
documentation inactive Denotes the issue/PR has not seen activity in the last 90 days. P3

Comments

@bufdev
Copy link

bufdev commented May 29, 2019

Last night, I read Ian Lance Taylor's commentary on Golang being Google's language, and it has a very poignant observation:

All successful languages have either a single
specification or a single primary implementation.  (Go and C++ are
examples of language based on a specification; Perl, at least before
Perl 6, is an example of a language based on an implementation).
These serve as the definition of what the language is: whatever the
specification says or whatever the implementation does.

I've been using various OSS Protobuf parsers for the better part of 18 months now, and this really hit home. Protobuf is an example of the latter, namely that protoc that serves as the definition of what the Protobuf language is. This is fine for most cases, but there is a lot of unlocked potential as to what can be built for Protobuf that is held back by this fact. The easiest example is for linters and formatters, but there are many more, and not just in the static analysis realm. The only way to be 100% certain you have a valid parsing of Protobuf is to shell out to protoc --descriptor_set_out --include_imports --include_source_info and use the generated FileDescriptorSet, which in itself is unwieldy to use for such a purpose.

As illustrative examples, I've been working in Golang primarily for a while, and there are 10-20 Protobuf parsing libraries on GitHub that attempt to handle Protobuf parsing natively in Golang, however there are only two I've found that come close to actually handling the Protobuf spec.

The first library is emicklei/proto. This is an attempt at doing a parse of Protobuf that does not deal with FileDescriptorSets, and is relatively easy to use for static analysis since it produces file:line:column information and parses comments. However, even given the extensive effort of the maintainer (Extensive with a capital "E"), it's not even close. First, emicklei/proto doesn't purport to decide whether your files are valid Protobuf, it will produce a parsed result for files that aren't valid, for example missing semicolons. This means that for any actual use of it, you first need to verify that files are valid Protobuf using protoc. Second, due to all the various undocumented edge cases, especially around options, it took a lot of work to identify and fix many parsing failures. I used emicklei/proto to build Prototool's linting and formatting functionality, and through Prototool's adoption, Prototool effectively became a massive integration test on emicklei/proto, which led us to finding a lot more issues. This trial-and-error method was the only practical way to find all the issues. While most have been found, to this day new things come up. For example, I ran emicklei/proto through a test yesterday outside of Prototool against googleapis, and just recently a multi-line field declaration was added for the first time, which led to a new issue being filed.

The second library, is jhump/protoreflect's desc/protoparse package by @jhump. Josh effectively attempted to codify the Protobuf specification himself, I am in awe of how close this comes to protoc. jhump/protoparse does actually validate whether or not a Protobuf file is valid, and it also generates FileDescriptorSets - in fact, Josh wrote goprotoc as a protoc replacement in Golang, which comes tantalizingly close to matching protoc-generated FileDescriptorSets on a byte-by-byte level. I've been hitting jhump/protoreflect and jhump/goprotoc with a tiny-sized battering ram (relative to what we could have, see below) in pedgeio/goprotoctest, and it's getting awfully close - it already produces byte-equivalent FileDescriptorSets for every directory in googleapis when using --include_imports. However, through this testing, we still identified some issues that had to be fixed, and there are some outstanding issues that Josh is working on.

I can't emphasize enough how impressed I've been with this library - I codified every issue found in emicklei/proto within goprotoctest and it passed with flying colors. However, even given Josh's gargantuan effort here, since there's no official Protobuf spec, we're left to doing trial-and-error against known Protobuf libraries, which doesn't instill a ton of confidence. While I think goprotoc is close to matching protoc, it's tough when this is the best that can be practically done.

Both of these serve to illustrate the issue with Protobuf being defined by an implementation and not a specification - if you want to do anything that doesn't use the implementation, you're out of luck. There is a proto2 spec and proto3 spec posted on developers.google.com, but the dirty secret is that neither of these are fully accurate, which is why developers like @emicklei and @jhump have to do such massive efforts to get this right. Of note, despite the fact that those specs are not accurate, there are a bunch of parsers relying on those specs (search "protobuf parser" on GitHub and look around if you are interested, or I can follow up with examples) that are being used right now.

This is holding the Protobuf community back - if you want to parse Protobuf files and do anything with them, your options are:

  • Shell out to protoc, which is not great for many reasons, and only produces FileDescriptorSets which are not always the best representation for certain kinds of work.
  • Build a library that comes close, but still requires trial-and-error, and is subject to small
    edge cases codified in protoc.

There are so many things that could be built with Protobuf if this wasn't the case.

What I'm proposing:

  1. The Protobuf team produces a complete spec for both proto2 and proto3. Josh's proto.y is a decent place to start, although Golang-specific. As part of this, a downloadable EBNF file containing the grammar would be great (I am no compilers expert however, one of the things I regret not learning).
  2. (The big one) Optimally, protoc uses this spec itself for parsing. The specification needs to be become the definition of the language, not the protoc implementation, and optimally this means that protoc uses it like any other library would. This is not 100% necessary and could have drawbacks, but given that Protobuf parsing is so quick as it is, it would be great.
  3. (Less important but nice to have) Google maintains file-to-FileDescriptorSet libraries when possible - protoc already does this since that's the main function of protoc, but if a spec were available, having this in i.e. google.golang.org/protobuf should be much easier. This would help eliminate the slight differences in generated FileDescriptorSets in other languages.

This is a major task. Actually producing a spec is probably not the biggest issue - even with the options edge cases, this should not be a big blocker. However, moving protoc to this is a potentially internet-breaking change if there are bugs. Testing byte-equivalence of FileDescriptorSets for a couple repositores like goprotoctest does wouldn't be enough, you'd want to do this integration testing across all of Google's internal Protobuf definitions. There's also the problem of the existing code people may be importing from src/google/protobuf/compiler - unless you bump to 4.0 (which I do not think you want to do, I think this should be a backwards-compatible change), you need to keep that code around.

What I'd roughly propose:

  • Create the actual specifications for proto2 and proto3.
  • Write the replacement parsing library to go from Protobuf files to FileDescriptorSets in i.e. src/google/protobuf/compiler2, and put it's use behind a flag such as --experimental_parsing as part of a release.
  • Verify byte-equivalence for FileDescriptorSets for every definition at Google you are allowed to get your hands on that will run protoc --descriptor_set_out and protoc --descriptor_set_out --experimental parsing for the matrix of --include_imports and --include_source_info on and off. How to do this would be the subject of a lot of debate, some proposals: a massive integration test against the internal monorepo, which will take forever, or rewrite internal Protobuf build rules to call --descriper_set_out and --descriptor_set_out --experimental_parsing across the include_imports/include_source_info matrix for any invocation of protoc. If you detect a difference, internally log the path to the files, and the diff. Run this for a few months. This second proposal might not go down very well, as you're going to be logging a lot of internal information you don't want to be, and it will slow down people's builds in theory, although minimally since protoc is so quick.

I know this is a big ask, but Protobuf has been getting a lot more adoption over the last few years, and for it to really go to the next level, we need a common agreed-upon spec to build tooling around. Look at Golang's static analysis tool growth as a great example of why this is so useful - a big part of why I like working in Golang is I have so much tooling I can rely on that was produced outside of Google. Beyond static analysis tools, this could eventually lead to a world where Protobuf code generators no longer need to be written as protoc plugins that take FileDescriptorSets as input, instead working directly from the Protobuf files themselves.

Let me know any thoughts here, this is just a proposal, easier said than done. I appreciate the time you took to read this if you've gotten this far :-)

@jhump
Copy link
Contributor

jhump commented May 29, 2019

I wholeheartedly agree that a published official spec would be hugely valuable to the community.

A spec should be a lot more than just the grammar. The grammar defines the syntax (and is currently available on the developer site, albeit old/wrong), but a spec should also be clear about semantics. Formal language specs usually interleave the two. The Go spec is a relatively simple one and I think a good example: https://golang.org/ref/spec. It is typical in that it describes the type system and other language constraints with prose. (While there are actually formal ways to describe semantics and type systems, but I don't think I've ever seen them used outside of a compilers course.)

The developers site has a lot of information on the semantics. But it would be nice if it were wound up with the syntax in a real language spec. Also, there are some places where documentation is quite light (especially around some of the more complicated aspects, such as how options are validated/interpreted).

I believe it is possible to produce a solid spec without adding an experimental parser. The downside is that they can get out of sync (changes in the parser not properly reflected in the spec and vice versa). But, from what I understand, most significant compilers use hand-written parsers for (1) performance or (2) ability to construct more user-friendly error messages when syntax issues are encountered (or both).

As far as making sure a hand-written parser stays in sync w/ the published grammar, perhaps a fuzzer could handle that job. I'm thinking something that can examine the spec and from it produce all kinds of valid and invalid inputs. It could then feed those through the hand-written parser to make sure it accepts valid programs and rejects invalid ones, per the grammar. (I guess that could very well be a significant effort as well... TBH, I think most languages don't even try to have that sort of validation and instead use careful code review guidelines to make sure the two don't get out of sync...)

@ghost
Copy link

ghost commented May 31, 2019

Yes, this is a very big ask indeed. It is a problem that we are aware of, and would like to address, but it is not something that we have bandwidth to prioritize at this moment. That said, we have created https://github.com/protocolbuffers/protobuf-grammar repo, and would love to get community contribution to it!

@ghost ghost added the help wanted label May 31, 2019
@wangkirin
Copy link
Contributor

@haon4 good news , I also found that repo few days ago but it seems there is nothing in it
I'm interested in contributing on it , do you have any plan of milestone about that project ?

@bufdev
Copy link
Author

bufdev commented Jun 1, 2019

This as a community project defeats the purpose of creating a definitive grammar/spec that protocolbuffers/protobuf holds itself to - without that component, having a community-written spec is no improvement over the current situation from what I can see, and in fact could create even more fragmentation and confusion, just my two cents.

@dsnet
Copy link
Contributor

dsnet commented Jun 5, 2019

It's not so much a "community-written spec" that I see as an issue, but that there needs to be a formal statement that the spec is the authority on correctness, not the C++ protoc implementation, as has been the de-facto pseudo-standard.

In other words, once a spec is written, any divergence from the spec in an implementation should be a considered a bug and the implementation fixed rather than the spec (except for exceptional cases).

For example, in the early days of Go there were two implementations of Go, the original gc compiler and the gccgo compiler. Divergence in behavior were often deemed a bug and the spec was the authority on defining correct behavior. Sometimes gc had to change behavior, sometimes gccgo had to change. Rarely was the spec changed to suit either implementation.

@nilslice
Copy link
Contributor

nilslice commented Jun 7, 2019

I'd like to submit a PR to Google's hiring staff to merge @pedgeio onto the team to work on said spec!! 😄

@ghost
Copy link

ghost commented Jun 10, 2019

To be honest, I highly doubt anyone within Google can come up with a spec and say that it correctly matches the protoc implementation either. What I am really looking for is this:

  1. A draft specification of the protoc parser
  2. A set conformance tests to ensure all parser implementations are based on the spec

Once (1) and (2) are done, and we have good confidence that they match what we want, then we can declare that the specification is finalized, and any divergence from the spec should be fixed. If we do need to update the spec in the future, we will need to come up with a process to notify all parser implementations, as well as providing new conformance tests to include the changes in the revision.

If you would like to contribute, please send PRs to https://github.com/protocolbuffers/protobuf-grammar. We can help review them.

@jvolkman
Copy link
Contributor

jvolkman commented Sep 1, 2022

There's now a published spec for text format, which is also used in the .proto language for long-form custom option values.

https://developers.google.com/protocol-buffers/docs/text-format-spec

Copy link

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

@github-actions github-actions bot added the inactive Denotes the issue/PR has not seen activity in the last 90 days. label Apr 20, 2024
@bufdev bufdev closed this as completed Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation inactive Denotes the issue/PR has not seen activity in the last 90 days. P3
Projects
None yet
Development

No branches or pull requests

9 participants