Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APIv2: protoreflect.FileDescriptor should be able to preserve source info #859

Closed
jhump opened this issue May 24, 2019 · 23 comments
Closed

Comments

@jhump
Copy link
Contributor

jhump commented May 24, 2019

With the current state of the api-v2 branch, there is no way to construct a protoreflect.FileDescriptor (using "sanctioned" APIs) and be able to retain the source info.

Is this intentional?

The only way I can find is to use the mechanism that generated code uses -- using protoimpl.FileBuilder to create it instead of protodesc.NewFile. And that involves type asserting the result to interface{ProtoLegacyRawDesc() []byte} and calling that method, which is documented as likely to be removed in the future.

Should I have to deal with that myself -- like provide a wrapper struct that embeds a file descriptor and also retains source info? It would also have to provide custom methods to convert back to a proto since the ToFileDescriptor would return a descriptor proto that does not have the source info.

Maybe NewFile should at least be parameterized (or a new peer function) that makes creation of a descriptor with source info possible.

@dsnet dsnet added the api-v2 label May 24, 2019
@dsnet
Copy link
Member

dsnet commented May 24, 2019

Thanks for the issue.

Is this intentional?

More of an oversight. The reflect/protodesc package should preserve this information round-trip. Which implies that perhaps this should also be surfaced in the reflect/protoreflect API.

@jhump
Copy link
Contributor Author

jhump commented May 24, 2019

@dsnet, thanks for the fast reply! That is reassuring to hear.

cc @pedgeio

@bufdev
Copy link

bufdev commented May 24, 2019

@dsnet do you want a hand with this or should we leave it to you? I have an interest in this being available sooner than later

@neild
Copy link
Contributor

neild commented May 24, 2019

FWIW, C++ provides a GetSourceLocation method on descriptors:

// Updates |*out_location| to the source location of the complete
// extent of this message declaration.  Returns false and leaves
// |*out_location| unchanged iff location information was not available.
bool GetSourceLocation(SourceLocation* out_location) const;

// NB, all indices are zero-based.
struct SourceLocation {
  int start_line;
  int end_line;
  int start_column;
  int end_column;

  // Doc comments found at the source location.
  // See the comments in SourceCodeInfo.Location (descriptor.proto) for details.
  std::string leading_comments;
  std::string trailing_comments;
  std::vector<std::string> leading_detached_comments;
};

Something similar doesn't seem unreasonable. I agree that we should preserve this information round-trip.

@bufdev
Copy link

bufdev commented May 24, 2019

It would be great if SourceCodeInfo_Locations were surfaced for each type, both directly and in a clean manner, ie GetStartLine(), GetEndLine(), GetStartColumn(), GetEndColumn(), GetLeadingComments(), GetTrailingComments(), GetLeadingDetachedComment() effectively

@bufdev
Copy link

bufdev commented May 29, 2019

FYI there's a bunch of code from jhump/protoreflect desc that could be ported over here with an intermediate amount of work.

https://github.com/jhump/protoreflect/blob/f0c0d116896c934f55895bc1cbaf3a43c0dd58a3/desc/descriptor.go#L311
https://github.com/jhump/protoreflect/blob/f0c0d116896c934f55895bc1cbaf3a43c0dd58a3/desc/convert.go#L80
https://github.com/jhump/protoreflect/blob/f0c0d116896c934f55895bc1cbaf3a43c0dd58a3/desc/internal/source_info.go

That might be a good starting point. I'd love to be able to use https://godoc.org/google.golang.org/protobuf/reflect/protoreflect right now but without writing a heavy adapter layer, this is what I'd need. Let me know what you think.

@neild
Copy link
Contributor

neild commented May 29, 2019

The first step is to come up with a concrete API design. I expect implementation should be fairly straightforward once we have that.

A few questions to answer:

  1. How do we represent the source info? Struct, interface, something else? Do we have separate data structures representing the path and other location information, or a combined one?
  2. Does each protoreflect.*Descriptor type have a method to look up the location information for that descriptor?
  3. How do you fetch location information not associated with a Descriptor? (A method on FileDescriptor?)

@neild
Copy link
Contributor

neild commented May 29, 2019

Note for self: Once we add this, we'll also want to update the protogen package to make use of it. (It has its own, more specialized handling of location information.)

@dsnet
Copy link
Member

dsnet commented May 29, 2019

Thanks @pedgeio for point to that prior work. I'm out of the office for this week, and will be back on Monday.

I think a relevant question to answer first is whether reflect/protoreflect should have first-class support for SourceCodeInfo or not.

  • If not, then we can simply add a FileDescriptor.SourceCodeInfo method that returns a *descriptorpb.SourceCodeInfo and document that the method either always copies the contents upon each call or document that the user must not mutate the retrieved contents. This is pretty much what we do with the Options messages.
  • If yes, then we probably want to design a structured API that only allows read-only access to the source-code information and also allows efficient look-up of locations by path.

One argument for first-class support is that the compiler/protogen package itself seems to carry along the raw descriptors only for the source code information. In fact protogen represents the location map in nearly the same way as jhump/protoreflect.

@bufdev
Copy link

bufdev commented May 29, 2019

This is just a rough sketch of what I have in my head, I'm sure there are edge cases I'm not considering right now.

To answer 1, sticking with the design of reflect/protoreflect, I'd probably propose the following:

// SourceCodeInfo represents the source code info of an element.
type SourceCodeInfo interface {
  GetStartLine() uint32
  // returns GetStartLine() if not set in SourceCodeInfo.Location
  GetEndLine() uint32
  GetStartColumn() uint32
  GetEndColumn() uint32
  GetLeadingComments() string
  GetTrailingComments() string
  GetLeadingDetachedComments() string
}

Then in the simplest form, you would just add this to protoreflect.Descriptor:

type Descriptor interface {
  ...

  // SourceCodeInfo returns the SourceCodeInfo for this Descriptor.
  //
  // Support for this functionality is optional and may return nil.
  SourceCodeInfo() SourceCodeInfo
}

This would be the simplest for the end-user.

Point three is where this gets interesting though, and touches on jhump/protoreflect#212 and jhump/protoreflect#209. FileDescriptorSets have unassociated SourceCodeInfo.Locations, and you'd like protodesc.ToFileDescriptorProto(protodesc.NewFile(input)) == input. You'd have to store these in some way, I guess as you said perhaps store these on the FileDescriptor as an optional value.

@neild
Copy link
Contributor

neild commented May 29, 2019

Note that leading_detached_comments is a repeated field, so either that needs to be a []string or some opaque representation of a list of strings.

@bufdev
Copy link

bufdev commented May 29, 2019

@neild whoops mental copy/paste error, missed that :-)

@dsnet on whether this should be a first-class citizen, I'm going to just give my vote/pleading that it should be - this would make stuff a lot easier for a lot people. You can do the lookups one time within a SourceCodeInfo and then store this information on a given Descriptor, and users don't have to rewrite that logic each time. Exposing the associated SourceCodeInfo.Location gets you half-way there, but if you're going that far, you might as well do it as a first-class citizen completely. If you do this, the lookup shouldn't have to happen - you just pretty much store a copy of the information on each type (i.e internal/prototype.fileDesc, internal/prototype.fieldDesc, etc).

@dsnet
Copy link
Member

dsnet commented May 29, 2019

I'm going to just give my vote/pleading that it should be - this would make stuff a lot easier for a lot people.

I'm inclined to agree.

Then in the simplest form, you would just add this to protoreflect.Descriptor:

The downside to this is that SourceCodeInfo is not always tied to a descriptor declaration, and can be tied to pretty much anything.


Another possible design:

type FileDescriptor interface {
    ...
    SourceLocations() SourceLocations
}

type SourceLocations interface {
    Len() int
    Get(int) SourceLocation
    ByDescriptor(Descriptor) SourceLocation
    ByPath([]int32) SourceLocation
}

type SourceLocation struct {
    StartLine, StartColumn int
    EndLine,   EndColumn   int
    LeadingComments        string
    TrailingComments       string

    // DetachedComments is a copy of the detached comments.
    // This is rarely populated, and perhaps a copy is okay.
    // Alternatively, we can make this an opaque type.
    DetachedComments []string
}

@bufdev
Copy link

bufdev commented May 29, 2019

I'd be 100% on board with that solution as well - that solves @neild's third point in a cleaner way.

A few notes:

  • I am not convinced you need to expose SourceLocations#Len or SourceLocations#Get, however maybe to make sure there are no breaking changes in the future, not a bad idea.
  • I would either make SourceLocation an interface, return a pointer to the struct, return (SourceLocation, bool), or return (SourceLocation, error) - while really FileDescriptor.SourceLocations should return nil or not nil, and if not nil, ByDescriptor always returns a valid value, I could see a situation where someone uses a Descriptor not associated with a FileDescriptor, gets back an empty SourceLocation, and continues on as if it is valid. Some way to denote this via the function call would be good.
  • I would document EndLine as 0 if it is equal to StartLine, or explicitly say it has to be set as opposed to what SourceCodeInfo.Location does.

@bufdev
Copy link

bufdev commented May 29, 2019

Semi-related: protocolbuffers/protobuf#6188

@bufdev
Copy link

bufdev commented May 29, 2019

One issue that comes to mind with the above signature for SourceLocations and re: the comment about interface vs pointer/bool/error return value - the Get(Descriptor) method might not be what you want to do, instead still exposing GetSourceCodeInfo() on each Descriptor, which internally can hold a reference to a path []int32 and then just call ByPath itself. This is to prevent the ability to call GetDescriptor for Descriptors not linked to the SourceLocations and underlying FileDescriptor.

@neild
Copy link
Contributor

neild commented May 29, 2019

We do need a way to get a full list of SourceLocations, including ones not attached to a descriptor, to permit round-tripping a FileDescriptor through reflect/protodesc.

Returning (SourceLocation, bool) seems better than returning a zero value for a missing location.

I don't have a strong opinion on SourceLocations.ByDescriptor vs. Descriptor.SourceLocation; the former isolates location information into a single place, but the latter is slightly simpler to use.

@bufdev
Copy link

bufdev commented May 29, 2019

Yea I actually re-read the proposal and realized the relative need for Len and Get due to the original problem you mentioned (problem 3).

I'd argue for Descriptor.SourceLocation so that there's no possibility of doing:

fileDescriptorOne  := ...
fileDescriptorTwo := ...
// some FieldDescriptor from fileDescriptorOne
fieldDescriptor := fileDescriptorOne.Messages().Get(1).Fields().Get(2)
// invalid
sourceLocation := fileDescriptorTwo.SourceLocations().ByDescriptor(fieldDescriptor)

@bufdev
Copy link

bufdev commented Jun 5, 2019

Just re-pinging here :-) Let me know if you need any help/hoping you had a good week off @dsnet.

One thing I noticed is that you might actually want to expose GetSourceLocationPath() on Descriptor in some way, which returns the base path for the type. This would allow you to do things like:

fileDescriptor := ...
// some FieldDescriptor from fileDescriptor
fieldDescriptor := fileDescriptor.Messages().Get(1).Fields().Get(2)
// get the base location
fileDescriptor.SourceLocations().ByPath(fieldDescriptor.GetSourceLocationPath())
// get the source location of the field name
fileDescriptor.SourceLocations().ByPath(append(fieldDescriptor.GetSourceLocationPath(), 1))
// get the source location of the field number
fileDescriptor.SourceLocations().ByPath(append(fieldDescriptor.GetSourceLocationPath(), 2))

This isn't particularly clean, but this is the general idea.

@dsnet
Copy link
Member

dsnet commented Jun 11, 2019

Just an update to avoid leaving you hanging.

Right now my attention is focused on getting CL/175458 merged, which is a signifcant change to the protoreflect.Message API in response to feedback from usage inside Google. This has higher priority since usage of reflection inside Google is quickly increasing. It is easier to get breaking changes like this in sooner than later.

This issue is my next priority afterwards. I can assure you that protoreflect will have some way to preserve source info. The exact API is TBD, but there's a lot of good discussion and proposed designs here.

@bufdev
Copy link

bufdev commented Jun 11, 2019

OK no worries - thanks for the update, really appreciate it. Let me know if I can help in any way.

@dsnet
Copy link
Member

dsnet commented Jun 20, 2019

Please take a look: https://go-review.googlesource.com/c/protobuf/+/183157

@dsnet dsnet added this to the APIv2 release milestone Jul 7, 2019
@dsnet dsnet added the blocks-v2 label Jul 7, 2019
@dsnet dsnet removed this from the APIv2 release milestone Jul 7, 2019
@dsnet dsnet removed the api-v2 label Jul 10, 2019
@dsnet
Copy link
Member

dsnet commented Jul 13, 2019

Bare minimum API for preserving SourceLocations is submitted. No convenience methods for looking up locations by path were added. I filed #899 for that.

@dsnet dsnet closed this as completed Jul 13, 2019
@golang golang locked as resolved and limited conversation to collaborators Jun 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants