sql: Add protobuf builtins and column type #47534

Open

dt opened this issue Apr 15, 2020 · 17 comments

Labels
A-sql-builtins SQL built-in functions and semantics thereof. A-sql-datatypes SQL column types usable in table descriptors. E-starter Might be suitable for a starter project for new employees or team members. good first issue

Comments

@dt
Member

dt commented Apr 15, 2020

We currently store encoded protobuf messages in bytes columns in many cases, as they neatly handle collecting and encoding system-specific fields. However, this currently has the downside of making that column totally opaque to SQL interactions -- in order to manually read a given field, or even harder, manipulate one, we end up copy-pasting those bytes out to protoc or custom-written one-off programs. This complicates debugging during incidents (when we often want to peek into system tables which contain encoded protos, sometimes in ways that don't have pre-written debugging tools to expose them) and is just generally a source of friction.

Instead, it would be nice if cockroach could natively decode, and ideally also encode, proto messages stored in bytes columns. Obviously this requires that cockroach have the message descriptor.

One quick win here could be a pair of builtin functions, proto_decode(msg bytes, desc bytes) -> jsonb and proto_encode(msg json, desc bytes) -> bytes, which translate bytes to/from JSON using the supplied message descriptor.

Even nicer though would be to have a native proto column type, which is parameterized with a supplied message descriptor. With this persisted in the TableDescriptor, the column could then be rendered as JSON natively in SQL transparently, without the cruft of the builtins or constantly re-supplying the message descriptor.

Ideally we'd also extend our ALTER .. SET TYPE support to include updating the persisted message descriptor, and maybe even allow upgrading a bytes column to/from a proto column in place (since this would just be changing the type and desc in the metadata, as the on-disk data is the same either way).

The builtin approach is likely easier, and even complementary to the native type, as it lays the groundwork and provides a compatibility bridge for those, like us, with existing bytes columns.
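To make the two approaches concrete, here is a rough sketch of how they might look from an application. Everything in it is hypothetical: the proto_decode builtin, the PROTO(...) column syntax, and the table and message names are illustrative of the proposal only; none of this exists today.

```go
// Hypothetical sketch only: the proto_decode builtin and a parameterized
// PROTO column type do not exist; names and syntax just illustrate the
// proposal above.
package protosketch

import "database/sql"

func sketch(db *sql.DB, descBytes []byte) error {
	// Builtin approach: the message descriptor is supplied on every call.
	rows, err := db.Query(
		`SELECT proto_decode(details, $1)->'category' FROM events`, descBytes)
	if err != nil {
		return err
	}
	defer rows.Close()

	// Native-type approach: the descriptor would live in the TableDescriptor,
	// so the column could be rendered as JSON without re-supplying it.
	_, err = db.Exec(
		`CREATE TABLE events2 (id INT PRIMARY KEY, details PROTO('cockroach.example.EventDetail'))`)
	return err
}
```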

Jira issue: CRDB-4401

@dt dt added good first issue E-starter Might be suitable for a starter project for new employees or team members. A-sql-builtins SQL built-in functions and semantics thereof. A-sql-datatypes SQL column types usable in table descriptors. labels Apr 15, 2020
@dt dt changed the title sql: Add protobuf column type sql: Add protobuf builtins and column type Apr 15, 2020
@lancerutkin
Contributor

Wouldn't technically be my first issue, but I'd like to try this if that's ok. Will probably take the builtin+native type approach as I've spent time in builtins before. Looks like for the proto type, I'd be working in pkg/sql/types/types.go, with new tests going in types_test.go in that same directory. Let me know if I'm off-base on that.

If you know of any particular existing types off of which I might want to model my work, that'd be extremely helpful.

@otan
Contributor

otan commented Apr 18, 2020

I'd suggest avoiding adding a new type initially, as it breaks a lot of tests that assume the type can be encoded to, decoded from, dumped, printed, etc. You can follow along these PRs (#47171, #42481) to see it's a little... much. Maybe add it later as an extension (but it probably deserves an RFC, or at least a one-pager describing format, encoding, display, dumping, etc.).

For the builtin approach, proto_decode and proto_encode shouldn't need a new type (at least initially) - you can just rely on it being bytes. It should probably just take in two byte arrays, figure out the descriptor from the descriptor arg, and do the decoding/encoding from there. I think a sample query may be helpful to make sure we understand it (and for something to test against).

@dt
Member Author

dt commented Apr 18, 2020

As far as a test case, the one that made me wish we had this would be something like:

select proto_decode(descriptor, '\x...') from system.descriptor;

where '\x...' is the bytes for the (proto message) descriptor for our cockroach.sql.sqlbase.Descriptor message (in structured.proto).

@dt
Member Author

dt commented Apr 18, 2020

Actually, after refreshing my memory of how the proto descriptor stuff works, I guess we might need the whole encoded FileDescriptorSet, not just the one message descriptor, since it could embed other messages? I guess that would mean we'd need another arg as well -- the name of the message type within that set? I haven't played with this recently enough to be sure though.
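A minimal sketch of that resolution step, assuming the builtin were handed a serialized FileDescriptorSet plus a fully-qualified message name, and using the google.golang.org/protobuf packages; the function and argument names are illustrative only, not an existing CockroachDB builtin.

```go
// Sketch: resolve a message descriptor from a serialized FileDescriptorSet
// and a fully-qualified message name. Illustrative only.
package protosketch

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
)

func lookupMessage(fdSetBytes []byte, fullName string) (protoreflect.MessageDescriptor, error) {
	var fdSet descriptorpb.FileDescriptorSet
	if err := proto.Unmarshal(fdSetBytes, &fdSet); err != nil {
		return nil, err
	}
	// Register the whole set so that imported and embedded messages resolve.
	files, err := protodesc.NewFiles(&fdSet)
	if err != nil {
		return nil, err
	}
	d, err := files.FindDescriptorByName(protoreflect.FullName(fullName))
	if err != nil {
		return nil, err
	}
	md, ok := d.(protoreflect.MessageDescriptor)
	if !ok {
		return nil, fmt.Errorf("%s is not a message type", fullName)
	}
	return md, nil
}
```

This matches the intuition above: a single message descriptor is not enough once a message references types from other files, so the whole set plus a name is the natural pair of inputs.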

@petermattis
Collaborator

Where do the desc bytes for the proto_decode() builtin come from? Unless specifying desc bytes is easy, the proto_decode() builtin doesn't seem like a large improvement over the current situation. Am I missing something here?

Adding a PROTO type seems like a much nicer long term solution. We're laying the groundwork for custom types in 20.2 with Enum. I suppose we should take care to make that support general.

@dt
Member Author

dt commented Apr 19, 2020

Yeah, I agree the native type sounds much better, but the built-ins seem like a good place to start? I think the built-in wouldn't be that bad when used in an app, if that app has that proto linked in? I think it'd look something like:

eventProto, _ := EventDetail{}.Descriptor()
...
res, err := db.Query(
  `SELECT ... FROM events WHERE proto_decode(details, $1)->'category' = ...`, eventProto,
)

Interactive shell would be a bit clunkier, but even then, I could still see it being better than nothing: I can ask protoc to dump the desc once and then paste that in a query like the above if I'm in a bind, which is still better than, say, having to go build a custom binary or something.

You could also imagine making a table proto_descs mapping name -> desc, then using a subquery to avoid going back to protoc and pasting repeatedly, e.g. proto_decode(details, (SELECT desc FROM proto_descs WHERE name = 'EventDetail')) if you wanted.

But yeah, I think the ideal is a native column type that is parameterized with the file desc set and expected message name so you can set once and then just interact with it seamlessly.
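A sketch of that lookup-table pattern follows; proto_decode remains hypothetical, and the table and column names are illustrative (the descriptor column is spelled out to avoid the reserved keyword DESC).

```go
// Sketch of the proto_descs lookup-table idea above. proto_decode is
// hypothetical; table and column names are illustrative only.
package protosketch

import "database/sql"

func decodeViaLookup(db *sql.DB) (*sql.Rows, error) {
	// One-time setup: store message descriptors by name.
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS proto_descs (name STRING PRIMARY KEY, descriptor BYTES)`); err != nil {
		return nil, err
	}
	// Later queries fetch the descriptor via a subquery instead of pasting it.
	return db.Query(`
		SELECT proto_decode(
			details,
			(SELECT descriptor FROM proto_descs WHERE name = 'EventDetail')
		)->'category'
		FROM events`)
}
```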

@thoszhang
Contributor

Summarizing some discussions elsewhere: It would be good if this got us closer to building an index on the descriptor type in system.descriptor, to avoid scanning the entire table when, e.g., we only need to fetch user-defined types (as noted in the RFC).

@lancerutkin
Contributor

Sorry for the delay in updating, but I don't think I'll be able to get to this in a timely fashion. In case anyone else was looking to jump in, I don't want to block them.

craig bot pushed a commit that referenced this issue Aug 1, 2020
52169: sql: Implement crdb_internal.pb_to_json function r=miretskiy a=miretskiy

Informs #47534

The `crdb_internal.pb_to_json` function can convert any protocol
message linked into our binary to a JSONB object.

The function takes 2 arguments: the fully qualified name of the
protocol message, and the encoded bytes for that message.
The returned JSONB object can then be queried using regular JSONB
primitives.

```
SELECT  crdb_internal.pb_to_json('cockroach.sql.sqlbase.Descriptor', descriptor)->'Union'->'Table'
FROM system.descriptor where id=15;
```

Release Notes: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
@cpustejovsky
Contributor

To make sure I understand the issue as it currently stands: PR #52169 created the builtin functions, but cockroach still cannot natively decode and encode proto messages stored in bytes columns given a user-supplied message descriptor, and there is no native proto column type.

This would be the next step in tackling this issue, correct?

@dt
Member Author

dt commented Dec 14, 2022

The current builtin functions are internal-use only: they only allow de/encoding bytes from/to proto message types that are directly compiled into CockroachDB itself, where we use protobuf messages defined in the CockroachDB codebase to store things in system tables like jobs, schema descriptors, etc.

I think this feature request is more about allowing a user-supplied message descriptor to be used to marshal or unmarshal messages for user-supplied message types, not just the built-in descriptors / message types. Once you have that facility, you could go further and persist a user-supplied message descriptor somewhere, either where it can then be used by name, or potentially in a table schema alongside a bytes column or in a user-defined type, so that columns of that type get the automatic ser/de treatment?

Either way, we haven't yet added functions that allow interacting with anything other than pre-defined system-internal proto message types.

@cpustejovsky
Contributor

@dt I've made some progress on this and have the results of it in a separate repository https://github.com/cpustejovsky/filedescriptorjson

It is currently in a separate repository because I wasn't sure how to get a test FileDescriptor to use for the unit test from inside the cockroach repo, since most of this functionality comes from the golang/protobuf library and isn't available within the cockroach repository.

It seems like using a FileDescriptor or FileDescriptorSet is ideal, as a bare MessageDescriptor won't always have sufficient context (here is a relevant conversation on the Gopher slack explaining why).

It should be possible to use the new google/protobuf package to do this and return a jsonb.JSON type. I think all I would need to test this is to add a generated pb.go file that has the FileDescriptor property, as seen here, and use that for the unit test. Is that something I could add in a PR in addition to the functionality and unit tests?
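For reference, a minimal sketch of the decode-to-JSON step described here, using google.golang.org/protobuf's dynamicpb and protojson. It assumes a protoreflect.MessageDescriptor has already been resolved from a user-supplied FileDescriptorSet (as sketched earlier in the thread); it is not the code in the linked repository or PR.

```go
// Sketch: unmarshal raw proto bytes into a dynamic message and render it as
// JSON. Assumes the MessageDescriptor was resolved from a user-supplied
// FileDescriptorSet; not the implementation in the linked PR.
package protosketch

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/dynamicpb"
)

func messageToJSON(md protoreflect.MessageDescriptor, msgBytes []byte) ([]byte, error) {
	// Build a dynamic message shaped by the descriptor and fill it from the bytes.
	msg := dynamicpb.NewMessage(md)
	if err := proto.Unmarshal(msgBytes, msg); err != nil {
		return nil, err
	}
	// protojson produces canonical proto3 JSON, which could then be parsed
	// into a jsonb value.
	return protojson.Marshal(msg)
}
```

This kind of dynamic decoding is only possible with the newer google.golang.org/protobuf API; gogo/protobuf has no equivalent support, which is why the approach below imports the newer package under an alias.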

cpustejovsky added a commit to cpustejovsky/cockroach that referenced this issue Dec 14, 2023
Part of cockroachdb#47534

Cockroachdb cannot currently marshal a protobuf binary into JSON based on a provided FileDescriptor.

The newest version of the `google.golang.org/protobuf` package allows for this functionality. gogo/protobuf does not.

Since cockroachdb uses gogo/protobuf for its better performance, I imported `google.golang.org/protobuf` under an alias.

Using that library, I created `NewJSONMessageFromFileDescriptor` to take a protobuf binary and a FileDescriptor and return JSON.

To handle linting errors, I created temporary wrapper functions `TODOMarshal` and `TODOUnmarshal`; these will need to be renamed if this workaround is acceptable.

Release note: none
@cpustejovsky
Contributor

@dt I've written a PR (#116262) to address the first step.

I sorted out the problems I was having earlier around testing, but to handle linting errors, I created wrapper functions TODOMarshal and TODOUnmarshal inside pkg/util/protoutil/marshal.go

I marked them as TODO because I wasn't sure if this solution was acceptable and, if it is, what the names should be to properly distinguish them from the Marshal and Unmarshal functions that rely on gogo.

@dt
Member Author

dt commented Dec 18, 2023

Exciting stuff @cpustejovsky !

Let me ask around internally to see if I can find someone closer to this area of the codebase to work with you on answering some of those questions.

@cpustejovsky
Contributor

@dt were you able to find someone?

@dt
Member Author

dt commented Feb 7, 2024

Hi @cpustejovsky, I'm so sorry this slipped off my queue while I was out for a couple weeks over the holidays and I dropped the ball here. Let me ask around again and get back to you, and sorry again about the delay!

@cpustejovsky
Contributor

That's what I figured! No worries

@cpustejovsky
Contributor

@dt It looks like there won't be updates to this issue any time soon based on this comment:

Hi @cpustejovsky, we appreciate your contribution and the time you've invested in creating this PR. Unfortunately, our DB team is facing a high workload at the moment and we likely won't be able to review your PR anytime soon. My apologies for the inconvenience. We will prioritize this for review once we have more bandwidth available.

If you have any questions or concerns in the meantime, please feel free to reach out. We appreciate your understanding and continued collaboration.

cpustejovsky added a commit to cpustejovsky/cockroach that referenced this issue Jun 18, 2024
Informs: cockroachdb#47534

Cockroachdb cannot currently marshal a protobuf binary into JSON based on a provided FileDescriptor.

The newest version of the `google.golang.org/protobuf` package allows for this functionality. gogo/protobuf does not.

Since cockroachdb uses gogo/protobuf for its better performance, I imported `google.golang.org/protobuf` under an alias.

Using that library, I created `NewJSONMessageFromFileDescriptor` to take a protobuf binary and a FileDescriptor and return JSON.

To handle linting errors, I created temporary wrapper functions `TODOMarshal` and `TODOUnmarshal`; these will need to be renamed if this workaround is acceptable.

Release note: none
cpustejovsky added a commit to cpustejovsky/cockroach that referenced this issue Jun 18, 2024
Informs: cockroachdb#47534

Cockroachdb cannot currently marshal a protobuf binary into JSON based on a provided FileDescriptor.

The newest version of the `google.golang.org/protobuf` package allows for this functionality. gogo/protobuf does not.

Since cockroachdb uses gogo/protobuf for its better performance, I imported `google.golang.org/protobuf` under an alias.

Using that library, I created `NewJSONMessageFromFileDescriptor` to take a protobuf binary and a FileDescriptor and return JSON.

To handle linting errors, I created temporary functions `TODOMarshal` and `TODOUnmarshal` that wrap the newer protobuf package; these may need to be renamed if this workaround is acceptable.

Release note: none
XiaochenCui added a commit to XiaochenCui/cockroach that referenced this issue Jul 21, 2024
With the built-in "protobuf" type, protobuf data stored in the
database can be rendered in a human-readable format directly.
It can also be queried and indexed directly.

The original issue is cockroachdb#47534.

Fixes: None

Release note: None