Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extend exported proto data structures for simple extension information #50

Merged

Conversation

jvanstraten
Copy link
Collaborator

@jvanstraten jvanstraten commented Sep 8, 2022

The idea here is mostly to define a nice output data structure as an interface to work toward, rather than leaving this until the end and then requiring lots of refactoring for the data structure. There are some new concepts in here as well that don't exist in Substrait yet (such as already supporting substrait-io/substrait#265, substrait-io/substrait#320, various new description fields that were randomly missing from the YAML schema, etc) to future-proof somewhat.

Some semantic changes to patterns snuck in here as well. Specifically, bindings now always bind the non-nullable version of typename metavalues, and only match non-nullable by default. This makes their behavior more consistent with typenames like i32 (which means non-nullable, not either nullability).

BREAKING CHANGE: validator.proto has been split up into three files. Some existing fields relating to extensions have now also been deprecated.

…rmation

BREAKING CHANGE: validator.proto has been split up into three files. Some
existing fields relating to extensions have now also been deprecated.

// All extensions defined in the type class namespace of this extension. All
// names are case-insensitively unique.
repeated ExtensionDefinition type_classes = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making this (and the same for the other fields) repeated TypeClass where message TypeClass { ExtensionMetaData metadata, ... }, and message ExtensionMetaData { string uri = 1; repeated string names = 2; ... }? This would make consuming these messages more type safe: no need to match on the kind (it should match but there is no guarantee by the given structure).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored, but it kinda rippled all over the place thanks to there being different function extension types.

Comment on lines 55 to 58
// The set of names that may be used to case-insensitively refer to this
// extension within the scope of the above URI. For type classes and type
// variations there will only ever be one of these. For functions, the first
// name is always the compound name. The simple name will only be added when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the names are different for the different variants, maybe we should just copy-paste the relevant fields instead of introducing a metadata message type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, those three files together very clearly represent identifying information for extensions. It's annoying that Substrait makes that exception for function naming that a simple name can sometimes also be used, but I'd rather generalize that special case to all the other extension types as well in case more weirdness ends up being added later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer using the proto definition to guide based on type instead of having to deal with these semantics when consumed, but I understand how this would blow up if applied everywhere.


// Any resolved extensions defined by dependencies of this module. Note that
// these are not publicly exposed by this extension.
repeated ExtensionDefinition dependencies = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one confuses me. Can you elaborate a bit on this one?

Copy link
Collaborator Author

@jvanstraten jvanstraten Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extensions should (eventually) have support for dependencies, primarily to allow one YAML with function definitions to use types defined in another YAML. Those type definitions wouldn't be visible directly, since we use URIs to refer to things, so they don't get added to the normal list of type definitions, but they do need to go somewhere such that the data types and patterns eventually used in that function definition can refer to them via the extension_id thing. Everything like that, i.e. not publicly exposed by the extension YAML being represented here, insofar as the validator has resolved them, would end up in that list.

ETA: example of a thing that might refer to something in that list:

// Information about a user-defined type.
message UserDefinedType {
// URI of the YAML file that the type is (supposed to be) defined in, if
// known.
string uri = 1;
// Name of the type within the scope of that YAML file.
string name = 2;
// If nonzero, points to a variation extension definition elsewhere in the
// tree. All extension definitions can be gathered by traversing the tree
// and looking for ExtensionDefinition messages in the data associated with
// each node. Note that extension IDs are only unique within a single tree.
uint64 extension_id = 4;

(somewhere nested deep inside a data type pattern)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I now remember about the max_uri_resolution_depth field in Config.

Comment on lines +94 to +98
// Identifier for the extension.
Identifier identifier = 1;

// Common metadata for the extension.
Metadata metadata = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I think this is an improvement.

@jvanstraten jvanstraten merged commit 7e80632 into substrait-io:main Sep 21, 2022
@jvanstraten jvanstraten deleted the refactor-extend-output-protos branch September 21, 2022 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants