Recursive and union types #125

esammer · 2022-01-08T22:01:45Z

Hey all. We're evaluating Substrait for plan representation at Decodable within a stream processing engine. We frequently deal with the initial structuring of data from complex sources. This means we have to model complicated cases within the type system. A few questions.

Recursive types
How do folks intend to represent (infinitely) recursive type definitions in the type system? We frequently deal with progressive specification of a type in a pipeline, where a record begins with a fully generic type (e.g. Value, which is a union of all scalar types as well as list, map<string, Value>) and becomes incrementally more well-defined as it passes through the stages of a pipeline of operators.
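To make the question concrete, here is a minimal sketch of such a recursive Value type in Python. It is an illustration only, not anything Substrait defines: the names Value, is_value and refine_to_i32 are hypothetical, and the choice of scalar types is an assumption.

```python
from typing import Dict, List, Union

# Hypothetical "fully generic" type: a union of scalars plus lists and
# string-keyed maps whose elements are themselves Values.
Value = Union[None, bool, int, float, str, List["Value"], Dict[str, "Value"]]


def is_value(obj: object) -> bool:
    """Return True if obj fits the recursive Value definition."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return True
    if isinstance(obj, list):
        return all(is_value(item) for item in obj)
    if isinstance(obj, dict):
        return all(isinstance(k, str) and is_value(v) for k, v in obj.items())
    return False


def refine_to_i32(value: Value) -> int:
    """A later pipeline stage narrowing a generic Value to a concrete type."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected an integer, got {type(value).__name__}")
    if not -2**31 <= value < 2**31:
        raise ValueError("integer does not fit in i32")
    return value
```

The definition refers to itself, so there is no finite expansion of it into non-recursive types; that is the property a plan format would need some way to express.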

Union types
I read the rationale for ignoring unions, but I think they're deeply important for complex cases as mentioned above. We could (and do) use column-like structures like struct<[][]a, [][]b, [][]c, ...> that are indexed by field position, but asking users to think about queries in this way is really difficult. We could represent things this way within the engine and just provide syntactic sugar in the language, but it's really wasteful and complex. What are folks' thoughts on this?

A real-world example of these two issues is a source that starts with JSON records with heterogeneous structures and uses a series of filters and projections to structure that data based on its contents, e.g.:

-- Parse an arbitrary json blob from some source.
create temporary view json as select json_parse(value) as data from x;

-- HTTP records in two different formats. Normalize them and output.

insert into http_events
select data['req']['path'], data['req']['response_code'], ... -- path is a string, response_code is an i32.
from json
where
  data contains 'request' and
  data['req']['url'] is not null and
  data['req']['proto'] in ('http', 'https');

insert into http_events
select data['path'], data['status'], ... -- different names for the same data as above.
from json
where data['type'] = 'http';

-- Traces also appear in data from `json`. Send it to a separate output.
insert into my_app_traces
select data['span_id'], data['request_id'], ...
from json
where data['source'] in ( 'my-app1', 'my-app2' );

jacques-n · 2022-02-01T19:42:58Z

Maintainer

Ugh, I really need to figure out how to get these discussions to alert somewhere. Anybody know how to do this? Hey @esammer, welcome. Sorry that nobody responded sooner.

@cpcloud, @rdblue, @westonpace drawing your attention to this discussion.

I'm going to answer your questions in the reverse of the order you asked them, as I wonder whether one can build on the other. I'll also start by noting that neither of these is something I've personally thought a lot about.

Unions:
I actually worked with unions extensively in another life. They are not my favorite thing :)

I think there are four types of patterns to consider:

| | Leaf divergence (only different types can be union branches) | Arbitrary divergence (the same type can occur multiple times in a union) |
| --- | --- | --- |
| Explicit referencing | SQL JSON functions are roughly in this category. You reference fields based on their position first, but you also have to coerce to a consistent type before use. | Hive, Avro and Arrow generally operate in this world. You have to choose a branch explicitly before doing something. |
| Implicit referencing | Drill (mixed type) and Snowflake (variant) operate in this world. In Drill, we tried to do the most complex thing: allow direct use, such as a case statement which switches on type and uses type-specific functions with implicit type coercion. In Snowflake they did something much cleaner, not allowing users to interact with the data without casting. | I'm not aware of a system that operates here. It probably doesn't make sense (since how would you implicitly reference something in union<float, float>?). |

For reference (not sure if you saw this), a Substrait struct type does not carry field names. As part of implementing a clean, consistent set of semantics, structs are entirely positional. As such, a struct declaration might look like struct<i8, struct<i8,i16>, struct<string,string>>. If you think about the explicit referencing pattern around unions, e.g. Hive's and Arrow's, you typically reference a particular union branch by either position or name (I think Avro uses names??). In the context of Substrait, this really means that structs look very much like what other tools treat as unions. So in the case of explicit referencing, I'm not sure I see any difference between the two options.
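As a rough sketch of that equivalence, assume a positional struct with one nullable slot per union branch and exactly one slot populated at a time. The helper names make_branch and active_branch are hypothetical and are not part of any Substrait API.

```python
from typing import Any, Sequence, Tuple


def make_branch(num_branches: int, position: int, value: Any) -> tuple:
    """Build a struct with `value` in slot `position` and None elsewhere."""
    if not 0 <= position < num_branches:
        raise IndexError("branch position out of range")
    slots = [None] * num_branches
    slots[position] = value
    return tuple(slots)


def active_branch(struct: Sequence[Any]) -> Tuple[int, Any]:
    """Return (position, value) of the single populated slot."""
    populated = [(i, v) for i, v in enumerate(struct) if v is not None]
    if len(populated) != 1:
        raise ValueError("expected exactly one populated branch")
    return populated[0]
```

Because branches are addressed by position rather than by type, this encoding also covers the arbitrary-divergence case: a union<float, float> is simply a two-slot struct whose slots happen to share a type.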

In implicit referencing, I generally think that the only viable plan is doing something like a variant, which isn't coercible to other types without an explicit cast operation (or a similar SQL JSON-style function that declares an explicit type). In those situations, it also seems like the struct representation is sufficient, potentially along with a specialized function that accepts an arbitrary struct and resolves based on a particular set of definitions.

So, I guess the question is what specifically feels "wasteful and complex" as you put it?

Recursive types:
We do support recursive types today, but only with pre-defined types. It seems like what we need to add, to let you implement the pattern the way you propose, is the ability to create compound user-defined types (something we don't have today). This has come up several times and is definitely needed. I believe @cpcloud mentioned it recently. I'd love to see someone come up with a proposal for how to express this as a structured definition, as an extension of the current YAML design for extension types. Then you could define megaunion => struct<i8,i64,... ,megaunion> and do things recursively.
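Purely as a thought experiment, a structured definition of this kind could be modelled as a table of named types whose fields may refer back to declared names. Everything in the sketch below is hypothetical: the declaration format, the helper names, and the concrete field list for megaunion (the original leaves the middle fields elided, so string is only a stand-in).

```python
from typing import Dict, List, Set

# Hypothetical declarations: each user-defined name maps to the positional
# field types of a struct. "megaunion" refers to itself, as in the text.
BUILTINS: Set[str] = {"i8", "i64", "string"}
USER_TYPES: Dict[str, List[str]] = {
    "megaunion": ["i8", "i64", "string", "megaunion"],
}


def check_references(user_types: Dict[str, List[str]], builtins: Set[str]) -> None:
    """Verify every field names a builtin or a declared type; cycles are allowed."""
    for name, fields in user_types.items():
        for field in fields:
            if field not in builtins and field not in user_types:
                raise KeyError(f"{name} refers to unknown type {field!r}")


def expand(name: str, user_types: Dict[str, List[str]], depth: int) -> str:
    """Render a type, unfolding user-defined names at most `depth` times."""
    if name not in user_types or depth == 0:
        return name
    inner = ",".join(expand(f, user_types, depth - 1) for f in user_types[name])
    return f"struct<{inner}>"
```

Resolving references by name and bounding the unfolding depth is one way to keep a self-referential definition finite when it is written down; an actual proposal might handle this quite differently.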

I also think it may be possible in your particular use case to think of this more as a new extension type you simply name "variant", similar to Snowflake. You could then, through function invocations or similar, ultimately turn this into a known type?
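A minimal sketch of that idea, assuming an opaque variant that can be navigated but not used until a function call casts it explicitly. The Variant class and its methods are hypothetical and are not modelled on Snowflake's or Substrait's actual interfaces.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Variant:
    """Hypothetical opaque value: unusable until explicitly cast."""
    raw: Any

    def get(self, key: str) -> "Variant":
        """Navigate into a map-like payload; missing keys yield a null Variant."""
        if isinstance(self.raw, dict):
            return Variant(self.raw.get(key))
        return Variant(None)

    def as_string(self) -> str:
        """Explicit cast to a string; fails if the payload is anything else."""
        if not isinstance(self.raw, str):
            raise TypeError("variant does not hold a string")
        return self.raw

    def as_i32(self) -> int:
        """Explicit cast to a 32-bit integer; fails on other types or overflow."""
        if isinstance(self.raw, bool) or not isinstance(self.raw, int):
            raise TypeError("variant does not hold an integer")
        if not -2**31 <= self.raw < 2**31:
            raise ValueError("integer does not fit in i32")
        return self.raw
```

With this shape, an expression like data['req']['path'] from the original question would correspond to Variant(...).get("req").get("path").as_string(), so the output type is always stated at the cast rather than inferred.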

Don't perceive any of this as a hard pushback on union types, more just trying to get to the root of the problems and figure out the right solution.
