feat: add support for DDL and INSERT/DELETE/UPDATE operations #252

curino · 2022-07-20T14:54:53Z

In collaboration with Jesus Camacho Rodriguez

curino · 2022-07-20T14:57:42Z

Cleaned up the git commit log (I spent most of my time fighting this linter!).

curino · 2022-07-26T21:34:44Z

@jacques-n and others, how does this version look?

jacques-n · 2022-07-28T16:23:53Z

Hey @curino, this is on my list to review. Will try to get to it soon.

One thing I still don't understand is the output item in writerel. Can you give a concrete example in pseudo code/description?

curino · 2022-07-28T22:29:37Z

Thanks @jacques-n!

Most DBMSs return information about how the insert/delete/update operation went. This takes a few forms:

- No output (default)
- A simple count of how many tuples were affected by the operation
- Returning the set of tuples deleted/inserted
- Returning the before (or after, or both) image of an update
- An arbitrary computation over the affected tuples

From reading a few doc pages from various systems, I learned that this simplifies the common case where apps want to double-check how an operation went (e.g., validate that roughly the right number of tuples were affected) and potentially abort/compensate if things were off. This would otherwise require temp tables and complex multi-table transactions, so it is a rather appreciated/used feature.

Including an output term allows us to capture any of the above scenarios in a compact way. Basically, we are telling the system to use the input query to determine which tuples to change, and the output to report on what was done, per the semantics the user requires. In some systems there are special keywords the output query can refer to, such as BEFORE and AFTER (think of those as tables capturing the before and after images of the affected tuples). I think it is OK to expect that the system generating the Substrait plan produces valid pairs of input and output Rels (some validation is possible, but not very deep).

I considered simplifying this by hardcoding the "# of affected tuples" semantics as the return value, but that seems to miss what I read was a rather common use case. I hope this helps clarify the rationale (happy to chat in the community call if that is easier).

Thanks,
Carlo
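The output forms listed in the comment above can be pictured with a small in-memory model. This is only an illustrative sketch, not Substrait code: `apply_update`, `WriteResult`, and the sample table are hypothetical names invented here, and a real engine would express the same thing as a plan rather than as Python.

```python
from dataclasses import dataclass


@dataclass
class WriteResult:
    before: list  # before-image of the affected tuples (hypothetical structure)
    after: list   # after-image of the affected tuples


def apply_update(table, predicate, change):
    """Update matching rows in place and record both images."""
    before, after = [], []
    for i, row in enumerate(table):
        if predicate(row):
            before.append(dict(row))
            table[i] = {**row, **change(row)}
            after.append(dict(table[i]))
    return WriteResult(before, after)


table = [
    {"id": 1, "qty": 5},
    {"id": 2, "qty": 0},
    {"id": 3, "qty": 7},
]
result = apply_update(
    table,
    lambda r: r["qty"] > 0,
    lambda r: {"qty": r["qty"] - 1},
)

no_output = None                                  # no output (default)
affected_count = len(result.after)                # count of affected tuples
returned_rows = result.after                      # the modified tuples themselves
images = list(zip(result.before, result.after))   # before and after images
total_removed = sum(b["qty"] - a["qty"] for b, a in images)  # arbitrary computation
```

Every form is derived from the same two images, which is roughly why a single output term can cover all of them.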

curino · 2022-08-03T18:23:56Z

Per the conversation in the latest community meeting, I modified the output mechanism to leverage a spool (via references) for the BEFORE image, plus a simple enum (NO_OUTPUT or MODIFIED_TUPLES) that controls the operator output. That output can then be processed further (e.g., with a COUNT) to shape the AFTER image of the changes.

I also realized that the original docs contained references to a bunch of other things (rotations, etc.) that are not currently represented (or at least I am not clear on what they mean). I would propose we converge on this patch, and then do further iterations to reflect further/fancier things (e.g., we left constraints and indexes out of table definitions as well).
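A rough way to picture the revised design: the write operator emits either nothing or the modified tuples, and anything fancier is layered on top. The sketch below is an assumption-laden model, not the proto definition. The enum member names come from the comment above, but their numeric values, the `update_rows` helper, and the tuple layout are hypothetical.

```python
from enum import Enum


class OutputMode(Enum):
    # Member names follow the comment; the numeric values are made up here.
    NO_OUTPUT = 0
    MODIFIED_TUPLES = 1


def update_rows(table, keys, new_qty, mode):
    """Set qty for rows whose id is in keys; emit modified tuples if requested."""
    modified = []
    for i, (row_id, _qty) in enumerate(table):
        if row_id in keys:
            table[i] = (row_id, new_qty)
            modified.append(table[i])
    return modified if mode is OutputMode.MODIFIED_TUPLES else []


table = [(1, 5), (2, 0), (3, 7)]

# Spool: the input is materialized once so that several consumers can
# reference it. Here it doubles as the BEFORE image.
before_image = [row for row in table if row[1] > 0]
keys = {row_id for row_id, _ in before_image}

after_image = update_rows(table, keys, 0, OutputMode.MODIFIED_TUPLES)
affected = len(after_image)  # a COUNT placed on top of the operator output

silent = update_rows([(9, 1)], {9}, 0, OutputMode.NO_OUTPUT)
```

With this split the operator itself stays simple, and the count or any other shaping becomes an ordinary downstream relation.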

curino · 2022-08-08T15:18:06Z

@jacques-n ping...

jacques-n · 2022-08-08T22:06:04Z

proto/substrait/algebra.proto

+    DDL_OP_ALTER_TABLE = 3;
+
+    DDL_OP_CREATE_VIEW = 4;
+    DDL_OP_CREATE_OR_REPLACE = 5;


This doesn't feel like an operation. It feels like a subcategory of the other operation types.

We could have the DDL_OP being TABLE or VIEW and then have another field capturing CREATE, ALTER, DROP (where VIEW + ALTER is equivalent to the CREATE_OR_REPLACE we have now)
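To make the suggested split concrete, here is a small sketch that maps the flat enum names from the diff onto an (object, operation) pair. The `DdlObject` and `DdlOp` names and their values are hypothetical; only the flat `DDL_OP_*` names and the "VIEW + ALTER is equivalent to CREATE_OR_REPLACE" reading come from this thread.

```python
from enum import Enum
from itertools import product


class DdlObject(Enum):  # hypothetical name for the "TABLE or VIEW" field
    TABLE = 1
    VIEW = 2


class DdlOp(Enum):  # hypothetical name for the "CREATE, ALTER, DROP" field
    CREATE = 1
    ALTER = 2
    DROP = 3


# Flat enum entries from the diff, rewritten as (object, operation) pairs.
FLAT_TO_SPLIT = {
    "DDL_OP_CREATE_TABLE": (DdlObject.TABLE, DdlOp.CREATE),
    "DDL_OP_DROP_TABLE": (DdlObject.TABLE, DdlOp.DROP),
    "DDL_OP_ALTER_TABLE": (DdlObject.TABLE, DdlOp.ALTER),
    "DDL_OP_CREATE_VIEW": (DdlObject.VIEW, DdlOp.CREATE),
    "DDL_OP_CREATE_OR_REPLACE": (DdlObject.VIEW, DdlOp.ALTER),
}

# Combinations the two-field form can express but the flat enum cannot.
uncovered = set(product(DdlObject, DdlOp)) - set(FLAT_TO_SPLIT.values())
```

In this model the only uncovered pair is (VIEW, DROP), so the two-field form also gains a way to drop a view without adding another enum entry.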

jacques-n · 2022-08-08T22:44:09Z

proto/substrait/algebra.proto

+  // Definition of which type of write operation is to be performed
+  oneof write_type {
+    NamedTableWrite named_table = 1;
+    ReadRel.LocalFiles local_files = 2;


I'm inclined to just include named table and extension table right now. (With the latter being completely independent of the read type.)

Amongst other things, I'm not convinced that a read spec for files is the same as a write spec.

We could include tables only for now, but eventually we want WriteRel to write to file as well I think.

jacques-n · 2022-08-08T22:46:03Z

proto/substrait/algebra.proto

+  // The columns that will be modified (representing after-image of a schema change)
+  NamedStruct base_schema = 2;
+  // The default values for the columns (representing after-image of a schema change)
+  repeated Expression.Literal defaults = 3;


This feels like it doesn't belong here. It isn't really specific to a named table, right? Maybe it should be at the write rel level?

As noted, if we move ctas, this can also exist inside of writerel. Note that we should specifically declare patterns around the requirement of including this (or not) and the behavior of it. I'd also note that after-image isn't clear here.

Even if we move CTAS we need schema (names+types) and defaults for CREATE_TABLE right?

jacques-n · 2022-08-08T22:49:23Z

proto/substrait/algebra.proto

+    DDL_OP_UNSPECIFIED = 0;
+    DDL_OP_CREATE_TABLE = 1;
+    DDL_OP_DROP_TABLE = 2;
+    DDL_OP_ALTER_TABLE = 3;


I suggest we move CTAS to WriteRel. This removes the write_type stuff entirely from this and keeps it only in WriteRel. (This may mean we need to come up with a different name than DDL since that generally includes CTAS.)

Separately, I think we should probably remove CREATE TABLE and ALTER TABLE since there isn't sufficient information here to actually do them. (We need schema, etc.)

If we keep the write_type it contains a NamedTableWrite, which includes all we need for CREATE or ALTER TABLE I think (table name, column names, column types, default values).

jacques-n · 2022-08-08T22:49:42Z

proto/substrait/algebra.proto

+// A base table for writing. The list of string is used to represent namespacing (e.g., mydb.mytable).
+// This assumes shared catalog between systems exchanging a message.
+// it also includes a base schema, and default values
+message NamedTableWrite {


As mentioned elsewhere, let's move this into WriteRel and keep it constrained to that location.

I think this prevents us from doing any `CREATE` and `ALTER` operations.

jacques-n · 2022-08-08T22:55:27Z

proto/substrait/algebra.proto

+message NamedTableWrite {
+  repeated string names = 1;
+  // The columns that will be modified (representing after-image of a schema change)
+  NamedStruct base_schema = 2;


I can't figure out how to use names + base schema to map my partial input onto a broader schema. I also don't think this belongs in Named Table; it feels like a generic/cross-table concept. After moving properties to WriteRel, I think we need better clarity around what this write is and how it maps to an underlying schema. One option:

A final schema and an index-based mapping from the input schema to the output. For example:

input: (a int, c int)

output: (a int, b int, c int, d int)

defaults: (1, 4, 7, 6) -- defined based on the output schema

map: (0, 2) -- maps input[0] => output[0] and input[1] => output[2]

I can't see how to use a set of names to do this mapping since there could be dupes at various levels of schema. (It also seems anti-substrait-patterns to use some form of naming system to map two schemas).

We should also probably have an option around defaults: do we apply them only for missing columns, or do we also apply them for null values? (The latter is a lot more expensive...)
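The index-based mapping and the defaults question can be sketched in a few lines. This is a hypothetical helper, not anything in the patch: `project_onto_output` and its `default_nulls` flag are names invented here, and the map (0, 2) is read as input[1] going to output index 2 (column c).

```python
def project_onto_output(row, mapping, defaults, default_nulls=False):
    """Place an input row into the output schema by position.

    mapping[i] is the output index that input column i writes to.
    Output columns that no input column maps to keep their default.
    If default_nulls is True, a None input value also falls back to the
    default (the more expensive option, since every value is inspected).
    """
    out = list(defaults)
    for in_idx, out_idx in enumerate(mapping):
        value = row[in_idx]
        if value is None and default_nulls:
            continue  # keep the default for this column
        out[out_idx] = value
    return tuple(out)


defaults = (1, 4, 7, 6)  # one default per output column (a, b, c, d)
mapping = (0, 2)         # input (a, c) -> output positions 0 and 2

full = project_onto_output((10, 30), mapping, defaults)
keep_null = project_onto_output((10, None), mapping, defaults)
fill_null = project_onto_output((10, None), mapping, defaults, default_nulls=True)
```

Because the mapping is purely positional, duplicate column names at any level of the schema are not a problem.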

The thinking is as follows:

names is simply a qualified table name (like mydb.table). I am copying the pattern from ReadRel.NamedTable. Maybe we should call it qualified_table_name or something like that.

base_schema describes the entire schema we want in output (after an ALTER, or as a result of a CREATE or INSERT operation, etc.).

The content is positionally (as usual) mapped to input. For example, if you are doing UPDATE foo SET col1=3 WHERE cond, the input would be something like SELECT 3, col2, col3... FROM foo WHERE cond.

In hindsight we might be able to skip the defaults field altogether and use the input for that (using a set of constants if we are issuing a CREATE TABLE for example).
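As a sketch of the positional reading described in this reply, the update of col1 can be modeled as an input query that produces complete rows in base_schema order. The table contents, the three-column schema, and the `update_input` helper are assumptions made for illustration.

```python
base_schema = ["col1", "col2", "col3"]  # hypothetical three-column table foo
foo = [(1, "x", 10), (2, "y", 20), (3, "z", 30)]


def update_input(rows, cond):
    """Model of: SELECT 3, col2, col3 FROM foo WHERE cond."""
    return [(3, col2, col3) for (col1, col2, col3) in rows if cond((col1, col2, col3))]


# cond here is "col3 >= 20", chosen arbitrarily for the example.
write_input = update_input(foo, lambda r: r[2] >= 20)

# Each input row lines up with base_schema by position, not by name.
named = [dict(zip(base_schema, row)) for row in write_input]
```

Under the same reading, a write with no existing data could supply a row of constants as its input, which is what would make a separate defaults field unnecessary.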

curino · 2022-08-09T18:29:57Z

Thanks @jacques-n for the live-chat. I have addressed all we discussed, and updated some of the docs accordingly.

jacques-n

One little leftover that should be cleaned up, but otherwise looking good. I'll fix it and merge, assuming tests pass.

proto/substrait/algebra.proto

curino · 2022-08-10T13:59:02Z

Thanks @jacques-n and @jcamachor for the help shaping this.

jvanstraten · 2022-08-15T08:56:47Z

I'm a bit late to the party now (was on vacation, sorry), but I can't help but notice that oneof rel_type wasn't updated. AFAICT the new messages are entirely unreachable from Plan. Is that intentional at this time? I haven't read any of this in detail yet though, so I might have missed something.

…ubstrait-io#284 (relation references)

- add new relation types to rel_type to make them usable
- add constraints to prevent cyclic relation references
- document how relation references work in the relation basics section
- s/tuples/records/g for naming consistency
- move ReferenceRel out of the AggregateFunction message scope

BREAKING CHANGE: various messages and semantics that were not yet reachable from the Plan message were changed

curino force-pushed the modops_3 branch from fb70c0c to 3b73943 Compare July 20, 2022 14:56

curino mentioned this pull request Jul 20, 2022

feat: add initial support for DDL and data modification operator (#128) #236

Closed

feat: add support for DDL and INSERT/DELETE/UPDATE operations (substr…

496c9bb

…ait-io#128)

curino force-pushed the modops_3 branch from 3b73943 to 496c9bb Compare August 3, 2022 18:05

feat: simplifying the output operator for INSERT/DELETE/UPDATE, intro…

9818fb9

…ducing the mode (substrait-io#128)

jacques-n reviewed Aug 8, 2022


feat: addressing feedback on ddl/update code structure. (substrait-io…

61c0875

…#128)

jacques-n previously approved these changes Aug 10, 2022


Update proto/substrait/algebra.proto

9d910be

jacques-n dismissed their stale review via 9d910be August 10, 2022 00:04

jacques-n approved these changes Aug 10, 2022


jacques-n merged commit cbb6c26 into substrait-io:main Aug 10, 2022

cpcloud mentioned this pull request Aug 11, 2022

feat(proto): add write relation to support multiple outputs #239

Closed

curino deleted the modops_3 branch August 11, 2022 23:01

jvanstraten mentioned this pull request Aug 15, 2022

fix: various fixes for #252 (write and DDL relations) and #284 (relation references) #288

Closed
