[WIP] Add INSERT OVERWRITE to Trino SQL #11603

georgewfisher · 2022-03-21T18:53:39Z

Add INSERT OVERWRITE to Trino SQL to allow connectors to provide overwrite functionality without the need for session parameters.

Related issue: #11602

Current functionality: example multi-stage query:

SET SESSION hive.insert_existing_partitions_behavior='OVERWRITE';
INSERT INTO hive.test2.insert_test SELECT * FROM tpch.sf1.customer;

Can instead be written like this:

INSERT OVERWRITE hive.test2.insert_test SELECT * FROM tpch.sf1.customer;

Notes

This is a change to the SQL parser, connector interface, and the Hive connector.

Introduces a new variant of beginInsert on the connector metadata interface
Adds INSERT OVERWRITE to Trino SQL
Modifies the Hive connector to accept overwrite requests via the beginInsert operation; these requests supersede the insert_existing_partitions_behavior setting
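
A minimal sketch of the precedence rule described in the notes above, assuming a heavily simplified interface. SketchMetadata, SketchHiveMetadata, the boolean overwrite parameter, and the string return values are hypothetical names for illustration only; they are not the signatures used in this PR or in Trino's ConnectorMetadata. The enum values are assumed to match the values accepted by hive.insert_existing_partitions_behavior.

```java
// Illustrative only: a simplified stand-in for a connector metadata interface.
final class InsertOverwriteSketch {
    // Assumed to mirror the values of hive.insert_existing_partitions_behavior.
    enum InsertExistingPartitionsBehavior { APPEND, OVERWRITE, ERROR }

    // Hypothetical interface; real connector metadata methods take handles, not strings.
    interface SketchMetadata {
        // Existing operation: a plain INSERT INTO.
        String beginInsert(String table);

        // New variant: connectors that do not override it reject overwrite requests
        // and otherwise fall back to the existing operation.
        default String beginInsert(String table, boolean overwrite) {
            if (overwrite) {
                throw new UnsupportedOperationException("This connector does not support INSERT OVERWRITE");
            }
            return beginInsert(table);
        }
    }

    // Hypothetical Hive-like implementation: an explicit overwrite request
    // supersedes the session-level behavior.
    static final class SketchHiveMetadata implements SketchMetadata {
        private final InsertExistingPartitionsBehavior sessionBehavior;

        SketchHiveMetadata(InsertExistingPartitionsBehavior sessionBehavior) {
            this.sessionBehavior = sessionBehavior;
        }

        @Override
        public String beginInsert(String table) {
            return beginInsert(table, false);
        }

        @Override
        public String beginInsert(String table, boolean overwrite) {
            InsertExistingPartitionsBehavior effective =
                    overwrite ? InsertExistingPartitionsBehavior.OVERWRITE : sessionBehavior;
            // The returned string stands in for an insert handle.
            return table + ":" + effective;
        }
    }

    public static void main(String[] args) {
        SketchMetadata hive = new SketchHiveMetadata(InsertExistingPartitionsBehavior.APPEND);
        System.out.println(hive.beginInsert("insert_test"));        // session setting applies
        System.out.println(hive.beginInsert("insert_test", true));  // overwrite request wins

        // A connector that only implements the existing operation.
        SketchMetadata other = table -> table + ":APPEND";
        try {
            other.beginInsert("insert_test", true);
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the default method keeps existing connectors source-compatible, while a connector that opts in decides how an overwrite request interacts with its own session settings.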

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.

Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# General
* Add INSERT OVERWRITE to Trino SQL, enable connector interface, integrate into Hive connector

Details

I have opened an issue in which I explain why I am proposing this pull request: #11602

Justifications Summary

Allow for insert overwrite to be used by clients who do not have access to session properties
Simplify insert overwrite for query writers who do not find session parameters easy to use or remember
Standardize insert overwrite across all connectors in Trino
Uses syntax that is common to Spark and Hive

Motivations

Insert overwrite is a simple, user-friendly syntax that is familiar to users of Spark and Hive
Insert overwrite, in my experience, has been a core scenario for many teams using Trino and therefore a good candidate for promotion beyond a session property

cla-bot · 2022-03-21T18:53:41Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: George Fisher.
This is most likely caused by a git client misconfiguration; please make sure to:

Check if your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

dain · 2022-03-21T19:41:19Z

This has been discussed many, many, many times. The goal of Trino is not to be a better Hive, but to be an ANSI-compliant MPP engine over arbitrary data. This means changing the language for specific Hive behaviors is very unlikely to happen. In this specific case, partitioned data has never been particularly popular outside of very large (and very high tech) companies, and most of these are moving to new and better designs like Iceberg, so I don't think there is much chance of this getting accepted into Trino. In the future, I suggest you discuss with the community in Slack before putting a lot of work into a big change like this.

georgewfisher · 2022-03-21T22:49:22Z

Hi @dain thanks for the comments. After my thread on Trino Slack about this feature I wondered how this would be received.

I over-emphasized the Hive similarity - we do not actually use Hive outside Hive tables. My history with this kind of scenario extends back before I ever met a Hive table. My experience with the relative importance of this scenario is exactly as you say, at very large tech companies where the data size is unmanageable without partitioning.

Also, no worries about the effort: (1) I gained deep insight into the connector interface, (2) I got a dive into Trino testing, (3) I started to implement this as a proof of concept and it moved quickly, and (4) this is a change I can always use, so nothing is lost.

dain · 2022-03-22T19:54:35Z

Thanks. The Slack discussion on this is great. The conclusion of that discussion is that the upcoming MERGE support should be able to cover this use case.

georgewfisher · 2022-03-23T19:04:47Z

Great. I will continue discussion of this in the issue thread.

georgewfisher added 14 commits March 9, 2022 12:02

Insert Overwrite

c69df8b

Merge

c6298f0

Address merge issues

ef3e912

Fix Hive tests

cb87c3e

Fixed tests

560cb1b

Fix tests #3

4f3809f

Fix tests #4

5ac0f6b

Fix tests #5

7a9af8f

Revert to separate table name

e7f694a

Try removing test

e282a88

Try combining tests

f63addb

Try combining tests

1a6497b

Try combining tests

f996b4f

Try combining tests

0f81bfc

georgewfisher mentioned this pull request Mar 21, 2022

Add INSERT OVERWRITE to Trino SQL #11602

Closed

github-actions bot added docs tests:hive labels Mar 21, 2022

dain closed this Mar 22, 2022

ebyhr mentioned this pull request Jun 13, 2023

Support insert overwrite dir #8458

Closed
