[WIP] Add INSERT OVERWRITE to Trino SQL #11603

Closed
wants to merge 14 commits into from

georgewfisher

Add INSERT OVERWRITE to Trino SQL to allow connectors to provide overwrite functionality without the need for session parameters.

Related issue: #11602

Current functionality: example multi-stage query:

  SET SESSION hive.insert_existing_partitions_behavior='OVERWRITE';
  INSERT INTO hive.test2.insert_test SELECT * FROM tpch.sf1.customer;

Can instead be written like this:

  INSERT OVERWRITE hive.test2.insert_test SELECT * FROM tpch.sf1.customer;

Notes

This is a change to the SQL parser, connector interface, and the Hive connector.

  • Introduces a new variant of beginInsert on the connector metadata interface
  • Adds INSERT OVERWRITE to Trino SQL
  • Modifies the Hive connector to accept overwrite requests via the beginInsert operation; these supersede the insert_existing_partitions_behavior setting
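
The precedence described in the last bullet can be illustrated with a minimal sketch. Everything below is hypothetical: the class, enum, and method signatures are stand-ins invented for illustration and are not the code in this PR or Trino's actual SPI. The enum is reduced to two values, and the default rejection for connectors that do not implement the new variant is an assumed design choice.

```java
// Hypothetical sketch; these types are illustrative and are not Trino's SPI.
final class InsertOverwriteSketch {
    // Stand-in for the value of hive.insert_existing_partitions_behavior,
    // reduced to two values for illustration.
    enum PartitionBehavior { APPEND, OVERWRITE }

    // Stand-in for the connector metadata interface.
    interface Metadata {
        // Existing-style entry point: only the session setting decides.
        PartitionBehavior beginInsert(String table, PartitionBehavior sessionBehavior);

        // New variant: the statement says whether it was INSERT OVERWRITE.
        // Assumed default: connectors without overwrite support reject the request.
        default PartitionBehavior beginInsert(String table, PartitionBehavior sessionBehavior, boolean overwrite) {
            if (overwrite) {
                throw new UnsupportedOperationException("INSERT OVERWRITE is not supported for " + table);
            }
            return beginInsert(table, sessionBehavior);
        }
    }

    // Hive-like connector: an explicit overwrite request supersedes the session setting.
    static final class HiveLikeMetadata implements Metadata {
        @Override
        public PartitionBehavior beginInsert(String table, PartitionBehavior sessionBehavior) {
            return sessionBehavior;
        }

        @Override
        public PartitionBehavior beginInsert(String table, PartitionBehavior sessionBehavior, boolean overwrite) {
            return overwrite ? PartitionBehavior.OVERWRITE : beginInsert(table, sessionBehavior);
        }
    }
}
```

In this sketch a plain INSERT INTO still follows the session setting, so existing queries would behave as before; only a statement that explicitly asks for overwrite bypasses it.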

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.

Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# General
* Add INSERT OVERWRITE to Trino SQL, enable connector interface, integrate into Hive connector

Details

I have opened an issue where I explain why I am proposing this pull request: #11602

Justifications Summary

  • Allow insert overwrite to be used by clients who do not have access to session properties
  • Simplify insert overwrite for query writers who do not find session parameters easy to use or remember
  • Standardize insert overwrite across all connectors in Trino
  • Use syntax that is common to Spark and Hive

Motivations

  • Insert overwrite is a simple, user-friendly syntax that is familiar to users of Spark and Hive
  • Insert overwrite, in my experience, has been a core scenario for many teams using Trino and therefore a good candidate for promotion beyond a session property

@cla-bot

cla-bot bot commented Mar 21, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: George Fisher.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@dain
Member

dain commented Mar 21, 2022

This has been discussed many, many, many times. The goal of Trino is not to be a better Hive, but to be an ANSI-compliant MPP engine over arbitrary data. This means changing the language for specific Hive behaviors is very unlikely to happen. In this specific case, partitioned data has never been particularly popular outside of very large (and very high tech) companies, and most of these are moving to new and better designs like Iceberg, so I don't think there is much chance of this getting accepted into Trino. In the future, I suggest you discuss with the community in Slack before putting a lot of work into a big change like this.

@georgewfisher
Author

Hi @dain thanks for the comments. After my thread on Trino Slack about this feature I wondered how this would be received.

I over-emphasized the Hive similarity - we do not actually use Hive outside Hive tables. My history with this kind of scenario extends back before I ever met a Hive table. My experience with the relative importance of this scenario is exactly as you say, at very large tech companies where the data size is unmanageable without partitioning.

Also, no worries about the effort: (1) I gained deep insight into the controller interface, (2) I got a dive into Trino testing, (3) I started implementing this as a proof of concept and it moved quickly, and (4) this is a change I can always use, so nothing is lost.

@dain
Member

dain commented Mar 22, 2022

Thanks. The Slack discussion on this is great. The conclusion of that discussion is that the upcoming MERGE support should be able to cover this use case.

@dain dain closed this Mar 22, 2022
@georgewfisher
Author

Great. I will continue discussion of this in the issue thread.
