Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] support for dynamic partition overwrite #348

Closed
ivoson opened this issue Mar 6, 2020 · 16 comments
Closed

[Feature Request] support for dynamic partition overwrite #348

ivoson opened this issue Mar 6, 2020 · 16 comments

Comments

@ivoson
Copy link

ivoson commented Mar 6, 2020

There is no support for dynamic partition overwrite for now. It seems we can only overwrite with replaceWhere option to specify the overwrite part, which is not convenient.
I am not sure why we don't have dynamic overwrite support in delta, is there any limits about it?

@tdas
Copy link
Contributor

tdas commented Mar 9, 2020

Fundamentally, dynamic partition overwrite is not a great API because it is quite scary; you can accidentally overwrite a partition with bad records falling in partitions that shouldnt have been touched. So it is always safer to use replaceWhere to provide guard rails on what partitions should get overwritten.

In the future, we can always add support for dynamic partition overwrite. But I would still recommend using replaceWhere.

@ivoson
Copy link
Author

ivoson commented Mar 10, 2020

Got it and Thanks for your explain @tdas

@tdas tdas closed this as completed Mar 11, 2020
@koertkuipers
Copy link
Contributor

koertkuipers commented Mar 26, 2020

note that the risk of bad records overwriting partitions is zero if you do something like

dataframe
  .filter($"x" == 1)
  .write
  .format("delta")
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("x")
  .save("/some/location")

now you might argue that in such a case i can use replaceWhere and you are right.
but what is the logic is split up across jobs?
i can have one job doing the filtering and writing somewhere, and another job reading this data back in and doing the dynamic partition overwrite into delta. now the replaceWhere is suddenly cumbersome (the filter logic sits in another job, i now have to go repeat it), and this is still safe isn't it? we have this situation pretty frequently.

yes you can do dumb things with dynamic partition overwrite, but also very useful things.
removing it hurts adoption in my opinion.
(we added it back in internally because rewriting logic in many places wasnt an option)

@marmbrus
Copy link
Contributor

It is not really that you cannot do dumb things with replaceWhere. Its that the action being taken is explicit rather than implicit.

More over, up until now, every one who has asked for the feature was happy with replaceWhere. As such its never been a priority to add it.

If you got an implementation you should consider opening a PR.

@koertkuipers
Copy link
Contributor

opened PR here:
#371

@gdoron
Copy link

gdoron commented Jan 6, 2021

@tdas to be honest, breaking the normal API of spark 2, instead of .mode("overwrite") only overwrites the dynamic partitions in the df, it deletes the entire table. That's scary.

It's so easy to write overwrite when you are used to working with parquets table and forgetting adding the replaceWhere and boom, the table is gone.

Maybe that's why you added the time machine... because you knew it would happen ;)

@mblahay
Copy link

mblahay commented Aug 3, 2021

Just got bit by this because delta didn't function the same as parquet. Sure it is "dangerous," but it is also a feature that can be turned on and off depending on how one wishes to solve a particular problem. Willing to accept the risk.

@AbderrahmenM
Copy link

Any news about this please ?

@lachupacabra
Copy link

Does one know how to use the replace where filter when the data is partitioned by multiple columns?

image

@koertkuipers
Copy link
Contributor

It is not really that you cannot do dumb things with replaceWhere. Its that the action being taken is explicit rather than implicit.

More over, up until now, every one who has asked for the feature was happy with replaceWhere. As such its never been a priority to add it.

If you got an implementation you should consider opening a PR.

@marmbrus as you suggested i opened PR... in april 2020... it would be great if a committer could look at it sometime. thanks

@tdas tdas reopened this May 12, 2022
@tdas tdas changed the title no support for dynamic partition overwrite [Feature Request] support for dynamic partition overwrite May 12, 2022
@tdas
Copy link
Contributor

tdas commented May 12, 2022

I am reopening this issue. Despite our reservations about the risk of using this feature, by community demand, we have decided to go ahead with implementing this feature.

@tdas
Copy link
Contributor

tdas commented May 12, 2022

@koertkuipers interested in continuing to contribute with the PR?

@koertkuipers
Copy link
Contributor

yes. see:
#371

@tdas
Copy link
Contributor

tdas commented May 12, 2022

Cool. could you update the PR to latest master and run the tests again. lets hope it still works. further discussions on the PR.

mmengarelli pushed a commit to mmengarelli/delta.io that referenced this issue Jun 8, 2022
GitOrigin-RevId: ceacc3a6239f36e76cc8cf4f3447285bb06fc6a0

CDF + Vacuum tests

This PR adds two tests testing CDF + VACUUM integration.

Closes delta-io#1177

GitOrigin-RevId: f2f7b187cb3cc78c267d378eaf3d0657a56241d9

Adds miscellaneous (e.g. end-to-end workload, and CDCReader) tests for CDF.

Resolves delta-io#1178

GitOrigin-RevId: 5c7da4ff9413d84e73137a80673872065de8267b

Minor refactor to UpdateCommand

GitOrigin-RevId: ab3fcbe0522aa194ac0781730a50112655d5c7ec

 Fixes delta-io#348 Support Dynamic Partition Overwrite

The goal of this PR to to support dynamic partition overwrite mode on writes to delta.

To enable this on a per write add `.option("partitionOverwriteMode", "dynamic")`. It can also be set per sparkSession in the SQL Config using `.config("spark.sql.sources.partitionOverwriteMode", "dynamic")`.

Some limitations of this pullreq:
Dynamic partition overwrite mode in combination with replaceWhere is not supported. If both are set this will result in an error.
The SQL `INSERT OVERWRITE` syntax does not yet support dynamic partition overwrite. For this more changes will be needed to be made to `org.apache.spark.sql.delta.catalog.DeltaTableV2` and related classes.

Fixes delta-io#348
Closes delta-io#371

Signed-off-by: Allison Portis <allison.portis@databricks.com>
GitOrigin-RevId: 5b01e5b04e573dabe91ac2d71991a127617b8038

Add Checkpoint + CDF test.

This PR adds a test to ensure CDC fields are not included in the checkpoint file.

Resolves delta-io#1180

GitOrigin-RevId: d4a7b8bc4d1a79d30806ff18c5b507f6edcd964c
@bkyryliuk
Copy link

is it supported now ?

jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
The goal of this PR to to support dynamic partition overwrite mode on writes to delta.

To enable this on a per write add `.option("partitionOverwriteMode", "dynamic")`. It can also be set per sparkSession in the SQL Config using `.config("spark.sql.sources.partitionOverwriteMode", "dynamic")`.

Some limitations of this pullreq:
Dynamic partition overwrite mode in combination with replaceWhere is not supported. If both are set this will result in an error.
The SQL `INSERT OVERWRITE` syntax does not yet support dynamic partition overwrite. For this more changes will be needed to be made to `org.apache.spark.sql.delta.catalog.DeltaTableV2` and related classes.

Fixes delta-io#348
Closes delta-io#371

Signed-off-by: Allison Portis <allison.portis@databricks.com>
GitOrigin-RevId: 5b01e5b04e573dabe91ac2d71991a127617b8038
jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
The goal of this PR to to support dynamic partition overwrite mode on writes to delta.

To enable this on a per write add `.option("partitionOverwriteMode", "dynamic")`. It can also be set per sparkSession in the SQL Config using `.config("spark.sql.sources.partitionOverwriteMode", "dynamic")`.

Some limitations of this pullreq:
Dynamic partition overwrite mode in combination with replaceWhere is not supported. If both are set this will result in an error.
The SQL `INSERT OVERWRITE` syntax does not yet support dynamic partition overwrite. For this more changes will be needed to be made to `org.apache.spark.sql.delta.catalog.DeltaTableV2` and related classes.

Fixes delta-io#348
Closes delta-io#371

Signed-off-by: Allison Portis <allison.portis@databricks.com>
GitOrigin-RevId: 5b01e5b04e573dabe91ac2d71991a127617b8038
@scottsand-db
Copy link
Collaborator

Hi @bkyryliuk - it is included in the preview for delta lake 2.0.0 - see the preview release notes here: https://github.com/delta-io/delta/releases/tag/v2.0.0rc1. It will be included in the official 2.0.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants