[Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way #4733

Merged
merged 10 commits into from
Mar 11, 2024

Conversation

binmahone
Contributor

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #4732)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Run Gluten Clickhouse CI

@binmahone binmahone changed the title [Gluten-4732][CH] avoid copying too many files from Delta [Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way Feb 29, 2024

}

def getFileFormat(meta: Metadata): DeltaMergeTreeFileFormat = {
val fileFormat = new DeltaMergeTreeFileFormat(
Contributor

Return new DeltaMergeTreeFileFormat(...) directly rather than assigning it to a val first.

Contributor Author

fixed

}

object ClickHouseTableV2 extends Logging {
val deltaLog2Table = mutable.HashMap[DeltaLog, ClickHouseTableV2]()
Contributor

Consider using a ConcurrentHashMap?

Contributor Author

fixed
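
A minimal sketch of the suggestion above, assuming the registry can be touched by several threads at once. `TableRegistry`, `LogKey` and `TableHandle` are hypothetical stand-ins for `ClickHouseTableV2.deltaLog2Table`, `DeltaLog` and `ClickHouseTableV2`; the change actually made in the PR may look different.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical stand-ins for DeltaLog and ClickHouseTableV2.
final case class LogKey(path: String)
final case class TableHandle(name: String)

object TableRegistry {
  // ConcurrentHashMap tolerates concurrent reads and writes,
  // which scala.collection.mutable.HashMap does not.
  private val tables = new ConcurrentHashMap[LogKey, TableHandle]()

  // computeIfAbsent applies the factory at most once per key, atomically.
  def getOrCreate(key: LogKey)(create: LogKey => TableHandle): TableHandle =
    tables.computeIfAbsent(
      key,
      new JFunction[LogKey, TableHandle] {
        override def apply(k: LogKey): TableHandle = create(k)
      })

  def size: Int = tables.size()
}
```

With this shape, two callers racing on the same key would both receive the same `TableHandle` instance.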

}

override def fileFormat(metadata: Metadata = metadata): FileFormat =
ClickHouseTableV2.deltaLog2Table(this).getFileFormat(metadata)
Contributor

It seems the ClickHouseTableV2 cannot be obtained this way when no write operation has happened in this Spark session, right?

Contributor Author

fixed
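
To illustrate the concern, here is a small sketch, assuming the registry is only populated on some code paths (for example writes). `LookupDemo`, its string keys and the `rebuild` callback are hypothetical and only show the difference between an `apply`-style lookup and an `Option`-based one; they are not the PR's actual fix.

```scala
import scala.collection.mutable

object LookupDemo {
  // Hypothetical registry keyed by table path; nothing populates it up front.
  val registry = mutable.HashMap[String, String]()

  // apply-style lookup: throws NoSuchElementException for an unregistered key.
  def unsafeLookup(path: String): String = registry(path)

  // Option-based lookup: the caller can rebuild the value when nothing was registered.
  def safeLookup(path: String)(rebuild: String => String): String =
    registry.get(path).getOrElse(rebuild(path))
}
```

A read-only session would hit the failing branch of `unsafeLookup`, whereas `safeLookup` falls back to whatever the caller can reconstruct.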

optionalBucketSet: Option[BitSet],
optionalNumCoalescedBuckets: Option[Int],
disableBucketedScan: Boolean): Seq[InputPartition] = {
val tableV2 = ClickHouseTableV2.deltaLog2Table(deltaLog)
Contributor

It seems the ClickHouseTableV2 cannot be obtained this way when no write operation has happened in this Spark session, right?

Contributor Author

fixed

Comment on lines 388 to 392
deltaScan.files
.map(
addFile => {
val addFileAsKey = AddFileAsKey(addFile)
ClickhouseSnapshot.fileStatusCache.get(addFileAsKey)
Contributor

This fileStatusCache only caches the AddMergeTreeParts; it does not reduce the time spent listing files from the Delta log, does it? deltaScan.files still seems to trigger a select action.

Contributor Author

fixed
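
The point raised in this thread can be sketched as follows, assuming a per-file cache sits behind a listing step that runs on every scan. `ScanDemo`, `listFiles` and `partCache` are hypothetical names standing in for `deltaScan.files` and `fileStatusCache`; the counters exist only to make the behaviour visible.

```scala
import scala.collection.concurrent.TrieMap

object ScanDemo {
  var listCalls = 0    // how many times the listing step ran
  var convertCalls = 0 // how many times a file entry was converted

  // Hypothetical stand-in for deltaScan.files: executed on every scan.
  def listFiles(): Seq[String] = { listCalls += 1; Seq("part-0", "part-1") }

  // Per-file cache: saves the conversion, not the listing.
  private val partCache = TrieMap.empty[String, String]

  def scan(): Seq[String] =
    listFiles().map { f =>
      partCache.getOrElseUpdate(f, { convertCalls += 1; s"mergetree:$f" })
    }
}
```

Calling `scan()` repeatedly keeps increasing `listCalls` while `convertCalls` stops growing once every file is cached, which is the cost the reviewer is asking about.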

github-actions bot commented Mar 1, 2024

Run Gluten Clickhouse CI

github-actions bot commented Mar 4, 2024

Run Gluten Clickhouse CI

github-actions bot commented Mar 5, 2024

Run Gluten Clickhouse CI

Contributor

zzcclp commented Mar 5, 2024

LGTM

github-actions bot commented Mar 7, 2024

Run Gluten Clickhouse CI

@binmahone binmahone merged commit 3f30efd into apache:main Mar 11, 2024
17 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_4733_time.csv | log/native_master_03_10_2024_3ad58ce14_time.csv | difference | percentage |
|---|---|---|---|---|
| q1 | 35.79 | 35.86 | 0.065 | 100.18% |
| q2 | 24.14 | 25.95 | 1.808 | 107.49% |
| q3 | 38.31 | 36.82 | -1.493 | 96.10% |
| q4 | 37.54 | 38.96 | 1.415 | 103.77% |
| q5 | 69.27 | 70.83 | 1.557 | 102.25% |
| q6 | 7.28 | 7.41 | 0.124 | 101.71% |
| q7 | 84.54 | 83.10 | -1.447 | 98.29% |
| q8 | 86.16 | 85.35 | -0.802 | 99.07% |
| q9 | 119.47 | 118.73 | -0.741 | 99.38% |
| q10 | 46.02 | 42.98 | -3.035 | 93.40% |
| q11 | 20.35 | 21.60 | 1.252 | 106.15% |
| q12 | 27.21 | 25.06 | -2.149 | 92.10% |
| q13 | 47.11 | 46.55 | -0.556 | 98.82% |
| q14 | 18.76 | 19.73 | 0.963 | 105.13% |
| q15 | 30.71 | 32.39 | 1.672 | 105.45% |
| q16 | 14.74 | 13.72 | -1.018 | 93.09% |
| q17 | 99.57 | 100.46 | 0.891 | 100.89% |
| q18 | 141.86 | 142.43 | 0.572 | 100.40% |
| q19 | 13.59 | 13.62 | 0.030 | 100.22% |
| q20 | 27.10 | 26.87 | -0.233 | 99.14% |
| q21 | 223.72 | 225.05 | 1.333 | 100.60% |
| q22 | 14.77 | 14.04 | -0.734 | 95.03% |
| total | 1228.00 | 1227.48 | -0.526 | 99.96% |
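
The report does not say how the last two columns are computed. Judging from the numbers, they appear to be the master time minus the PR time, and the master time divided by the PR time. The sketch below encodes that reading as an assumption; `PerfColumns` and its parameter names are hypothetical, and small mismatches against the report are expected because the report seems to use unrounded timings.

```scala
object PerfColumns {
  // Assumption: difference = baseline (master) time - PR time.
  def difference(pr: Double, baseline: Double): Double = baseline - pr

  // Assumption: percentage = baseline (master) time / PR time, as a percent.
  def percentage(pr: Double, baseline: Double): Double = baseline / pr * 100.0
}
```

Under this reading, a percentage above 100% means the PR run was faster than the master run for that query.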

taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Mar 25, 2024
… in a more native delta way (apache#4733)

* compile pass

* spark 3.2 works

* fix spark session restart issue

* fix cache problem

* add test case for spark.sql.sources.partitionOverwriteMode

* fix ut on guava stats

* fix file path problem

* fix filesForScan

* add keysample info

* fix uri 2
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 8, 2024
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 9, 2024
Development

Successfully merging this pull request may close these issues.

[CH] delta-mergetree support update/delete/upsert/insert in a more Delta-like way
3 participants