-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-8128][VL] Retry borrow when granted less than size in multi-slot and shared mode #8132
base: main
Are you sure you want to change the base?
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename commit message and pull request title in the following format?
See also: |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Hi @kecookier thanks! Is this ready for review or still in draft? |
@zhztheplayer Thanks, still in draft, more tests are needed for this PR. |
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
057a94d
to
cce6aea
Compare
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
@zhztheplayer This PR is ready. Can you help review it? |
long requiredSize = Math.min(freeBytes(), size); | ||
long granted = borrow0(requiredSize); | ||
|
||
// If isDynamicCapacity is true, which means it is controlled by vanilla Spark, | ||
// and if the granted memory is less than the required size, it may be because | ||
// this task holds memory exceeding maxMemoryPerTask. Therefore, we actively retry | ||
// spilling all memory. After this, if there is still not enough memory acquired, | ||
// it should result in an OOM. | ||
if (granted < requiredSize && isDynamicCapacity) { | ||
LOGGER.info( | ||
"Exceed Spark perTaskLimit with maxTaskSizeDynamic when " | ||
+ "require:{} got:{}, try spill all.", | ||
requiredSize, | ||
granted); | ||
long spilled = TreeMemoryTargets.spillTree(this, Long.MAX_VALUE); | ||
long remain = requiredSize - granted; | ||
if (spilled > remain) { | ||
granted += borrow0(remain); | ||
} else { | ||
// OOM | ||
} | ||
} | ||
return granted; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @kecookier, thank you for working on this.
Can we add an individual memory target for doing the retries?
Something like
public static class RetryOnOomMemoryTarget extends MemoryTarget {
private final TreeMemoryTarget target;
@Override
public long borrow(long size) {
final long granted = target.borrow(size);
if (granted >= size) {
return granted;
}
// Granted < size. Spill the underlying memory target then retry borrowing.
final long remaining = size - granted;
TreeMemoryTargets.spillTree(target, Long.MAX_VALUE);
final long granted2 = target.borrow(remaining);
return granted + granted2;
}
I can help do a further benchmark too see if chaining another memory target hurts the overall performance. Thought we can make a try like this anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer I see, something like ThrowOnOomMemoryTarget( RetryOnOomMemoryTarget( Overacquired(...) ) )?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like that but I assume we tended to add it to this line? Since shared mode is only interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a comment like "Retry of spilling is needed in shared mode because ... " or something
if (GlutenConfig.getConf().memoryIsolation()) {
return TreeMemoryConsumers.isolated()
.newConsumer(tmm, name, spiller, virtualChildren);
} else {
// Retry of spilling is needed in shared mode because...
return MemoryTargets.retrySpillOnOom(
TreeMemoryConsumers.shared()
.newConsumer(tmm, name, spiller, virtualChildren));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer Thanks, I'll update the code soon.
What changes were proposed in this pull request?
Retry memory borrow when granted less than requested size in multi-slot and Gluten shared mode. If ThrowOnOomMemoryTarget cannot acquire memory, spill more memory and then retry.
(Fixes: #8128)
If the case in issue #8128 occurs, the first borrow(sizeA) will spill sizeA memory. After spill, if curMemory is greater than the maxPerTaskSize, we'll get 0 returned. So we retry spill(long.max), then re-borrow.
isDynamicCapacity
when constructTreeMemoryTargets#Node
, which means multi-task-slot and shared mode.How was this patch tested?
Test with my production ETL.