[VL] Gluten OOM with multi-slot executor configuration due to the Vanilla Spark memory acquisition strategy #8128
Comments
How does Vanilla Spark act here? I remember @zhztheplayer mentioned that when the first job takes all the executor memory and a second job is scheduled, any memory allocation in the first job should spill its data to release memory for the second job. Is that right?
@FelixYBW If the first job holds all the executor memory, which is M, then after the second job is scheduled, when the first job requests an allocation of SIZE memory, it will spill SIZE memory. However, if (M - SIZE) > M/2, Vanilla Spark will grant 0 for this request.
Detail of Vanilla Spark (ExecutionMemoryPool.acquireMemory):

private[memory] def acquireMemory(
    numBytes: Long,
    taskAttemptId: Long,
    maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => (),
    computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
  assert(numBytes > 0, s"invalid number of bytes requested: $numBytes")

  // Keep looping until we're either sure that we don't want to grant this request (because this
  // task would have more than 1 / numActiveTasks of the memory) or we have enough free
  // memory to give it (we always let each task get at least 1 / (2 * numActiveTasks)).
  // TODO: simplify this to limit each task to its own slot
  while (true) {
    val numActiveTasks = memoryForTask.keys.size
    val curMem = memoryForTask(taskAttemptId)

    // In every iteration of this loop, we should first try to reclaim any borrowed execution
    // space from storage. This is necessary because of the potential race condition where new
    // storage blocks may steal the free execution memory that this task was waiting for.
    maybeGrowPool(numBytes - memoryFree)

    // Maximum size the pool would have after potentially growing the pool.
    // This is used to compute the upper bound of how much memory each task can occupy. This
    // must take into account potential free memory as well as the amount this pool currently
    // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    // we did not take into account space that could have been freed by evicting cached blocks.
    val maxPoolSize = computeMaxPoolSize()
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    val minMemoryPerTask = poolSize / (2 * numActiveTasks)

    // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
    // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
    val toGrant = math.min(maxToGrant, memoryFree)

    // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
    // if we can't give it this much now, wait for other tasks to free up memory
    // (this happens if older tasks allocated lots of memory before N grew)
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
      logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
      lock.wait()
    } else {
      memoryForTask(taskAttemptId) += toGrant
      return toGrant
    }
  }
  0L  // Never reached
}
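For intuition, here is a minimal standalone sketch of the granting arithmetic in the loop above, using made-up numbers that mirror the M / SIZE scenario (the 12 GB pool, the 9 GB already held, and the 8 MB request are illustrative; this is not the real ExecutionMemoryPool, and the pool is assumed not to grow, so maxPoolSize == poolSize):

object GrantSketch {
  val GB: Long = 1L << 30

  // Simplified version of one iteration of acquireMemory.
  def grant(numBytes: Long, curMem: Long, poolSize: Long,
            memoryFree: Long, numActiveTasks: Int): Long = {
    val maxMemoryPerTask = poolSize / numActiveTasks        // the 1/N cap
    val minMemoryPerTask = poolSize / (2 * numActiveTasks)  // the 1/2N floor
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemoryPerTask - curMem))
    val toGrant = math.min(maxToGrant, memoryFree)
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
      // The real code would block on lock.wait() here; -1 just marks that branch.
      -1L
    } else {
      toGrant
    }
  }

  def main(args: Array[String]): Unit = {
    // A 12 GB pool, task A already holds 9 GB, and a second task has just become
    // active (numActiveTasks = 2). The 1/N cap is 6 GB, so an 8 MB request from
    // task A is granted 0, and it does not block either, because task A already
    // holds far more than the 3 GB floor.
    println(grant(numBytes = 8L << 20, curMem = 9 * GB, poolSize = 12 * GB,
                  memoryFree = 3 * GB, numActiveTasks = 2))   // prints 0
  }
}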
Correct. Though some of the memory consumers may not respect SIZE and will spill as much as they hold. E.g.,
@kecookier It seems this code also conducts retrying. Do this code and the PR functionally overlap, more or less?
@zhztheplayer Yes, Vanilla Spark only promises that each task holds at least minMemoryPerTask (1/2N) of the executor's memory. If the task already holds more than maxMemoryPerTask (1/N), it will not retry.
Backend
VL (Velox)
Bug description
After spilling a large amount of memory, the task still hits OOM; the detailed log:
The Underlying Logic
Executor slots = 2, and we use Gluten in shared mode. This means Gluten does not limit the memory one task can use; that is left to Vanilla Spark's memory management. In a multi-slot environment, maxPerTaskMem is dynamic.
In a multi-slot environment, the logic for Spark allocating memory to each task is as follows:
Assume slot = N, and the total execution off-heap memory for the executor is maxPoolSize. The maximum memory limit set by Spark for each task (maxPerTask) is dynamic and depends on the current number of tasks running in parallel (activeTaskNum). Spark guarantees that each task can acquire at least maxPoolSize / (2 * activeTaskNum) before blocking, and caps each task at maxPerTask = maxPoolSize / activeTaskNum.
If the memory currently held by a task exceeds maxPerTask, any further memory requests will immediately return 0. This situation can easily occur in a multi-slot environment because activeTaskNum can change.
Each time Gluten requests memory, it calls Spark's memory acquisition interface. When Spark returns 0, Gluten immediately treats it as an OOM.
For example, if slot = 8, consider the following timeline:
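The concrete timeline is not reproduced here; as a stand-in, the loop below (with made-up pool and task sizes) illustrates how the per-task cap shrinks as tasks become active on an executor with slot = 8, and why a task that grew early ends up with every new request granted 0:

object DynamicCapSketch {
  def main(args: Array[String]): Unit = {
    val GB = 1L << 30
    val maxPoolSize = 24 * GB        // hypothetical executor off-heap execution pool
    val curMem = 10 * GB             // memory already held by the first (early, large) task
    for (activeTaskNum <- 1 to 8) {  // slot = 8; tasks become active one by one
      val maxPerTask = maxPoolSize / activeTaskNum
      val grantable = math.max(0L, maxPerTask - curMem)
      println(f"activeTaskNum=$activeTaskNum  maxPerTask=${maxPerTask.toDouble / GB}%.1fG  " +
              f"grantable to first task=${grantable.toDouble / GB}%.1fG")
    }
    // From activeTaskNum = 3 onwards, maxPerTask (8 GB) is below the 10 GB the first
    // task already holds, so every further request from that task is granted 0.
  }
}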
Root Cause
In the real case, slot = 2 and executor.offheap = 12G. When activeTaskNum = 1, task 1925 holds 8.8G of off-heap memory. Then task 2007 is scheduled onto this executor, so maxPerTask becomes 12G / 2 = 6G. Task 1925 then tries to acquire 8MB; by the logic described above, Spark returns 0. This triggers a spill of 8MB, but after the spill the memory held by the task is still larger than 6G, so the request still returns 0. The OverAcquireTarget then reserves 8.8G * 0.3, which triggers a spill of 2.64G. In actuality, shuffle write spills almost 5.5G, but this does not help: the function still returns 0.
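To make the arithmetic of this case explicit, a small sketch that plugs the reported numbers into the same 1/N cap (the helper below is illustrative, not Gluten or Spark code):

object RootCauseNumbers {
  def main(args: Array[String]): Unit = {
    val GB = 1L << 30
    val MB = 1L << 20
    val poolSize    = 12 * GB                  // executor.offheap = 12G
    val activeTasks = 2                        // task 1925 plus the newly scheduled task 2007
    val maxPerTask  = poolSize / activeTasks   // 6G

    def grant(curMem: Long, request: Long): Long =
      math.min(request, math.max(0L, maxPerTask - curMem))

    // Task 1925 holds 8.8G, so an 8MB request is clipped to max(0, 6G - 8.8G) = 0.
    println(grant(curMem = (8.8 * GB).toLong, request = 8 * MB))   // 0

    // The 30% over-acquire target on 8.8G is where the 2.64G spill figure comes from.
    println(((8.8 * GB) * 0.3).toLong / MB)                        // ~2703 MB, i.e. ~2.64G
  }
}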
How to Resolve?
One option is the spark.gluten.memory.isolation mode, in which each task can use a maximum of (executor.offheap.size / slot * 0.5). This value is less than maxPerTask of Vanilla Spark, and it can waste up to (executor.offheap.size / slot * 0.5), because storage memory will be shrunk in Vanilla Spark.
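For reference, if one goes the isolation route, the relevant settings would look roughly like the sketch below (e.g., pasted into spark-shell or used when building the SparkSession). spark.executor.cores and spark.memory.offHeap.* are standard Spark settings; spark.gluten.memory.isolation is the switch mentioned above and is assumed here to take a boolean value; the sizes simply echo this report.

import org.apache.spark.SparkConf

// Rough sketch only; sizes should be derived from your own executor sizing.
val conf = new SparkConf()
  .set("spark.executor.cores", "2")              // 2 task slots per executor
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "12g")       // the 12G off-heap pool from this report
  .set("spark.gluten.memory.isolation", "true")  // cap each task instead of sharing the whole pool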
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response