Simplify MemoryManager #4522
Conversation
let batch =
    RecordBatch::try_from_iter(vec![("x", Arc::new(input) as ArrayRef)]).unwrap();
stagger_batch_with_seed(batch, 42)
let max_batch = 1024;
Again, slicing batches causes the memory limiter to fail as it doesn't consider array slicing
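To illustrate the point, here is a minimal sketch (not code from this PR) of why slicing confuses buffer-based accounting, using the `arrow` crate's `Array::slice` and `get_array_memory_size`: a slice shares the parent's buffers rather than copying them, so each slice gets charged for the full parent allocation.

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    let array = Int32Array::from_iter_values(0..4096);
    // The 16-element slice still references the original 4096-element buffer
    let slice = array.slice(0, 16);
    // Size estimates walk the retained buffers, so a small slice reports
    // (roughly) the full parent allocation; summing such estimates across
    // many slices of one batch vastly over-counts resident memory
    println!("parent: {} bytes", array.get_array_memory_size());
    println!("slice:  {} bytes", slice.get_array_memory_size());
}
```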
I believe stagger batches also adds zero length record batches, which was an important edge case to cover as I recall
@@ -91,6 +98,10 @@ impl BatchBuilder {
let num_entries = rng.gen_range(1024..8192);
for i in 0..num_entries {
    if self.is_finished() {
Early return to make tests run faster
schema: SchemaRef,
in_mem_batches: Mutex<Vec<BatchWithSortArray>>,
These Mutexes aren't actually needed; I took the opportunity to just remove them
impl Drop for MemTrackingMetrics {
    fn drop(&mut self) {
        self.metrics.try_done();
This is redundant, as the Drop impl of BaselineMetrics already calls this
impl Drop for MemTrackingMetrics {
    fn drop(&mut self) {
        self.metrics.try_done();
        if self.mem_used() != 0 {
This is no longer needed as TrackedAllocation does it automatically
/// allocations while processing data.
///
/// This consumer will NEVER spill.
pub struct MemoryConsumerProxy {
This is entirely replaced by TrackedAllocation
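For context, `TrackedAllocation` is an RAII guard that charges memory against the pool and returns it on drop. A minimal sketch of the idea, using a simplified stand-in pool rather than the PR's actual types:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Simplified stand-in for the pool; the real trait has more methods
#[derive(Debug, Default)]
struct Pool {
    used: AtomicUsize,
}

// RAII guard: whatever was charged against the pool is returned on drop,
// which is what makes an explicit proxy consumer unnecessary
struct TrackedAllocation {
    pool: Arc<Pool>,
    size: usize,
}

impl TrackedAllocation {
    fn new(pool: &Arc<Pool>) -> Self {
        Self { pool: Arc::clone(pool), size: 0 }
    }

    fn grow(&mut self, additional: usize) {
        self.pool.used.fetch_add(additional, Ordering::Relaxed);
        self.size += additional;
    }
}

impl Drop for TrackedAllocation {
    fn drop(&mut self) {
        // Give the memory back automatically, no explicit free required
        self.pool.used.fetch_sub(self.size, Ordering::Relaxed);
    }
}
```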
I agree that any form of fairness or distribution should be built either into the consumer itself (if it is capable of doing this) or into the scheduler (which is future work). Simplification looks good.
cc @yjshen and @richox (Blaze-rs team) who I think contributed some/all of the original implementation; also cc @milenkovicm who I think has expressed some interest in memory management before
Thank you @tustvold -- I agree I found the existing manager allocation algorithm a bit of a mismatch for DataFusion's needs and this rework matches it better. I would very much like to hear @yjshen's opinion of this change prior to merging.
I think the major downside to this new allocation approach is that some operators that don't benefit from more memory after a point (e.g. Sort, because it will spill anyway) will now consume all of the available memory in the plan at the expense of upstream operators. This may cause a plan with multiple such operators to fail (as the downstream operator will not be able to allocate any memory, it all having been given to the upstream operator). I think there are different ways around this problem, such as adjusting the spilling operator's strategy so that once it spills it can give back some of the memory it requested, etc.
This situation doesn't really affect the "error if memory budget is exceeded" use case, but I do think it affects what happens when operators begin to spill.
If we can't reach consensus on the particular strategy, perhaps we can follow the traditional DataFusion approach and provide an extension point -- namely make MemoryManager a trait and provide a basic implementation that can be customized by other systems if need be.
I wonder if this closes either #4328 or #4325
Also possibly related is @richox's report that the memory manager divvies up memory, as in #2829, which may or may not be relevant after this PR
In the interests of moving this PR along, I have posted to the mailing list https://lists.apache.org/thread/plrq40ldy4y9l6bj3157bk54dc029b3w and Slack trying to solicit comments
I propose if we don't get any comments in another 2 days, we merge this PR as is. We can address any comments / other features as follow ons. Please comment here if you need additional time to review or have other commentary
Thanks @tustvold for working on this new version of the memory manager!
Yes, you are correct that the current manager contains both memory usage accounting and memory scheduling (fair scheduling, in its current state). Its complexity is twofold:
- how to grant memory to a set of memory consumers fairly (with a consumer id identifying the different consumers)
- how to differentiate the consumers that require more memory during execution (evenly dividing the available memory among them all) from the ones that take a non-negligible amount of memory which they will return later.
And yes, fair scheduling exists only for simplicity of implementation, to avoid starvation of partitions or pipelines executed in parallel; there is lots of space for more sophisticated scheduling strategies.
Currently, the existing memory manager works well in the Blaze workload, in which we use DataFusion to run several Spark tasks in parallel inside each Spark Executor, and the memory manager controls both Sort and Shuffle. No #4325 has been witnessed. Perhaps @richox could correct me or provide more details on developing Blaze and operating it in Kuaishou in production, since I left in May.
I would suggest we start the design of memory scheduling and pooling mechanisms and think about how these interact with the new manager and yield a complete plan for the new memory management. I am concerned that after we consider scheduling, we'll have to add code with similar functionality.
I'm also fine with merging the PR as it is if scheduling or pooling is planned and will land shortly. A short period of the functionality regression @alamb pointed out doesn't hurt much.
I'll add a policy component to this prior to merge in that case, thank you for taking the time to review
Marking as a draft until the policy is added, so this PR is removed from my review queue
I've added a
@@ -142,8 +126,10 @@ impl Default for RuntimeEnv {
pub struct RuntimeConfig {
#[derive(Debug, Clone)]
/// Configuration information for memory management
pub enum MemoryManagerConfig {
This was a bit of a perverse API imo. I'm looking into properly addressing the *Config proliferation in #4349, but I figured I might as well clean this up whilst I was here. If people want to reuse the same pool, they can just use the same Arc<dyn MemoryPool>
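For example, sharing a budget between two runtimes could look like the sketch below. This assumes a `with_memory_pool`-style builder on `RuntimeConfig` and the `GreedyMemoryPool` from this line of work; exact paths and names may differ from the code at the time of this PR.

```rust
use std::sync::Arc;
use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryPool};
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};

fn main() -> datafusion::error::Result<()> {
    // One pool, shared by handing out clones of the same Arc
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(64 * 1024 * 1024));

    let env1 = RuntimeEnv::new(RuntimeConfig::new().with_memory_pool(Arc::clone(&pool)))?;
    let env2 = RuntimeEnv::new(RuntimeConfig::new().with_memory_pool(Arc::clone(&pool)))?;

    // Both environments now draw from the same 64 MiB budget
    assert_eq!(env1.memory_pool.allocated(), env2.memory_pool.allocated());
    Ok(())
}
```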
}

/// Options associated with a [`TrackedAllocation`]
#[derive(Debug)]
pub struct AllocationOptions {
This is designed to be an extension point, so we can potentially add more allocation options and correspondingly more sophisticated MemoryPool implementations.
/// │ z z │
/// └───────────────────────z──────────────────────z───────────────┘
///
/// Unspillable memory is allocated in a first-come, first-serve fashion
This is equivalent to the old behaviour
Arc::as_ptr(&memory_manager),
Arc::as_ptr(&ctx2.runtime_env().memory_manager)
));
assert_eq!(ctx1.runtime_env().memory_pool.allocated(), 100);
This test had to be rewritten, as pointer equality of fat pointers such as &dyn is funky
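The funkiness: a `*const dyn Trait` is a fat (data, vtable) pair, and the compiler may duplicate vtables across codegen units or deduplicate identical ones, so comparing whole fat pointers can give surprising answers even for the same underlying object. Comparing only the data half is reliable, roughly as the rewritten assertion does. A sketch with hypothetical types:

```rust
use std::sync::Arc;

trait Pool {}

struct MyPool;
impl Pool for MyPool {}

fn main() {
    let a: Arc<dyn Pool> = Arc::new(MyPool);
    let b = Arc::clone(&a);

    // Arc::as_ptr yields a fat *const dyn Pool; casting to *const ()
    // discards the vtable half and keeps only the data pointer
    let thin_a = Arc::as_ptr(&a) as *const ();
    let thin_b = Arc::as_ptr(&b) as *const ();
    assert!(std::ptr::eq(thin_a, thin_b));
}
```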
Thank you @tustvold -- I really like where this is going. I had a suggestion about the API design (MemoryPool::register()) rather than creating TrackedAllocations directly. I also think we could merge this PR as is if you don't want to make any more changes -- I could try to update the API as part of a follow on PR.
/// A [`MemoryPool`] that enforces no limit
#[derive(Debug, Default)]
pub struct UnboundedMemoryPool {
This is very nice
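The contrast with a limiting pool is essentially "track but never refuse" versus "refuse past a budget". A rough sketch of the two policies (illustrative names and signatures, not the PR's exact trait):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Tracking only: record usage for metrics, never fail a request
#[derive(Debug, Default)]
struct UnboundedPool {
    used: AtomicUsize,
}

impl UnboundedPool {
    fn grow(&self, additional: usize) {
        self.used.fetch_add(additional, Ordering::Relaxed);
    }
}

// Enforcing: fail a request once the budget would be exceeded
#[derive(Debug)]
struct BoundedPool {
    used: AtomicUsize,
    limit: usize,
}

impl BoundedPool {
    fn try_grow(&self, additional: usize) -> Result<(), String> {
        let prev = self.used.fetch_add(additional, Ordering::Relaxed);
        if prev + additional > self.limit {
            // Roll back the optimistic charge before reporting failure
            self.used.fetch_sub(additional, Ordering::Relaxed);
            return Err(format!("cannot allocate {additional} bytes"));
        }
        Ok(())
    }
}
```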
}

impl MemoryPool for FairSpillPool {
    fn allocate(&self, options: &AllocationOptions) {
I spent some time convincing myself this would work correctly if there are multiple SortExecs in the plan such that one would completely run before the other (e.g. as happens when there are multiple window functions with different partition / order by clauses). I believe it will be fine because each ExternalSorter creates the potentially spilling allocation upon construction, though I am still not 100% sure.
What would you think about moving the construction of TrackedAllocation into MemoryPool, something like
/// Create a new tracked allocation with the MemoryManager. Subsequent requests
/// can grow or shrink the memory allocated to this allocation.
///
/// The allocation is automatically deregistered on drop
fn register(self: &Arc<Self>, options: AllocationOptions) -> TrackedAllocation;
That way the intent would be clearer that all ExternalSort instances register themselves with the MemoryManager on creation, and part of registering would increase the num_spill count.
I think having the MemoryPool create TrackedAllocations makes more logical sense, as they are so tightly bound, and you need a MemoryPool for a tracked allocation anyways
> I believe it will be fine because each ExternalSorter creates the potentially spilling allocation upon construction, though I am still not 100% sure.
Now that ExecutionPlan is sync I don't think this is an issue, as the streams are all created prior to execution starting. Yes, theoretically you could lazily construct the TrackedAllocation, but I can't think of why you would structure your code in this way. Regardless, this issue existed before the changes in this PR; it is just easier to see.
> That way the intent would be clearer that all ExternalSort instances register themselves with the MemoryManager on creation, and part of registering would increase the num_spill count.
Perhaps I'm being stupid, but I don't see a tangible difference between TrackedAllocation::new(pool, options) and pool.register(options) -- you're just reordering the parameters? More specifically, in order for register to be implemented, you still need a public TrackedAllocation::new, so you're just adding an indirection?
I see -- I agree keeping TrackedAllocation::new makes sense.
I guess I was getting confused by the fact that allocate doesn't actually allocate any memory, but instead really registers a TrackedAllocation with the memory manager. Likewise free doesn't free any memory.
Could we somehow make the relationship between MemoryManager, AllocationOptions and TrackedAllocation clearer? What about changing some names:
MemoryManager::allocate --> MemoryManager::register
MemoryManager::free --> MemoryManager::deregister
Also, instead of AllocationOptions, what do you think about the name AllocationId or Allocator?
I've renamed TrackedAllocation to MemoryReservation and AllocationOptions to MemoryConsumer. I also tweaked the registration API to fit with this. PTAL
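With those renames, usage reads roughly as follows. This is a sketch against the API as it stands in this PR (the `GreedyMemoryPool` name and the `allocated()` accessor match the diffs quoted above, but may have shifted in later releases):

```rust
use std::sync::Arc;
use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

fn main() -> datafusion::error::Result<()> {
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(1024));

    // A named consumer registers with the pool and receives a reservation
    let mut reservation = MemoryConsumer::new("ExternalSorter").register(&pool);

    reservation.try_grow(512)?; // fallible: respects the pool's limit
    reservation.shrink(128);    // hand some memory back early

    assert_eq!(pool.allocated(), 384);
    // The remaining 384 bytes are returned automatically when the
    // reservation is dropped
    Ok(())
}
```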
I originally had this prior to 718a94b, however, it results in passing round this
Thanks @tustvold! I really like the new design of pools with different allocation policies, and moving spill() out of memory management simplifies the APIs a lot.
/// Return the total amount of memory allocated
fn allocated(&self) -> usize;
}
/// Options associated with a [`TrackedAllocation`]
/// A memory consumer that can be tracked by [`MemoryReservation`] in a [`MemoryPool`]
👍
/// a [`MemoryReservation`] that can be used to grow or shrink the memory reservation
pub fn register(self, pool: &Arc<dyn MemoryPool>) -> MemoryReservation {
This is a nice interface
Thanks again @tustvold
Benchmark runs are scheduled for baseline = 30de028 and contender = dba34fc. dba34fc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
false => {
    let available = self
        .pool_size
        .saturating_sub(state.unspillable + state.unspillable);
why sub state.unspillable twice here?
good question, @tustvold?
Looks like a mistake, good spot
Edit: fix in #5160
Which issue does this PR close?
Closes #2829.
Rationale for this change
The existing memory manager interface is quite tricky to use, and has resulted in the proliferation of "utility" wrappers to make it easier to work with. It also mixes the concept of limiting memory with "memory scheduling", where it tries to distribute memory "fairly". In practice I'm not sure this would have worked, and it would likely have blocked tokio worker threads (#4325).
This PR proposes a drastic simplification of MemoryManager.
What changes are included in this PR?
There are a couple of major changes worth highlighting:
MemoryPool
Are these changes tested?
Are there any user-facing changes?