Is there an efficient way to understand cpu_cost_model? #43
@shivramsrivastava Please give me some feedback, thanks a lot.
@chauncey-77, here is a brief summary of how the CPU/Memory flow graph is built. I hope this helps as far as a basic description is concerned; for the rest of the details you are going to have to read through the code to get a lower-level understanding.

As a quick overview, the Firmament scheduler models the scheduling problem as a constraint-based optimization over a flow network graph. A min-cost flow algorithm is leveraged to derive the implied workload placements. A flow network is a directed graph whose arcs carry flow from source nodes (i.e., pod nodes) to a sink node for a feasible solution to the optimization problem. A cost and a capacity associated with each arc constrain the flow and specify preferential routes for it.

In the Poseidon/Firmament CPU-Memory cost model, a task equivalence class (EC) is created based on the task's CPU and memory request. Each machine has a predefined number of machine ECs (M0EC1, M0EC2, ..., M2EC2) in order to distribute load across the filtered machines during each scheduling iteration. It is important to highlight that if there were only a single arc from a task EC node to a machine node, there would be a chance that all the incoming flows (tasks) travel across that single arc, overloading one machine with many tasks even though other machines have plenty of available resources. So, to avoid scheduling many tasks on the same machine, we use multiple arcs between the task EC and the machine nodes via intermediate machine EC nodes: the task EC is connected to multiple machine ECs, and each machine EC is connected to the corresponding machine node. The capacity on each arc (task EC to machine EC, and machine EC to machine node) is set to 1, and the costs on the arcs between the task EC and the machine ECs are assigned incrementally.

Let us take an example with three machines M0, M1 & M2, each with a capacity of 2 flows. Load distribution is achieved by adding two arcs via the corresponding machine ECs, with the cost on each arc increasing incrementally. In this case, the arcs connecting to the machine ECs for machine M0 have costs of 100 and 200 respectively, and a capacity of 1 each. In a nutshell, for each unit of capacity there is a corresponding machine EC, and the cost on these arcs increases incrementally in order to achieve load distribution across machines.

So, in summary, we essentially create an equivalence class called 'EC mem-cpu' based on the job ID hash value; this aggregates all the tasks from a job into one EC. As part of the step that connects the arc from the job/task to the EC, we use only CPU and memory to calculate the normalized cost.
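To make the M0/M1/M2 example concrete, here is a minimal sketch of the arc layout described above. The struct, the node names, the cost step of 100, and the `NormalizedCpuMemCost` helper (and its formula) are illustrative assumptions on my part, not the actual Firmament API; the real logic lives in the cpu_cost_model code.

```cpp
// Illustrative sketch only -- not the actual Firmament code. It mimics the
// arc layout described above: one task EC fans out to per-machine machine
// ECs (capacity 1 each), with the cost rising in steps of 100 per slot so
// that flow (tasks) spreads across machines instead of piling onto one.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Arc {
  std::string src;
  std::string dst;
  uint64_t capacity;
  int64_t cost;
};

// Hypothetical normalized cost: the task's cpu/mem request expressed as a
// fraction of the machine's capacity, scaled to an integer. The real
// cost model has its own normalization; this only shows the idea.
int64_t NormalizedCpuMemCost(double cpu_req, double mem_req,
                             double cpu_cap, double mem_cap) {
  double frac = (cpu_req / cpu_cap + mem_req / mem_cap) / 2.0;
  return static_cast<int64_t>(frac * 1000);
}

int main() {
  const int kSlotsPerMachine = 2;  // each machine accepts 2 flows
  const int64_t kCostStep = 100;   // incremental cost per extra slot
  std::vector<std::string> machines = {"M0", "M1", "M2"};
  std::vector<Arc> arcs;

  for (const auto& m : machines) {
    for (int slot = 0; slot < kSlotsPerMachine; ++slot) {
      std::string machine_ec = m + "EC" + std::to_string(slot + 1);
      // Task EC -> machine EC: capacity 1, cost 100, 200, ...
      arcs.push_back({"TaskEC_cpu_mem", machine_ec, 1,
                      kCostStep * (slot + 1)});
      // Machine EC -> machine node: capacity 1.
      arcs.push_back({machine_ec, m, 1, 0});
    }
  }

  for (const auto& a : arcs) {
    std::cout << a.src << " -> " << a.dst << "  cap=" << a.capacity
              << "  cost=" << a.cost << "\n";
  }
  // Example of the normalized cost on the task -> task EC side.
  std::cout << "normalized cost: "
            << NormalizedCpuMemCost(0.5, 2.0, 4.0, 16.0) << "\n";
  return 0;
}
```

Running this prints the six task-EC-to-machine-EC arcs (costs 100/200 per machine) plus the machine-EC-to-machine arcs, which is exactly the fan-out that keeps a single machine from absorbing all the flow.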
@deepak-vij Hi, thanks for your answer, but I have some questions:
@deepak-vij Hi, thanks for your answer, but I still have doubts:
Is there a way to visualize the flow graph currently existing in memory?
The reason we have all the pods (scheduled and unscheduled) in the flow graph is that Firmament is capable of rescheduling on every run. By default, though, rescheduling is turned off, and we have not tested the core Firmament rescheduling logic extensively at this time. It still needs some work to smooth out the kinks, which is why we have left it alone for now.
I am not sure about this; there may be a boundary-condition bug or an incorrect formula. You would have to look into the code to resolve this.
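Regarding the earlier question about visualizing the flow graph: one practical option, sketched below, assumes you can obtain a DIMACS-format dump of the network (DIMACS is the format Firmament feeds to its min-cost flow solver). How you obtain such a dump is up to you (e.g., via the scheduler's debug output); the converter itself is a generic DIMACS reader, not Firmament code, and the file name is just an example.

```cpp
// Hedged sketch: convert a DIMACS min-cost-flow dump into Graphviz DOT so
// the graph can be rendered and inspected visually.
//   'a <src> <dst> <min_flow> <cap> <cost>' lines become edges,
//   'n <id> <supply>' lines become labelled nodes.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " graph.dm > graph.dot\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  std::string line;
  std::cout << "digraph flow_network {\n";
  while (std::getline(in, line)) {
    std::istringstream ss(line);
    char kind;
    if (!(ss >> kind)) continue;       // skip empty lines
    if (kind == 'a') {                 // arc descriptor
      long src, dst, lo, cap, cost;
      if (ss >> src >> dst >> lo >> cap >> cost) {
        std::cout << "  n" << src << " -> n" << dst
                  << " [label=\"cap=" << cap << " cost=" << cost << "\"];\n";
      }
    } else if (kind == 'n') {          // node descriptor with supply/demand
      long id, supply;
      if (ss >> id >> supply) {
        std::cout << "  n" << id << " [label=\"n" << id
                  << " supply=" << supply << "\"];\n";
      }
    }
    // 'c' comment and 'p' problem lines are ignored.
  }
  std::cout << "}\n";
  return 0;
}
// Render with: dot -Tpng graph.dot -o graph.png
```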
@deepak-vij Hi!
As you say, in order to avoid all pods being scheduled to the node with the minimum cost, the machine EC is introduced. So, the
Hi,
I am very interested in your work on modeling the scheduling problem as a flow graph, especially the cpu_cost_model. I am trying to understand the details of the modeling process, i.e., the specific process of building the flow graph from scratch based on the state of the cluster and the scheduling strategy (cpu_cost_model). Unfortunately, I did not find much documentation about this, so I spent a lot of time reading the code. While reading it I encountered some difficulties: for example, I can understand the meaning of individual functions but find it difficult to string them together, and I still don't know what a PU node means. I think relevant documentation would be very helpful.
So I opened this issue.
I want to know whether there are closely related documents about the details of cpuCostModel. It would be even better if there were a simple but realistic modeling example for cpu_cost_model. I really want to figure out the specific process of building the flow graph from scratch based on the state of the cluster and the cpu_mem scheduling strategy. If you have any better suggestions to help me understand these things, thank you a lot.
Thanks again.