Describe the bug
When a client requests work, the request is expressed in core-seconds. When filling that request, the server does not account for the number of cores each multi-threaded (MT) task uses.
Expected behavior
The server needs to estimate the real CPU usage, across all applicable cores, for MT tasks.
Log entries
14/01/2021 19:30:21 | | [work_fetch] Request work fetch: project work fetch resumed by user
14/01/2021 19:30:22 | | [work_fetch] target work buffer: 8640.00 + 864.00 sec
14/01/2021 19:30:22 | | [work_fetch] shortfall 38016.00 nidle 4.00 saturated 0.00 busy 0.00
14/01/2021 19:30:22 | Milkyway@Home | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices
14/01/2021 19:30:24 | Milkyway@Home | Scheduler request completed: got 23 new tasks
14/01/2021 19:30:24 | Milkyway@Home | [sched_op] estimated total CPU task duration: 39233 seconds
11/01/2021 12:00:39 | PrimeGrid | CPU needs work - buffer low
11/01/2021 12:00:39 | PrimeGrid | [work_fetch] request: CPU (38016.00 sec, 4.00 inst) Intel GPU (0.00 sec, 0.00 inst)
11/01/2021 12:00:39 | PrimeGrid | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices
11/01/2021 12:00:40 | PrimeGrid | Scheduler request completed: got 48 new tasks
11/01/2021 12:00:40 | PrimeGrid | [sched_op] estimated total CPU task duration: 56610 seconds
In both cases, the 'target work buffer' is expressed in wall time - 9,504 seconds, or about 2 hours 40 minutes.
But the 'request' is for four times that much - over 10 hours of core-time.
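The relationship between the two figures can be checked with a small sketch. The helper name and its parameters are hypothetical, chosen for illustration only; the numbers come from the log lines above (8640 + 864 seconds of buffer, 4 idle devices).

```cpp
#include <cassert>

// Hypothetical helper, not part of BOINC: converts a wall-time work buffer
// into the core-seconds figure that appears in the work request.
double core_seconds_requested(double buffer_min_secs,
                              double buffer_extra_secs,
                              int idle_cores) {
    assert(idle_cores >= 0);
    // Wall-time target (min + extra) multiplied by the number of idle cores.
    return (buffer_min_secs + buffer_extra_secs) * idle_cores;
}
```

Calling `core_seconds_requested(8640.0, 864.0, 4)` gives 38016, matching the `shortfall 38016.00` and `CPU work request: 38016.00 seconds` entries in the logs.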
In both these cases, the scheduler went on adding MT tasks to the reply as if they were single-threaded until the work request was fulfilled. Work for over ten hours of wall-time was delivered (more for PrimeGrid, which still uses DCF, the duration correction factor).
This issue has been raised again by the CPDN project, which is testing and preparing to release a new multi-threaded application to process IFS climate models. These will be large tasks, with heavy resource demand: this bug will significantly delay the climate research, because too many tasks will be downloaded by the initial few machines.
We're heading towards a new server release to facilitate #4871 - it would be nice if somebody could code the solution to this trivial oversight before then. But I don't have access to a project server for testing, and I won't code without being able to test my own work.
The server estimates task duration at
https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L429 (estimate_duration_unscaled) and
https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L477 (estimate_duration)
Neither routine considers the core loading of the MT tasks allocated.
avg_ncpus is available in the HOST_USAGE structure, and should be considered in either estimate_duration_unscaled or estimate_duration.
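A minimal sketch of the accounting difference, under stated assumptions: `CandidateTask` and `tasks_sent` are hypothetical stand-ins, not BOINC's real types or functions, and the real scheduler logic in sched_send.cpp is more involved. The sketch only contrasts debiting the request by wall-clock duration (the behaviour reported here) with debiting it by duration multiplied by avg_ncpus.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a task the scheduler is considering.
struct CandidateTask {
    double est_wall_secs;  // estimated wall-clock duration on this host
    double avg_ncpus;      // cores the task keeps busy (1.0 if single-threaded)
};

// Counts how many tasks are sent before the core-seconds request is used up.
// mt_aware == false: each task is debited at its wall-clock duration,
//                    i.e. treated as single-threaded.
// mt_aware == true:  each task is debited at duration * avg_ncpus.
int tasks_sent(const std::vector<CandidateTask>& queue,
               double req_core_secs,
               bool mt_aware) {
    int sent = 0;
    double remaining = req_core_secs;
    for (const CandidateTask& t : queue) {
        if (remaining <= 0) break;  // request satisfied
        assert(t.avg_ncpus > 0);
        const double debit =
            mt_aware ? t.est_wall_secs * t.avg_ncpus : t.est_wall_secs;
        remaining -= debit;
        ++sent;
    }
    return sent;
}
```

With an illustrative queue of 50 tasks, each estimated at 1,700 seconds on 4 cores (assumed figures, not taken from either project), and the 38,016 core-second request from the logs, `tasks_sent(queue, 38016.0, false)` returns 23 while `tasks_sent(queue, 38016.0, true)` returns 6.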