Describe the bug
Project-wide file upload backoff is too aggressive when multiple files need to be uploaded for a single task.
GPUGrid (as an example) uploads six files per task, and occasionally suffers network congestion/connection problems.
The client tries to upload each new file at least once, and applies a separate backoff to each file if it fails - starting small for each file, but increasing exponentially (within limits) if successive retries for each file fail.
The project-wide backoff is cumulative - by the time the sixth file is reached, the backoff can already be well over an hour, as it was in this case (I retried the uploads manually):
27/05/2020 09:12:02 | GPUGRID | Sending scheduler request: To fetch work.
27/05/2020 09:12:02 | GPUGRID | Requesting new tasks for NVIDIA GPU
27/05/2020 09:12:02 | GPUGRID | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
27/05/2020 09:12:02 | GPUGRID | [sched_op] NVIDIA GPU work request: 14638.31 seconds; 0.00 devices
27/05/2020 09:12:02 | GPUGRID | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
27/05/2020 09:12:04 | GPUGRID | Scheduler request completed: got 0 new tasks
27/05/2020 09:12:04 | GPUGRID | [sched_op] Server version 613
27/05/2020 09:12:04 | GPUGRID | No tasks sent
27/05/2020 09:12:04 | GPUGRID | This computer has reached a limit on tasks in progress
27/05/2020 09:12:04 | GPUGRID | Project requested delay of 31 seconds
27/05/2020 09:12:04 | GPUGRID | [sched_op] Deferring communication for 00:00:31
27/05/2020 09:12:04 | GPUGRID | [sched_op] Reason: requested by project
27/05/2020 09:54:19 | GPUGRID | Computation for task 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0 finished
27/05/2020 09:54:19 | GPUGRID | Starting task 1c7sA02_348_1-TONI_MDADex2sc-3-50-RND8505_3
27/05/2020 09:54:19 | GPUGRID | [cpu_sched] Starting task 1c7sA02_348_1-TONI_MDADex2sc-3-50-RND8505_3 using acemd3 version 210 (cuda101) in slot 3
27/05/2020 09:54:22 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 09:54:22 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 09:54:44 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0: connect() failed
27/05/2020 09:54:44 | GPUGRID | Backing off 00:02:17 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 09:54:44 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1: connect() failed
27/05/2020 09:54:44 | GPUGRID | Backing off 00:03:26 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 09:54:44 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 09:54:44 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 09:55:06 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2: connect() failed
27/05/2020 09:55:06 | GPUGRID | Backing off 00:03:00 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 09:55:06 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8: connect() failed
27/05/2020 09:55:06 | GPUGRID | Backing off 00:02:09 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 09:55:06 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 09:55:06 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 09:55:29 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9: connect() failed
27/05/2020 09:55:29 | GPUGRID | Backing off 00:02:18 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 09:55:29 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10: connect() failed
27/05/2020 09:55:29 | GPUGRID | Backing off 00:02:08 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 10:03:21 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 10:03:21 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 10:03:26 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 10:03:26 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 10:03:28 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 10:03:28 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 10:03:32 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 10:03:32 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 10:03:40 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 10:03:40 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 10:03:42 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 10:03:54 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 10:03:55 | GPUGRID | [sched_op] Starting scheduler request
27/05/2020 10:03:55 | GPUGRID | Sending scheduler request: To report completed tasks.
27/05/2020 10:03:55 | GPUGRID | Reporting 1 completed tasks
27/05/2020 10:03:55 | GPUGRID | Requesting new tasks for NVIDIA GPU
27/05/2020 10:03:55 | GPUGRID | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
27/05/2020 10:03:55 | GPUGRID | [sched_op] NVIDIA GPU work request: 16850.92 seconds; 0.00 devices
27/05/2020 10:03:55 | GPUGRID | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
27/05/2020 10:04:00 | GPUGRID | Scheduler request completed: got 1 new tasks
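To put the per-file figures from the log side by side, a small script can pull them out. This is only a sketch: the regular expression is inferred from the "Backing off ... on upload of ..." messages shown in the log, and `per_file_backoffs` is a hypothetical helper name, not part of BOINC.

```python
import re
from datetime import timedelta

# Inferred from the client messages in the log; not an official format spec.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d+):(\d+) on upload of (\S+)")

def per_file_backoffs(log_text):
    """Map each file name to the last backoff the client announced for it."""
    result = {}
    for line in log_text.splitlines():
        match = BACKOFF_RE.search(line)
        if match:
            hours, minutes, seconds, name = match.groups()
            result[name] = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
    return result

# The six backoff messages, copied from the log.
SAMPLE = """\
27/05/2020 09:54:44 | GPUGRID | Backing off 00:02:17 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 09:54:44 | GPUGRID | Backing off 00:03:26 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 09:55:06 | GPUGRID | Backing off 00:03:00 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 09:55:06 | GPUGRID | Backing off 00:02:09 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 09:55:29 | GPUGRID | Backing off 00:02:18 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 09:55:29 | GPUGRID | Backing off 00:02:08 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
"""

backoffs = per_file_backoffs(SAMPLE)
longest = max(backoffs.values())
print(len(backoffs), "files; longest per-file backoff:", longest)
```

On these six messages the longest per-file backoff is 3 minutes 26 seconds, so every individual file was eligible for a retry long before a project-wide delay of over an hour would expire.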
System Information
OS: Windows 7 (but probably applies to all)
BOINC Version: win-client_PR2732_2019-04-09_b3fc41a4 (but probably applies to all - we haven't touched upload backoffs for years)
Here's an example of the problem I'm experiencing - it explains why I'm monitoring so aggressively.
One task has just finished, and generated six output files. Each individual file has tried to upload (in three batches of two) for 21 or 22 seconds - 65 seconds in total. But BOINC has postponed any further attempt for 72 minutes after the individual file backoffs have expired.
The problem arises because GPUGrid (or their hosting agency) has a peculiarly aggressive DDoS sentinel, and sometimes rejects repeat connections until a 2-3 minute cooling off period has elapsed. We can't control that, and nor - seemingly - can they.
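To illustrate how six quick failures could add up to a delay of that order, here is a toy model. It is a sketch under stated assumptions, not the BOINC client's code: `BASE_DELAY`, `MAX_DELAY`, the plain doubling rule, and the `Project` class are all hypothetical, and the random jitter visible in the real log is left out so the numbers are reproducible. The only behaviour taken from the report is that each file keeps its own failure count while the project-wide count accumulates across files.

```python
from dataclasses import dataclass, field

BASE_DELAY = 120       # seconds; hypothetical delay after a first failure
MAX_DELAY = 4 * 3600   # seconds; hypothetical upper limit

def exp_backoff(failures, base=BASE_DELAY, cap=MAX_DELAY):
    """Delay after `failures` failures: base * 2^(failures - 1), capped."""
    if failures <= 0:
        return 0
    return min(cap, base * 2 ** (failures - 1))

@dataclass
class Project:
    project_failures: int = 0                           # shared by all files
    file_failures: dict = field(default_factory=dict)   # one counter per file

    def upload_failed(self, name):
        """Record one failed upload; return (per-file delay, project delay)."""
        self.file_failures[name] = self.file_failures.get(name, 0) + 1
        self.project_failures += 1
        return (
            exp_backoff(self.file_failures[name]),
            exp_backoff(self.project_failures),
        )

project = Project()
for i in range(6):
    # Six different files, each failing on its first attempt.
    file_delay, project_delay = project.upload_failed(f"result_{i}")

print("per-file delay:", file_delay, "s")        # first failure for this file
print("project-wide delay:", project_delay, "s")  # sixth failure for the project
```

In this model each file is only on its first failure (120 seconds), while the shared counter has reached six and the project-wide delay is 3840 seconds, i.e. 64 minutes. A single short outage is therefore counted six times at project level simply because the task produced six files.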