Client: project-wide file upload backoff is too aggressive ... #3778

Closed
RichardHaselgrove opened this issue May 27, 2020 · 2 comments · Fixed by #4575

@RichardHaselgrove (Contributor)

Describe the bug
Project-wide file upload backoff is too aggressive ...
... when multiple files need to be uploaded for a single task.

GPUGrid (as an example) uploads six files per task, and occasionally suffers network congestion/connection problems.

The client tries to upload each new file at least once and, if that attempt fails, applies a separate backoff to the file - starting small, but increasing exponentially (within limits) as successive retries fail.

The project-wide backoff is cumulative - by the time the sixth file is reached, the backoff can already be well over an hour, as it was in this case (I retried the uploads manually):

27/05/2020 09:12:02 | GPUGRID | Sending scheduler request: To fetch work.
27/05/2020 09:12:02 | GPUGRID | Requesting new tasks for NVIDIA GPU
27/05/2020 09:12:02 | GPUGRID | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
27/05/2020 09:12:02 | GPUGRID | [sched_op] NVIDIA GPU work request: 14638.31 seconds; 0.00 devices
27/05/2020 09:12:02 | GPUGRID | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
27/05/2020 09:12:04 | GPUGRID | Scheduler request completed: got 0 new tasks
27/05/2020 09:12:04 | GPUGRID | [sched_op] Server version 613
27/05/2020 09:12:04 | GPUGRID | No tasks sent
27/05/2020 09:12:04 | GPUGRID | This computer has reached a limit on tasks in progress
27/05/2020 09:12:04 | GPUGRID | Project requested delay of 31 seconds
27/05/2020 09:12:04 | GPUGRID | [sched_op] Deferring communication for 00:00:31
27/05/2020 09:12:04 | GPUGRID | [sched_op] Reason: requested by project
27/05/2020 09:54:19 | GPUGRID | Computation for task 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0 finished
27/05/2020 09:54:19 | GPUGRID | Starting task 1c7sA02_348_1-TONI_MDADex2sc-3-50-RND8505_3
27/05/2020 09:54:19 | GPUGRID | [cpu_sched] Starting task 1c7sA02_348_1-TONI_MDADex2sc-3-50-RND8505_3 using acemd3 version 210 (cuda101) in slot 3
27/05/2020 09:54:22 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 09:54:22 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 09:54:44 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0: connect() failed
27/05/2020 09:54:44 | GPUGRID | Backing off 00:02:17 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 09:54:44 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1: connect() failed
27/05/2020 09:54:44 | GPUGRID | Backing off 00:03:26 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 09:54:44 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 09:54:44 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 09:55:06 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2: connect() failed
27/05/2020 09:55:06 | GPUGRID | Backing off 00:03:00 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 09:55:06 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8: connect() failed
27/05/2020 09:55:06 | GPUGRID | Backing off 00:02:09 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 09:55:06 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 09:55:06 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 09:55:29 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9: connect() failed
27/05/2020 09:55:29 | GPUGRID | Backing off 00:02:18 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 09:55:29 | GPUGRID | Temporarily failed upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10: connect() failed
27/05/2020 09:55:29 | GPUGRID | Backing off 00:02:08 on upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 10:03:21 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 10:03:21 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 10:03:26 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_1
27/05/2020 10:03:26 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 10:03:28 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_0
27/05/2020 10:03:28 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 10:03:32 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_2
27/05/2020 10:03:32 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 10:03:40 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_9
27/05/2020 10:03:40 | GPUGRID | Started upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 10:03:42 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_10
27/05/2020 10:03:54 | GPUGRID | Finished upload of 1a5cA00_413_3-TONI_MDADex2sa-3-50-RND5174_0_8
27/05/2020 10:03:55 | GPUGRID | [sched_op] Starting scheduler request
27/05/2020 10:03:55 | GPUGRID | Sending scheduler request: To report completed tasks.
27/05/2020 10:03:55 | GPUGRID | Reporting 1 completed tasks
27/05/2020 10:03:55 | GPUGRID | Requesting new tasks for NVIDIA GPU
27/05/2020 10:03:55 | GPUGRID | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
27/05/2020 10:03:55 | GPUGRID | [sched_op] NVIDIA GPU work request: 16850.92 seconds; 0.00 devices
27/05/2020 10:03:55 | GPUGRID | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
27/05/2020 10:04:00 | GPUGRID | Scheduler request completed: got 1 new tasks
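
To make the interaction concrete, here is a minimal sketch (not BOINC's actual code; the names, constants, and jitter range below are illustrative assumptions) of how a single project-wide backoff, bumped once for every failed file, can grow toward an hour after one six-file batch even though no individual file has failed more than once:

```cpp
// Minimal sketch (not BOINC's actual implementation; names and constants
// are illustrative assumptions) of the interaction described above: each
// file gets its own exponential backoff, but every per-file failure also
// feeds one shared project-wide backoff, so six files each failing once
// is treated much like one file failing six times in a row.
#include <algorithm>
#include <cstdio>
#include <random>

const double BASE_BACKOFF = 60.0;       // assumed base period, seconds
const double MAX_BACKOFF  = 4 * 3600.0; // assumed upper limit, seconds

// Exponential backoff with random jitter, capped at MAX_BACKOFF.
static double exp_backoff(int failures, std::mt19937& rng) {
    std::uniform_real_distribution<double> jitter(0.5, 1.0);
    double b = BASE_BACKOFF * double(1 << std::min(failures, 10)) * jitter(rng);
    return std::min(b, MAX_BACKOFF);
}

int main() {
    std::mt19937 rng(1);
    int project_failures = 0;      // one counter shared by the whole project
    double project_backoff = 0.0;

    // One task finishes and each of its six output files fails its first upload.
    for (int file = 0; file < 6; file++) {
        double per_file_backoff = exp_backoff(/*failures=*/1, rng); // stays ~1-2 min
        project_failures++;
        project_backoff = exp_backoff(project_failures, rng);       // keeps growing
        std::printf("file %d: per-file backoff %4.0f s, project backoff %5.0f s\n",
                    file, per_file_backoff, project_backoff);
    }
    std::printf("project-wide deferral after one batch: about %.0f minutes\n",
                project_backoff / 60.0);
}
```

Under these assumptions the per-file backoffs stay in the one-to-two-minute range, while the shared project counter compounds across the batch - the same pattern as in the log above, where each file was backed off for only 2-3 minutes but further transfer attempts were deferred for over an hour.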

System Information

  • OS: Windows 7 (but probably applies to all)
  • BOINC Version: win-client_PR2732_2019-04-09_b3fc41a4 (but probably applies to all - we haven't touched upload backoffs for years)
@RichardHaselgrove (Contributor, Author)

Here's an example of the problem I'm experiencing - it explains why I'm monitoring so aggressively.

One task has just finished and generated six output files. The files were tried in three batches of two, each batch taking 21 or 22 seconds to fail - about 65 seconds in total. But BOINC has postponed any further attempt for 72 minutes, long after the individual file backoffs have expired.

[Screenshot: project backoff]

The problem arises because GPUGrid (or their hosting agency) has a peculiarly aggressive DDoS sentinel, and sometimes rejects repeat connections until a 2-3 minute cooling-off period has elapsed. We can't control that, and neither, seemingly, can they.

@RichardHaselgrove (Contributor, Author)

Refer to #909 (comment)

I think there's an unstated assumption here that each task produces one upload file.
