Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Cancel Job Attachments session action when transfer rates drop below threshold #143

Merged
merged 1 commit into from
Feb 23, 2024

Conversation

gahyusuh
Copy link
Contributor

What was the problem/requirement? (What/Why)

During the SYNC_INPUT_JOB_ATTACHMENTS session actions, there might be a potential for the download operation to get stuck or stall due to unexpected issues. Currently, there is no mechanism to monitor or detect such scenarios. This poses a risk, especially when downloading many/large files or using slow/unstable network.

What was the solution? (How)

  • Leverages the on_downloading_files callback function, which is passed to the AssetSync's sync_inputs function to track the the file transfer progress. This callback is intended to be called at regular intervals - 1 second at the longest.
  • Monitors the transfer rate from the progress report delivered via the callback function. If it detects a series of alarmingly low transfer rates (< LOW_TRANSFER_RATE_THRESHOLD), and if the count exceeds the specified threshold (>= LOW_TRANSFER_COUNT_THRESHOLD), cancels the download and fails the current (SYNC_INPUT_JOB_ATTACHMENTS) action.

What is the impact of this change?

Worker agent can now detect and halt potentially stuck SYNC_INPUT_JOB_ATTACHMENTS session actions. This ensures that resources are not wasted on prolonged stalled operations, and that users are timely notified about the issues.

How was this change tested?

  • hatch run lint && hatch run test

  • Run an end-to-end test (submitting a non-rendering job bundle --> running a CMF)

    • Happy-path: ensured that there were no issues/errors until output files were generated as expected.
    • When Job Attachment's AssetSync (intentionally) reported the download speed lower than the speed threshold for the time longer than the time threshold:
    Canceled sessionaction-674e8fc017e24d7b94f8dd98eefbbf4f-1 as NEVER_ATTEMPTED
    INFO     [session-674e8fc017e24d7b94f8dd98eefbbf4f] [sessionaction-674e8fc017e24d7b94f8dd98eefbbf4f-0] (job.sync_input_job_attachments()): Action completed as FAILED
    INFO     Synchronizing with service (sending UpdateWorkerSchedule)
    INFO     Updating actions: {'sessionaction-674e8fc017e24d7b94f8dd98eefbbf4f-0': {'startedAt': datetime.datetime(2024, 1, 30, 17, 43, 49, 135243, tzinfo=datetime.timezone.utc), 'completedStatus': 'FAILED', 'progressMessage': 'Input syncing failed due to successive low transfer rates (< 10 Kb/s). The transfer rate was below the threshold for the last 1 minute.', 'endedAt': datetime.datetime(2024, 1, 30, 17, 44, 16, 518941, tzinfo=datetime.timezone.utc)}, 'sessionaction-674e8fc017e24d7b94f8dd98eefbbf4f-1': {'completedStatus': 'NEVER_ATTEMPTED', 'progressMessage': 'Input syncing failed due to successive low transfer rates (< 10 Kb/s). The transfer rate was below the threshold for the last 1 minute.'}}
    

    Also, confirmed that a telemetry data capturing this sync-inputs failure was sent correctly.

    {
      "event_timestamp":1706636627000,
      "event_type":"com.amazon.rum.deadline.worker_agent.sync_inputs_failure",
      "event_id":"16c2f820-6d7d-40b5-b777-df2f565a93d1",
      "event_version":"1.0.0",
      "log_stream":"2024-01-30T10",
      "application_id":"00cf4815-a611-42f0-85ac-1e05b995e6e6",
      "metadata": {
        "version":"0.20.0",
        "osName":"Linux",
        "osVersion":"5.10.205-172.807.amzn2int.x86_64",
        "domain":"*.madstudio.aws.dev",
        "service":"deadline-cloud-worker-agent",
        "python_version":"3.10.8",
        "countryCode":"US",
        "subdivisionCode":"OR"
      },
      "user_details":{
        "sessionId":"779308b4-9b63-4c94-8e60-0da0a0155169",
        "userId":"e726a3a4-d7c4-4e32-9610-dc333200a625"
      },
      "event_details":{
        "queue_id":"queue-a0c93fade6de47009999b273cc5b6249",
        "failure_reason":"Insufficient download speed: Input syncing failed due to successive low transfer rates (< 10 Kb/s). The transfer rate was below the threshold for the last 1 minute."
      }
    }
    

Was this change documented?

No.

Is this a breaking change?

No.

@gahyusuh gahyusuh force-pushed the gahyusuh/sija_cancel_due_to_low_speed branch from 4b46f84 to 51eeb36 Compare January 30, 2024 18:16
@gahyusuh gahyusuh marked this pull request as ready for review January 30, 2024 18:32
@gahyusuh gahyusuh requested a review from a team as a code owner January 30, 2024 18:32
action_status=ActionStatus(
state=ActionState.FAILED,
fail_message=(
"Input syncing failed due to successive low transfer rates (< 10 Kb/s). "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this rate in based on the LOW_TRANSFER_RATE_THRESHOLD value? If that constant is changed, this message will be inaccurate, and that could be confusing to diagnose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I have addressed it in the revision.

state=ActionState.FAILED,
fail_message=(
"Input syncing failed due to successive low transfer rates (< 10 Kb/s). "
"The transfer rate was below the threshold for the last 1 minute."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same applies to the "1 minute" here, if the constant defining that is changed, this could be confusing.

failure_reason=(
"Insufficient download speed: "
"Input syncing failed due to successive low transfer rates (< 10 Kb/s). "
"The transfer rate was below the threshold for the last 1 minute."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the message to constant consistency here, maybe it should be taking the message from the ActionStatus.

@gahyusuh gahyusuh force-pushed the gahyusuh/sija_cancel_due_to_low_speed branch 2 times, most recently from 32aa9a2 to 9eafde2 Compare February 12, 2024 14:57
pyproject.toml Outdated
"E501",
"E722",
"F811",
]
line-length = 100

[tool.ruff.isort]
[tool.ruff.lint.isort]
Copy link
Contributor Author

@gahyusuh gahyusuh Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated these option names, as I was getting the below warnings when running hatch run fmt or hatch run lint

warning: The top-level linter settings are deprecated in favour of their counterparts in the `lint` section. Please update the following options in `pyproject.toml`:
  - 'ignore' -> 'lint.ignore'
  - 'isort' -> 'lint.isort'

marofke
marofke previously approved these changes Feb 12, 2024
Copy link
Contributor

@marofke marofke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great to see some extra safeguards being added! 🚢 🦈

…below threshold

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
@gahyusuh gahyusuh force-pushed the gahyusuh/sija_cancel_due_to_low_speed branch from bdde066 to ca77fa0 Compare February 23, 2024 17:21
@gahyusuh gahyusuh merged commit c49bbb4 into mainline Feb 23, 2024
9 checks passed
@gahyusuh gahyusuh deleted the gahyusuh/sija_cancel_due_to_low_speed branch February 23, 2024 18:23
gmchale79 pushed a commit that referenced this pull request Mar 11, 2024
…below threshold (#143)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
jusiskin pushed a commit to jusiskin/deadline-cloud-worker-agent that referenced this pull request Sep 4, 2024
Problem:

The PosixInstanceWorker and WindowsInstanceWorker classes implement a
CMF Worker based on an ec2 host. In both cases they start with an empty
AMI image (AL2023 or Win Server 2022) then set up users, install the
agent, etc. Tests that need to use an AMI that already has users and the
agent installed end up having to hack around a lot of assumptions in
these classes, or reimplement everything for an ec2-based worker from
scratch.

Solution:

Refactor the PosixInstanceWorker and WindowsInstanceWorker to split them
into os-specific base classes that contain all of the generic worker
implementation stuff (starting the worker, getting its id, etc), and a
derived class that implements the user-setup, agent install, and the
like. This makes space for tests that use an AMI with the users & agent
pre-setup to derive from the os-specific base classes to get the generic
ec2-worker test functionality.

BREAKING CHANGE: PosixInstanceWorker has been renamed to
PosixInstanceBuildWorker, and WindowsInstanceWorker has been renamed to
WindowsInstanceBuildWorker.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants