limit to 150 concurrent jobs per workflow #216

Merged 1 commit on Dec 3, 2024

Conversation

jameslamb
Member

@jameslamb jameslamb commented Dec 3, 2024

Contributes to #162

Since I started paying close attention, I've observed that most CI runs here have at least 10 failures on their first run, with errors indicating we hit DockerHub rate limits:

ERROR: toomanyrequests: Too Many Requests (HAP429).

This proposes using @ajschmidt8's recommendation from rapidsai/miniforge-cuda#72 (comment) to limit the number of concurrent jobs.

Notes for Reviewers

Benefits of this change

Reduces the impact of this repo on total CPU runner availability for projects using NVIDIA runners.

Reduces the likelihood that a human will have to retrigger a build here (which costs time and money, and is easy to miss on branch builds after merges).

Why set the limit to 150?

This is not an exact science haha. I'm just looking for a number that meets these constraints:

  • fewer concurrent jobs than this repo currently uses
  • does not increase CI times too much

Some relevant information:

  • current number of build jobs per workflow run: 270
  • worst-case (uncached) DockerHub image pulls per workflow run: 450
    • ci-conda / miniforge-cuda:
      • pulls: 4 (nvidia/cuda, condaforge/miniforge3, mikefarah/yq, amazon/aws-cli)
      • builds: 90
      • total: 360
    • ci-wheel:
      • pulls: 1 (amazon/aws-cli... base image is from NVCR)
      • builds: 12
      • total: 12
    • citestwheel:
      • pulls: 1 (amazon/aws-cli ... base image is from NVCR)
      • builds: 78
      • total: 78
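
As a sanity check on the worst-case figure, here is a minimal Python sketch of the arithmetic behind the list above. It assumes every build pulls each listed DockerHub image exactly once with nothing cached; the variable names are only for illustration.

```python
# (image group, DockerHub pulls per build, builds per workflow run),
# copied from the list above
groups = [
    ("ci-conda / miniforge-cuda", 4, 90),
    ("ci-wheel", 1, 12),
    ("citestwheel", 1, 78),
]

# Worst case (assumption): nothing is cached, so every build pulls every image.
pulls_by_group = {name: pulls * builds for name, pulls, builds in groups}
total_pulls = sum(pulls_by_group.values())

for name, total in pulls_by_group.items():
    print(f"{name}: {total}")
print(f"worst-case pulls per workflow run: {total_pulls}")
```

The per-group totals (360, 12, 78) add up to the 450 worst-case pulls quoted above.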

So assuming full availability of linux-{aarch64,amd64}-cpu runners, and if all build jobs take roughly the same amount of time, changing from "unlimited" to 150 might mean roughly 1.8x the end-to-end time for a CI run here... from around 11 minutes to maybe 20 minutes.
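
The back-of-the-envelope math behind that estimate looks roughly like this in Python. It is only a sketch under the assumptions already stated (full runner availability, all jobs taking about the same time); the "waves" variant is an added assumption for a slightly more pessimistic bound, not something measured.

```python
import math

total_jobs = 270        # build jobs per workflow run (from the list above)
max_concurrent = 150    # proposed concurrency limit
unlimited_minutes = 11  # rough end-to-end time today, with no limit

# Throughput view: the same work spread over fewer concurrent slots.
slowdown = total_jobs / max_concurrent
estimated_minutes = unlimited_minutes * slowdown

# Wave view (assumption): jobs run in full batches of `max_concurrent`,
# and each batch takes about as long as the whole unlimited run does today.
waves = math.ceil(total_jobs / max_concurrent)
wave_minutes = unlimited_minutes * waves

print(f"slowdown: {slowdown:.1f}x, about {estimated_minutes:.0f} minutes")
print(f"waves: {waves}, about {wave_minutes} minutes")
```

Either way the estimate lands at roughly 20 minutes, which is the price paid for fewer rate-limit failures.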

I have no idea what the exact limit is from DockerHub. From https://docs.docker.com/docker-hub/download-rate-limit/#other-limits:

Docker Hub also has an overall rate limit... This limit applies to all requests to Hub properties including web pages, APIs, and image pulls. The limit is applied per-IP, and... the limit changes over time...it's in the order of thousands of requests per minute. ... [and] applies to all users equally regardless of account level.

The "overall limit" returns a simple 429 Too Many Requests response. The pull limit returns a longer error message that includes a link to this page.

@jameslamb jameslamb added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Dec 3, 2024
Contributor

@bdice bdice left a comment

If this works on the first run, we can merge it. I agree it's not an exact science, and 150 seems like a reasonable value.

@jameslamb
Member Author

Also worth noting... we got really unlucky with timing here. CI has been running for over an hour, but that's because around the time I started it, we hit a resource limit in the middle of switching regions for the underlying CI runners (@ajschmidt8 provided more details offline).

So yeah, if this passes I think we should merge it and then modify this value based on our experience with the next couple builds.

And I also think we should leave #162 open until we go a while without encountering these rate limits.

@jameslamb jameslamb changed the title from "WIP: limit to 150 concurrent jobs per workflow" to "limit to 150 concurrent jobs per workflow" Dec 3, 2024
@jameslamb jameslamb marked this pull request as ready for review December 3, 2024 17:03
@jameslamb jameslamb requested a review from a team as a code owner December 3, 2024 17:03
@jameslamb jameslamb requested review from KyleFromNVIDIA and removed request for a team December 3, 2024 17:03
@jameslamb jameslamb merged commit a5b59e8 into rapidsai:main Dec 3, 2024
406 checks passed
@jameslamb jameslamb deleted the rate-limits branch December 3, 2024 18:05