This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Current implementation of job-specific token will cause large overhead when there are a lot of jobs #5301

Closed
hzy46 opened this issue Feb 10, 2021 · 1 comment

@hzy46
Contributor

hzy46 commented Feb 10, 2021

As introduced in #5292, we create a job-specific token for each job submission. However, during a stress test, I found that this brings significant overhead to the cluster.

For example:

  • If N jobs are submitted, N tokens will be created.
  • The rest-server API verify will call purge. purge takes O(N) time when N tokens are present.
  • The SSH plugin in each of these N jobs needs to call the rest-server, which invokes the verify function internally.
  • Overall, O(N^2) overhead is brought to the rest-server, the database, and the API server.
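The quadratic cost above can be illustrated with a minimal sketch. The names below (`purge`, `verify`, the token store) are simplified stand-ins, not the actual rest-server code; the point is only that every `verify` pays a full O(N) scan, so N verifications cost O(N^2):

```python
# Minimal sketch (hypothetical names) of why N verify() calls cost O(N^2).
# Assumes purge() scans every stored token to drop expired ones, as described above.
import math

tokens = {}  # token -> expiry timestamp, all kept in one shared store

def purge(now):
    """Drop expired tokens; an O(N) scan over all stored tokens."""
    expired = [t for t, exp in tokens.items() if exp <= now]
    for t in expired:
        del tokens[t]

def verify(token, now):
    """Each verification first purges, so it pays the O(N) scan."""
    purge(now)
    return token in tokens

# N jobs each create a token, and each job's SSH plugin verifies once:
N = 1000
for i in range(N):
    tokens[f"job-{i}"] = math.inf  # never expires during the test

scans = 0
for i in range(N):
    scans += len(tokens)      # work done by the purge inside this verify
    assert verify(f"job-{i}", now=0.0)

print(scans)  # N * N = 1_000_000 token inspections in total
```

With N = 1000 jobs, the SSH plugins collectively trigger a million token inspections, which matches the O(N^2) figure in the bullet above.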

Another potential issue: we save all tokens in one secret object, but Kubernetes objects have a 1 MiB size limit. This caps the number of jobs a user can run at the same time.
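A back-of-the-envelope estimate shows how quickly that limit bites. The per-token size below is an assumption for illustration (a signed JWT plus its key can easily approach 1 KB), not a measured value:

```python
# Rough capacity estimate for a single Kubernetes Secret holding all tokens.
SECRET_LIMIT_BYTES = 1 * 1024 * 1024  # 1 MiB Kubernetes object size cap
TOKEN_ENTRY_BYTES = 900               # ASSUMED size of one stored JWT entry

max_tokens = SECRET_LIMIT_BYTES // TOKEN_ENTRY_BYTES
print(max_tokens)  # -> 1165 tokens, i.e. ~1165 concurrent jobs at most
```

Under that assumption, a single user (or the cluster, depending on how the secret is scoped) hits the ceiling at roughly a thousand concurrent jobs, well within reach of a stress test.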

One possible solution is to create just one job-specific token per user, and delete it when the user has no active jobs:

  • In purge, query the database to check whether the user has any incomplete jobs. If there are none, remove the user's job-specific token.
  • In the submit-job API, create a job-specific token if none exists, and generate the tokenSecretDef for the database controller.
  • The logic of the database controller and the runtime remains the same.
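The per-user lifecycle above can be sketched as follows. All names here (`user_tokens`, `submit_job`, the job-state strings) are hypothetical illustrations of the proposal, not the actual rest-server or database-controller code:

```python
# Sketch of the proposed per-user token lifecycle (hypothetical names).

user_tokens = {}   # username -> the user's single job-specific token
jobs = []          # (username, state) rows standing in for the jobs database

COMPLETED_STATES = ("SUCCEEDED", "FAILED", "STOPPED")

def has_incomplete_jobs(username):
    """Stand-in for the database query described in the first bullet."""
    return any(u == username and state not in COMPLETED_STATES
               for u, state in jobs)

def purge(username):
    """Remove the user's token only when all of their jobs have completed."""
    if username in user_tokens and not has_incomplete_jobs(username):
        del user_tokens[username]

def submit_job(username, job_name):
    """Create the per-user token only if it does not exist yet."""
    token = user_tokens.setdefault(username, f"token-for-{username}")
    jobs.append((username, "RUNNING"))
    # tokenSecretDef is handed to the database controller, as in the proposal
    return {"jobName": job_name, "tokenSecretDef": token}

submit_job("alice", "job-1")
submit_job("alice", "job-2")  # reuses alice's existing token
print(len(user_tokens))       # -> 1, regardless of how many jobs alice runs
```

Because purge now only runs a single per-user existence check instead of scanning all tokens, verification cost no longer grows with the number of jobs, and the shared secret holds one entry per user rather than one per job.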
@suiguoxin
Member

Fixed in #5310 and microsoft/openpai-runtime#36

@suiguoxin suiguoxin self-assigned this Feb 24, 2021