This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Current implementation of job-specific token will cause large overhead when there are a lot of jobs #5301

Closed
hzy46 opened this issue Feb 10, 2021 · 1 comment

@hzy46
Contributor

hzy46 commented Feb 10, 2021

As introduced in #5292, we create a job-specific token for each job submission. However, during a stress test, I found that this brings significant overhead to the cluster.

For example:

  • If N jobs are submitted, N tokens will be created.
  • The rest-server API verify will call purge. purge takes O(N) time when N tokens are present.
  • The SSH plugin in each of these N jobs needs to call the rest-server, which invokes the verify function internally.
  • Overall, O(N^2) overhead is brought to the rest-server, the database, and the API server.
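The quadratic cost above can be illustrated with a minimal sketch. The names below (`purge`, `verify`, the token store) are simplified stand-ins, not the actual rest-server code; the point is only that every `verify` pays a full O(N) scan, so N verifications cost O(N^2):

```python
# Minimal sketch (hypothetical names) of why N verify() calls cost O(N^2).
# Assumes purge() scans every stored token to drop expired ones, as described above.
import math

tokens = {}  # token -> expiry timestamp, all kept in one shared store

def purge(now):
    """Drop expired tokens; an O(N) scan over all stored tokens."""
    expired = [t for t, exp in tokens.items() if exp <= now]
    for t in expired:
        del tokens[t]

def verify(token, now):
    """Each verification first purges, so it pays the O(N) scan."""
    purge(now)
    return token in tokens

# N jobs each create a token, and each job's SSH plugin verifies once:
N = 1000
for i in range(N):
    tokens[f"job-{i}"] = math.inf  # never expires during the test

scans = 0
for i in range(N):
    scans += len(tokens)      # work done by the purge inside this verify
    assert verify(f"job-{i}", now=0.0)

print(scans)  # N * N = 1_000_000 token inspections in total
```

With N = 1000 jobs, the SSH plugins collectively trigger a million token inspections, which matches the O(N^2) figure in the bullet above.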

Another potential issue: we save all tokens in one secret object, but Kubernetes objects have a 1 MiB size limit. This caps the number of jobs a user can run at the same time.
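A back-of-the-envelope estimate shows how quickly that limit bites. The per-token size below is an assumption for illustration (a signed JWT plus its key can easily approach 1 KB), not a measured value:

```python
# Rough capacity estimate for a single Kubernetes Secret holding all tokens.
SECRET_LIMIT_BYTES = 1 * 1024 * 1024  # 1 MiB Kubernetes object size cap
TOKEN_ENTRY_BYTES = 900               # ASSUMED size of one stored JWT entry

max_tokens = SECRET_LIMIT_BYTES // TOKEN_ENTRY_BYTES
print(max_tokens)  # -> 1165 tokens, i.e. ~1165 concurrent jobs at most
```

Under that assumption, a single user (or the cluster, depending on how the secret is scoped) hits the ceiling at roughly a thousand concurrent jobs, well within reach of a stress test.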

One possible solution is to create just one job-specific token per user, and delete it when the user has no active jobs:

  • In purge, query the database to check whether the user has any incomplete jobs. If there are none, remove the user's job-specific token.
  • In the submit-job API, create a job-specific token if none exists, and generate the tokenSecretDef for the database controller.
  • The logic of the database controller and the runtime remains the same.
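The per-user lifecycle above can be sketched as follows. All names here (`user_tokens`, `submit_job`, the job-state strings) are hypothetical illustrations of the proposal, not the actual rest-server or database-controller code:

```python
# Sketch of the proposed per-user token lifecycle (hypothetical names).

user_tokens = {}   # username -> the user's single job-specific token
jobs = []          # (username, state) rows standing in for the jobs database

COMPLETED_STATES = ("SUCCEEDED", "FAILED", "STOPPED")

def has_incomplete_jobs(username):
    """Stand-in for the database query described in the first bullet."""
    return any(u == username and state not in COMPLETED_STATES
               for u, state in jobs)

def purge(username):
    """Remove the user's token only when all of their jobs have completed."""
    if username in user_tokens and not has_incomplete_jobs(username):
        del user_tokens[username]

def submit_job(username, job_name):
    """Create the per-user token only if it does not exist yet."""
    token = user_tokens.setdefault(username, f"token-for-{username}")
    jobs.append((username, "RUNNING"))
    # tokenSecretDef is handed to the database controller, as in the proposal
    return {"jobName": job_name, "tokenSecretDef": token}

submit_job("alice", "job-1")
submit_job("alice", "job-2")  # reuses alice's existing token
print(len(user_tokens))       # -> 1, regardless of how many jobs alice runs
```

Because purge now only runs a single per-user existence check instead of scanning all tokens, verification cost no longer grows with the number of jobs, and the shared secret holds one entry per user rather than one per job.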
@suiguoxin
Member

Fixed in #5310 and microsoft/openpai-runtime#36

@suiguoxin suiguoxin self-assigned this Feb 24, 2021