Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide support for large job arguments #1501

Merged
merged 22 commits into from
Sep 26, 2024
Merged

Conversation

psschwei
Copy link
Collaborator

Summary

Provide support for passing large arguments (>1MB) for programs / functions. Utility-scale circuits, for example.

Details and comments

We used to use the Ray environmental variable ENV_JOB_ARGUMENTS to pass arguments to programs at runtime. While this is fine for small arguments, Ray wasn't able to handle larger ones (like circuits at utlity scale using 100+ qubits), which resulted in the job being stuck in the pending / initializing phase.

So instead of trying to pass arguments by environmental variable, this PR updates the submission flow to write arguments to a file (arguments.serverless, overwriting any existing file) in the working directory so it will be included when the job is sent to the Ray cluster.

Since this is happening in the gateway, it won't run into the MAX_ARTIFACT_FILE_SIZE_MB limitation, which is checked in the client before arguments would be added.

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
@psschwei
Copy link
Collaborator Author

Still todo: add tests for the new argument reading functionality

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
@psschwei
Copy link
Collaborator Author

🎉

@Tansito
Copy link
Member

Tansito commented Sep 24, 2024

Thank you so much @psschwei ! Really awesome work 🙏

Tomorrow I will take a look calmly. @IceKhan13 , @akihikokuroda this feature is important and introduces some good changes I would appreciate if you can invest some time reviewing it too ❤️

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
akihikokuroda
akihikokuroda previously approved these changes Sep 25, 2024
Copy link
Collaborator

@akihikokuroda akihikokuroda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

Copy link
Member

@Tansito Tansito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@psschwei this looks really good, thank you so much. I only have one question:

I understand that this is a middle term solution for two reasons:

  • we are limiting the execution to one Job per user because we are using the same file-name for each Job.
  • we continue storing in the database the arguments and the final idea is to store those arguments in the COS.

are my assumptions correct reading the PR? (I think as a temporal patch is fine, it is just to be sure).

@psschwei
Copy link
Collaborator Author

  • we are limiting the execution to one Job per user because we are using the same file-name for each Job.

It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact:

working_directory_for_upload = os.path.join(
sanitize_file_path(str(settings.MEDIA_ROOT)),
"tmp",
str(uuid.uuid4()),
)

and we save the arguments.serverless file in that temporary directory. So each job should have its own distinct arguments file.

  • we continue storing in the database the arguments and the final idea is to store those arguments in the COS.

I don't know... the more I think about it, the less I like the idea of putting the arguments in COS. Partially because it adds a layer of complexity that I don't think we need (splitting the job data between DB and COS). Partially because it ended up being pretty easy to add the arguments as a file rather than an envvar. And partially because if its a question of needing to save space, I think it might be better to be a bit more aggressive about somehow archiving DB records rather than trying to split them. But we can discuss it more on our next sync.

@Tansito
Copy link
Member

Tansito commented Sep 25, 2024

It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact

I need to admit that I was not aware of this part in the logic of the submit. Are we removing these entries at some point? Do you know?

we can discuss it more on our next sync.

I see your points. Definitely we will need to analyze it.

Other thing that I just figured out analyzing the code is that this is a breaking change, isn't it? How we are changing the get_arguments in the client functions using the old get_arguments will need a migration from current functions.

@psschwei
Copy link
Collaborator Author

Are we removing these entries at some point? Do you know?

I don't think so... we should probably look into setting an expiration policy

the code is that this is a breaking change, isn't it?

Hmm... yeah, it is. This will require folks to upgrade their client. But I don't think there's any way around that -- seems like the max size of an envvar in Kubernetes is ~1MB.

Well, in theory we could do something like set the envvar if the args are <1MB. So anybody impacted by this issue would have to upgrade, but those who weren't impacted could keep going as before.... depends how much impact a breaking change would have...

@Tansito
Copy link
Member

Tansito commented Sep 25, 2024

I don't think so... we should probably look into setting an expiration policy

Yeah, it has sense. We can discuss about this in our sync next week too. I was thinking that maybe once time a Job finishes we can cleanup resources or something like that.

depends how much impact a breaking change would have...

Let me open the conversation with, Paco. I think it's a good opportunity to test a process around this. Thanks Paul 🙏

IceKhan13
IceKhan13 previously approved these changes Sep 25, 2024
Copy link
Member

@IceKhan13 IceKhan13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

On topic of storing args in 2 locations: I think I agree with you on need to decide which approach to take. Maybe have args, results and logs in storage and archive them after some point. And remove them from DB or leave only pointers. But for that we should be careful on writing data to users folder

@Tansito Tansito added the on-hold On hold due to any reason label Sep 25, 2024
@Tansito
Copy link
Member

Tansito commented Sep 25, 2024

@psschwei @IceKhan13 @akihikokuroda I'm going to put this on-hold taking into account that introduces a breaking change so we cannot merge it right now. Let's try to find a way to be able to avoid the breaking change. I'm going to try to think in a proposal, feel free to propose something too 👍

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
@psschwei
Copy link
Collaborator Author

from a slack discussion, here's the new plan:

after update, server sets:
* ENV_JOB_ARGUMENTS envvar
* serverless.arguments file

old client:
* reads from ENV_JOB_ARGUMENTS 

new client:
*  reads from serverless.arguments

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
@psschwei
Copy link
Collaborator Author

Note that the logs noting that arguments were/weren't passed through the envvar are in the gateway pod, not the scheduler.

@psschwei psschwei dismissed stale reviews from IceKhan13 and akihikokuroda September 26, 2024 12:53

changes

@psschwei psschwei removed the on-hold On hold due to any reason label Sep 26, 2024
@psschwei
Copy link
Collaborator Author

to test the backwards compatibility, deploy the latest gateway code but set v0.16.3 for the ray node in the rayclustertemplate

@Tansito Tansito self-requested a review September 26, 2024 12:57
Copy link
Member

@Tansito Tansito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @psschwei . I was running some tests and it seems everything is working too but we can run some more tests in staging before promote the release to production.

@psschwei psschwei merged commit c0d0409 into Qiskit:main Sep 26, 2024
10 checks passed
@psschwei psschwei deleted the job-args branch September 26, 2024 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants