Provide support for large job arguments #1501
Conversation
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Still to do: add tests for the new argument-reading functionality.
🎉
Thank you so much @psschwei! Really awesome work 🙏 I will take a careful look tomorrow. @IceKhan13, @akihikokuroda, this feature is important and introduces some good changes; I would appreciate it if you could invest some time reviewing it too ❤️
LGTM. Thanks!
@psschwei this looks really good, thank you so much. I only have one question:
I understand that this is a mid-term solution for two reasons:
- we are limiting the execution to one Job per user, because we are using the same file name for each Job.
- we continue storing the arguments in the database, and the final idea is to store those arguments in the COS.

Are my assumptions correct reading the PR? (I think as a temporary patch it is fine, I just want to be sure.)
It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact (`qiskit-serverless/gateway/api/ray.py`, lines 87 to 91 in 15fcb4e) and we save the arguments file there.
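For context, the isolation described here can be sketched like this (the directory prefix and file name below are illustrative, not the actual values used in `ray.py`):

```python
import os
import tempfile

# Hypothetical fixed file name; the real artifact name may differ.
ARTIFACT_NAME = "artifact.tar"

# Each submission gets its own scratch directory, so a fixed file name
# cannot collide between jobs that are staged concurrently.
dir_a = tempfile.mkdtemp(prefix="job_")
dir_b = tempfile.mkdtemp(prefix="job_")

path_a = os.path.join(dir_a, ARTIFACT_NAME)
path_b = os.path.join(dir_b, ARTIFACT_NAME)

assert dir_a != dir_b    # unique directories per job...
assert path_a != path_b  # ...so identical file names don't clash
```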
I don't know... the more I think about it, the less I like the idea of putting the arguments in COS. Partially because it adds a layer of complexity that I don't think we need (splitting the job data between the DB and COS). Partially because it ended up being pretty easy to add the arguments as a file rather than an env var. And partially because if it's a question of needing to save space, I think it might be better to be a bit more aggressive about somehow archiving DB records rather than trying to split them. But we can discuss it more at our next sync.
I have to admit I was not aware of this part of the submit logic. Are we removing these entries at some point? Do you know?
I see your points. We will definitely need to analyze it. Another thing I just figured out while analyzing the code is that this is a breaking change, isn't it? How we are changing the
I don't think so... we should probably look into setting an expiration policy
Hmm... yeah, it is. This will require folks to upgrade their client. But I don't think there's any way around that -- it seems the max size of an env var in Kubernetes is ~1MB. Well, in theory we could do something like set the env var only if the args are <1MB. Then anybody impacted by this issue would have to upgrade, but those who weren't impacted could keep going as before... it depends how much impact a breaking change would have...
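The fallback Paul describes could look roughly like this (the function name and the 1 MiB constant are illustrative; the actual Kubernetes limit depends on the environment, and this sketch was not part of the PR):

```python
import json

# Approximate Kubernetes env var size limit discussed above (~1 MiB).
K8S_ENV_LIMIT_BYTES = 1024 * 1024

def choose_transport(arguments: dict) -> str:
    """Hypothetical sketch of the fallback: keep the env var path for
    small payloads so old clients still work, and switch to a file
    only for large arguments."""
    payload = json.dumps(arguments).encode("utf-8")
    if len(payload) < K8S_ENV_LIMIT_BYTES:
        return "envvar"  # old behavior, no client upgrade needed
    return "file"        # new behavior for large arguments

assert choose_transport({"n": 1}) == "envvar"
assert choose_transport({"blob": "x" * (2 * K8S_ENV_LIMIT_BYTES)}) == "file"
```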
Yeah, that makes sense. We can discuss this in our sync next week too. I was thinking that maybe once a Job finishes we can clean up resources or something like that.
Let me open the conversation with Paco. I think it's a good opportunity to test a process around this. Thanks Paul 🙏
LGTM.
On the topic of storing args in two locations: I think I agree with you on the need to decide which approach to take. Maybe keep args, results, and logs in storage and archive them after some point, and remove them from the DB or leave only pointers. But for that we should be careful about writing data to the user's folder.
@psschwei @IceKhan13 @akihikokuroda I'm going to put this on hold, given that it introduces a breaking change, so we cannot merge it right now. Let's try to find a way to avoid the breaking change. I'm going to try to think of a proposal; feel free to propose something too 👍
From a Slack discussion, here's the new plan:
Note that the logs noting whether or not arguments were passed through the env var are in the gateway pod, not the scheduler.
To test the backwards compatibility, deploy the latest gateway code but set v0.16.3 for the Ray node in the rayclustertemplate.
LGTM, thank you @psschwei. I was running some tests and it seems everything is working, but we can run some more tests in staging before promoting the release to production.
Summary
Provide support for passing large arguments (>1MB) to programs / functions (for example, utility-scale circuits).
Details and comments
We used to use the Ray environment variable `ENV_JOB_ARGUMENTS` to pass arguments to programs at runtime. While this is fine for small arguments, Ray wasn't able to handle larger ones (like utility-scale circuits using 100+ qubits), which resulted in the job being stuck in the pending / initializing phase.

So instead of trying to pass arguments via environment variable, this PR updates the submission flow to write arguments to a file (`arguments.serverless`, overwriting any existing file) in the working directory, so that it is included when the job is sent to the Ray cluster. Since this happens in the gateway, it won't run into the `MAX_ARTIFACT_FILE_SIZE_MB` limitation, which is checked in the client before arguments would be added.
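The write/read flow described above can be sketched roughly as follows (function names and the use of JSON serialization are assumptions for illustration; only the `arguments.serverless` file name comes from the PR itself):

```python
import json
import os

ARGUMENTS_FILE = "arguments.serverless"  # file name used by this PR

def write_arguments(working_dir: str, arguments: dict) -> str:
    """Gateway side (sketch): write job arguments into the working
    directory, overwriting any existing file, so they travel with the
    artifact to the Ray cluster instead of through an env var."""
    path = os.path.join(working_dir, ARGUMENTS_FILE)
    with open(path, "w", encoding="utf-8") as f:  # "w" truncates/overwrites
        json.dump(arguments, f)
    return path

def read_arguments(working_dir: str) -> dict:
    """Runtime side (sketch): load the arguments back from the file."""
    path = os.path.join(working_dir, ARGUMENTS_FILE)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Because the file is written server-side in the gateway, its size is bounded only by the artifact staging area, not by the client-side `MAX_ARTIFACT_FILE_SIZE_MB` check or the ~1MB Kubernetes env var limit.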