Provide support for large job arguments #1501

psschwei · 2024-09-24T21:01:47Z

Summary

Provide support for passing large arguments (>1MB) to programs / functions, for example utility-scale circuits.

Details and comments

We used to use the Ray environment variable ENV_JOB_ARGUMENTS to pass arguments to programs at runtime. While this is fine for small arguments, Ray wasn't able to handle larger ones (like utility-scale circuits using 100+ qubits), which left the job stuck in the pending / initializing phase.

So instead of trying to pass arguments by environment variable, this PR updates the submission flow to write arguments to a file (arguments.serverless, overwriting any existing file) in the working directory so it will be included when the job is sent to the Ray cluster.
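As a rough illustration of the file-based approach, here is a minimal sketch. It is not the gateway code from this PR: the helper name `write_arguments` is hypothetical, and the arguments are assumed to be already serialized to a string. Only the file name `arguments.serverless` and the overwrite behavior come from the description above.

```python
import json
import os
import tempfile

# File name taken from the PR description.
ARGUMENTS_FILE = "arguments.serverless"


def write_arguments(working_directory: str, arguments: str) -> str:
    """Write serialized arguments into the working directory.

    Hypothetical helper: opening with mode "w" truncates the file,
    so any existing arguments file is overwritten.
    """
    path = os.path.join(working_directory, ARGUMENTS_FILE)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(arguments)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # A payload well above 1MB, which would not fit in an envvar.
        payload = json.dumps({"circuit": "x" * 2_000_000})
        path = write_arguments(workdir, payload)
        print(path, os.path.getsize(path))
```

Because the file lives in the working directory, it travels with the rest of the job artifact rather than through the process environment.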

Since this is happening in the gateway, it won't run into the MAX_ARTIFACT_FILE_SIZE_MB limitation, which is checked in the client before arguments would be added.

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

psschwei · 2024-09-24T21:02:33Z

Still todo: add tests for the new argument reading functionality

psschwei · 2024-09-24T22:01:10Z

🎉

Tansito · 2024-09-24T23:08:43Z

Thank you so much @psschwei ! Really awesome work 🙏

I will take a careful look tomorrow. @IceKhan13 , @akihikokuroda this feature is important and introduces some good changes, so I would appreciate it if you could invest some time reviewing it too ❤️

akihikokuroda

LGTM. Thanks!

Tansito

@psschwei this looks really good, thank you so much. I only have one question:

I understand that this is a mid-term solution for two reasons:

* we are limiting the execution to one Job per user because we are using the same file-name for each Job.
* we continue storing in the database the arguments and the final idea is to store those arguments in the COS.

Are my assumptions correct from reading the PR? (I think it is fine as a temporary patch; I just want to be sure.)

psschwei · 2024-09-25T19:17:45Z

we are limiting the execution to one Job per user because we are using the same file-name for each Job.

It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact:

qiskit-serverless/gateway/api/ray.py

Lines 87 to 91 in 15fcb4e

    working_directory_for_upload = os.path.join(
        sanitize_file_path(str(settings.MEDIA_ROOT)),
        "tmp",
        str(uuid.uuid4()),
    )

and we save the arguments.serverless file in that temporary directory. So each job should have its own distinct arguments file.
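To make the point concrete, here is a small self-contained sketch of why a shared file name does not collide. It mirrors the `os.path.join(..., "tmp", str(uuid.uuid4()))` pattern quoted above, but `make_job_directory` and the `media_root` parameter are hypothetical stand-ins, and `sanitize_file_path` is left out.

```python
import os
import tempfile
import uuid

ARGUMENTS_FILE = "arguments.serverless"


def make_job_directory(media_root: str) -> str:
    """Create a unique per-job directory under <media_root>/tmp (hypothetical helper)."""
    path = os.path.join(media_root, "tmp", str(uuid.uuid4()))
    os.makedirs(path)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as media_root:
        first = make_job_directory(media_root)
        second = make_job_directory(media_root)
        # Same file name, different directories: the two jobs never share a file.
        for directory, content in ((first, "job one"), (second, "job two")):
            with open(os.path.join(directory, ARGUMENTS_FILE), "w", encoding="utf-8") as fh:
                fh.write(content)
        print(first != second)
```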

we continue storing in the database the arguments and the final idea is to store those arguments in the COS.

I don't know... the more I think about it, the less I like the idea of putting the arguments in COS. Partially because it adds a layer of complexity that I don't think we need (splitting the job data between DB and COS). Partially because it ended up being pretty easy to add the arguments as a file rather than an envvar. And partially because if it's a question of needing to save space, I think it might be better to be a bit more aggressive about somehow archiving DB records rather than trying to split them. But we can discuss it more on our next sync.

Tansito · 2024-09-25T19:31:39Z

It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact

I have to admit that I was not aware of this part of the submit logic. Are we removing these entries at some point? Do you know?

we can discuss it more on our next sync.

I see your points. Definitely we will need to analyze it.

Another thing I just noticed while analyzing the code is that this is a breaking change, isn't it? Since we are changing get_arguments in the client, existing functions that use the old get_arguments will need to be migrated.

psschwei · 2024-09-25T20:10:18Z

Are we removing these entries at some point? Do you know?

I don't think so... we should probably look into setting an expiration policy

the code is that this is a breaking change, isn't it?

Hmm... yeah, it is. This will require folks to upgrade their client. But I don't think there's any way around that -- seems like the max size of an envvar in Kubernetes is ~1MB.

Well, in theory we could do something like set the envvar if the args are <1MB. So anybody impacted by this issue would have to upgrade, but those who weren't impacted could keep going as before.... depends how much impact a breaking change would have...

Tansito · 2024-09-25T20:15:58Z

I don't think so... we should probably look into setting an expiration policy

Yeah, that makes sense. We can discuss this in our sync next week too. I was thinking that maybe once a Job finishes we can clean up resources or something like that.
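One possible shape for such an expiration policy, purely as a sketch: nothing like this exists in the PR, and the function name, the age threshold, and the use of directory modification time are all assumptions made for illustration.

```python
import os
import shutil
import tempfile
import time
from typing import List, Optional


def remove_expired_job_directories(
    tmp_root: str, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Delete job directories under tmp_root older than max_age_seconds.

    Hypothetical helper: age is judged by the directory's mtime.
    Returns the sorted names of the directories that were removed.
    """
    now = time.time() if now is None else now
    # Collect first, delete afterwards, so the scan is not disturbed.
    expired = []
    with os.scandir(tmp_root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            if now - entry.stat(follow_symlinks=False).st_mtime > max_age_seconds:
                expired.append(entry.path)
    for path in expired:
        shutil.rmtree(path)
    return sorted(os.path.basename(path) for path in expired)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "finished-job"))
        two_hours_ago = time.time() - 7200
        os.utime(os.path.join(root, "finished-job"), (two_hours_ago, two_hours_ago))
        print(remove_expired_job_directories(root, max_age_seconds=3600))
```

Cleaning up when a job finishes, as suggested above, would avoid the need for a time threshold altogether; the age-based variant is shown only because it needs no knowledge of job state.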

depends how much impact a breaking change would have...

Let me open the conversation with Paco. I think it's a good opportunity to test a process around this. Thanks Paul 🙏

IceKhan13

LGTM.

On the topic of storing args in 2 locations: I agree that we need to decide which approach to take. Maybe keep args, results and logs in storage and archive them after some point, then remove them from the DB or leave only pointers. But for that we should be careful about writing data to the users' folder.

Tansito · 2024-09-25T20:31:06Z

@psschwei @IceKhan13 @akihikokuroda I'm going to put this on hold, since it introduces a breaking change and we cannot merge it right now. Let's try to find a way to avoid the breaking change. I'm going to try to think of a proposal, feel free to propose something too 👍

psschwei · 2024-09-25T23:38:37Z

From a Slack discussion, here's the new plan:

after update, server sets:
* ENV_JOB_ARGUMENTS envvar
* arguments.serverless file

old client:
* reads from ENV_JOB_ARGUMENTS

new client:
* reads from arguments.serverless
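The plan above could look roughly like the following sketch. The names `prepare_job`, `read_arguments_old_client`, and `read_arguments_new_client` are hypothetical, plain JSON stands in for whatever serialization the project actually uses, and the 1,000,000-byte cutoff is an assumption based on the "~1MB" figure mentioned earlier. How the server treats the envvar for oversized arguments is not spelled out here; this sketch simply omits it.

```python
import json
import os
import tempfile
from typing import Dict, Mapping, Optional

ARGUMENTS_FILE = "arguments.serverless"
ENV_JOB_ARGUMENTS = "ENV_JOB_ARGUMENTS"
MAX_ENV_BYTES = 1_000_000  # assumed cutoff, based on the "~1MB" figure


def prepare_job(working_directory: str, arguments: str) -> Dict[str, str]:
    """Server side: always write the file; set the envvar only for small arguments."""
    with open(os.path.join(working_directory, ARGUMENTS_FILE), "w", encoding="utf-8") as fh:
        fh.write(arguments)
    env: Dict[str, str] = {}
    if len(arguments.encode("utf-8")) < MAX_ENV_BYTES:
        env[ENV_JOB_ARGUMENTS] = arguments
    return env


def read_arguments_old_client(env: Optional[Mapping[str, str]] = None) -> dict:
    """Old client: reads from the environment variable only."""
    source = os.environ if env is None else env
    return json.loads(source.get(ENV_JOB_ARGUMENTS, "{}"))


def read_arguments_new_client(working_directory: str = ".") -> dict:
    """New client: reads from the arguments file in the working directory."""
    with open(os.path.join(working_directory, ARGUMENTS_FILE), encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        env = prepare_job(workdir, json.dumps({"shots": 1024}))
        print(read_arguments_old_client(env), read_arguments_new_client(workdir))
```

With this split, users whose arguments fit in the envvar can keep their old client, while anyone hitting the size limit has to upgrade to the file-reading client.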

psschwei · 2024-09-26T01:11:17Z

Note that the logs noting that arguments were/weren't passed through the envvar are in the gateway pod, not the scheduler.

changes

psschwei · 2024-09-26T12:56:20Z

to test the backwards compatibility, deploy the latest gateway code but set v0.16.3 for the ray node in the rayclustertemplate

Tansito

LGTM, thank you @psschwei . I was running some tests and everything seems to be working, but we can run some more tests in staging before promoting the release to production.

pass arguments by file instead of by envvar

6cf6f11

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

psschwei added 9 commits September 24, 2024 17:06

lint

ec21b87

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

more lint

61e5d82

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

d60b7b9

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

06ae134

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

0307536

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

update tests

712a9f7

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

d3ae22a

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

fix test

cd9c4a9

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

f0c91d8

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

always overwrite arguments file

b81a789

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

akihikokuroda previously approved these changes Sep 25, 2024

Tansito requested review from Tansito and IceKhan13 September 25, 2024 18:15

Tansito reviewed Sep 25, 2024

IceKhan13 previously approved these changes Sep 25, 2024

Tansito added the on-hold On hold due to any reason label Sep 25, 2024

psschwei added 5 commits September 25, 2024 17:06

set envvar if args less than 1MB

c47e934

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

fix test

19f19e2

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

fix test better

baf5226

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

f18a5d5

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

comma

07e9fa9

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

7118d44

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

psschwei added 5 commits September 25, 2024 20:32

more accurate sizing of args

8c94129

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

lint

e005cf7

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

pass invalid json for envvar args to force failure

0661e7a

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

fix tests

65b5085

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

update arguments when using envvar

d291205

Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>

psschwei requested review from akihikokuroda and IceKhan13 September 26, 2024 12:53

psschwei removed the on-hold On hold due to any reason label Sep 26, 2024

Tansito self-requested a review September 26, 2024 12:57

Tansito approved these changes Sep 26, 2024

psschwei merged commit c0d0409 into Qiskit:main Sep 26, 2024
10 checks passed

psschwei deleted the job-args branch September 26, 2024 13:48
