Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot/Serve] Optimize the translation of filemounts #4016

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

warrickhe
Copy link

@warrickhe warrickhe commented Sep 28, 2024

(Work in Progress) PR to fix #3188. Added compression to local file -> intermediate bucket translation. Still working on the decompression side.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@warrickhe warrickhe changed the title add compression to local files [Spot/Serve] Optimize the translation of filemounts Sep 28, 2024
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature @warrickhe ! It would be really helpful for us to reduce the file upload latency. Left some discussion ; )

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/data/storage.py Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR @warrickhe! This is awesome.

Haven't looked into the code, but just a reminder in case we have missed it in our implementation: our original behavior for rsync will exclude files specified with .gitignore. Just curious, if we have this behavior in our compression logic as well. : )

@warrickhe
Copy link
Author

Haven't looked into the code, but just a reminder in case we have missed it in our implementation: our original behavior for rsync will exclude files specified with .gitignore. Just curious, if we have this behavior in our compression logic as well. : )

Good point! I will modify the code along with Tian's suggestion to maintain the original behaviour .

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warrickhe Thanks for the prompt fix! I tried this with sky serve (examples/serve/http_server/task.yaml, I forcefully enabled the compress feature.) but seems like i got a lot of zipped files in the cloud bucket, while expected to only get one. Not sure if it has anything to do with the other file mounts. Could you please take a look on this?

image
I 10-08 11:51:15 backend_utils.py:1343] Syncing (to 1 node): /tmp/service-task-new-http-2ok7ehj6 -> ~/.sky/serve/new_http/task.yaml.tmp
I 10-08 11:51:16 backend_utils.py:1343] Syncing (to 1 node): /tmp/tmp3oxekj8x -> ~/.sky/serve/new_http/config.yaml
I 10-08 11:51:16 backend_utils.py:1343] Syncing (to 1 node): /home/txia/.sky/catalogs/v5/aws/images.csv -> ~/.sky/catalogs/v5/aws/images.csv
I 10-08 11:51:16 backend_utils.py:1343] Syncing (to 1 node): /home/txia/.sky/catalogs/v5/aws/az_mappings-6d23a727.csv -> ~/.sky/catalogs/v5/aws/az_mappings-6d23a727.csv
I 10-08 11:51:17 backend_utils.py:1343] Syncing (to 1 node): /home/txia/.sky/catalogs/v5/azure/images.csv -> ~/.sky/catalogs/v5/azure/images.csv
I 10-08 11:51:17 backend_utils.py:1343] Syncing (to 1 node): /home/txia/.sky/catalogs/v5/azure/vms.csv -> ~/.sky/catalogs/v5/azure/vms.csv
I 10-08 11:51:17 backend_utils.py:1343] Syncing (to 1 node): /home/txia/.sky/catalogs/v5/azure/az_ml_vms.csv -> ~/.sky/catalogs/v5/azure/az_ml_vms.csv
I 10-08 11:51:18 backend_utils.py:1343] Syncing (to 1 node): /home/txia/.sky/catalogs/v5/gcp/vms.csv -> ~/.sky/catalogs/v5/gcp/vms.csv

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@warrickhe
Copy link
Author

warrickhe commented Oct 10, 2024

I forcefully enabled the compress feature.) but seems like i got a lot of zipped files in the cloud bucket, while expected to only get one. Not sure if it has anything to do with the other file mounts.

Could you show how you forcefully enabled compression in so I may try to replicate this issue? It's seems like that my code is trying to compress every file individually, which leads me to believe that the way you forcefully enable compression may have been incorrect. My code could also be going a bit haywire, but it's hard to say without access to the bucket and workspace. What is confusing is how there are 20 files when the workdir should only have 2, I will try to dig into this
As for the filemount concern, I don't think that would be the cause since the task.yaml only has a workdir.

@cblmemo
Copy link
Collaborator

cblmemo commented Oct 10, 2024

I forcefully enabled the compress feature.) but seems like i got a lot of zipped files in the cloud bucket, while expected to only get one. Not sure if it has anything to do with the other file mounts.

Could you show how you forcefully enabled compression in so I may try to replicate this issue? It's seems like that my code is trying to compress every file individually, which leads me to believe that the way you forcefully enable compression may have been incorrect. My code could also be going a bit haywire, but it's hard to say without access to the bucket and workspace. What is confusing is how there are 20 files when the workdir should only have 2, I will try to dig into this As for the filemount concern, I don't think that would be the cause since the task.yaml only has a workdir.

I hardcode the following 2 if conditions to True:

image image

@warrickhe
Copy link
Author

I hardcode the following 2 if conditions to True:

image image

This shouldn't be right. Not sure how you're getting compressed files at all, the second True will turn off all compression. We enable compression only if num_files(self.source, excluded_list) > 10, so you actually want to set that to False.

sky/task.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
Comment on lines 1015 to 1017
if file.startswith('skypilot-filemounts-'
) and file.endswith('.tar.gz'):
logger.warning(f'Consider renaming the file: {file}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should directly raise an error here since IIUC this will essentially remove the user file by override them?

Copy link
Author

@warrickhe warrickhe Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be better to leave it as a warning and call _sync_store from here, so the run will complete?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spot/Serve] Optimize the translation of filemounts
3 participants