MultiPartCopy with Sync Algorithm #4475

Merged: 15 commits merged on Mar 12, 2024

Conversation

bencrabtree
Collaborator

@bencrabtree bencrabtree commented Mar 4, 2024

Issue #, if available:

Description of changes:

Implements MultiPartCopy of JumpStart data into the Hub bucket and uses an S3 sync algorithm to determine which files need to be copied and which can be left in place. The algorithm is based on the s3 sync command from the aws-cli repository.
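To make the sync decision concrete, here is a minimal sketch of a size-and-last-modified comparison in the spirit of the aws-cli default. It is not the code from this PR: the `FileInfo` dataclass and `files_to_copy` function are hypothetical stand-ins, and the sketch assumes source and destination keys are directly comparable (a real Hub bucket may add a prefix).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for the PR's FileInfo: an S3 key plus metadata."""

    key: str
    size: int
    last_updated: datetime


def files_to_copy(src: Iterable[FileInfo], dest: Iterable[FileInfo]) -> List[FileInfo]:
    """Return source files that are missing from, or differ from, the destination."""
    dest_by_key: Dict[str, FileInfo] = {f.key: f for f in dest}
    selected: List[FileInfo] = []
    for src_file in src:
        dest_file = dest_by_key.get(src_file.key)
        if dest_file is None:
            selected.append(src_file)  # not in the destination yet
        elif src_file.size != dest_file.size:
            selected.append(src_file)  # size changed
        elif src_file.last_updated > dest_file.last_updated:
            selected.append(src_file)  # source is newer than the copy
    return selected


if __name__ == "__main__":
    old = datetime(2024, 3, 1, tzinfo=timezone.utc)
    new = datetime(2024, 3, 5, tzinfo=timezone.utc)
    source = [FileInfo("weights.bin", 10, new), FileInfo("readme.md", 2, old)]
    destination = [FileInfo("weights.bin", 10, old), FileInfo("readme.md", 2, new)]
    print([f.key for f in files_to_copy(source, destination)])
```

Files that match on size and are not newer at the source are skipped, which is what lets a repeated sync avoid re-copying large model artifacts.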

The overall CopyContentWorkflow algorithm is described at https://quip-amazon.com/SVt5A2n3i1hK/Curated-Hub-Copy-Content-Workflow-in-the-PySDK

Testing done:

Manual testing using the following:

from sagemaker.jumpstart.curated_hub.curated_hub import CuratedHub

HUB_NAME = "test-hub-123"
hub = CuratedHub(hub_name=HUB_NAME)

hub.delete()

hub.create(
    description="this is my hub",
    display_name="Test Hub 123",
)

describe = hub.describe()
print(describe)

hub.sync(
    [
        {"model_id": "huggingface-llm-mixtral-8x7b-instruct", "version": "1.2.0"},
        {"model_id": "model-txt2img-stabilityai-stable-diffusion-v2-1-base"},
    ]
)

Video: https://drive.corp.amazon.com/documents/bencrab@/multithreading_progress_bar.mov

Most tests have been added. Unit tests for MultiPartCopyHandler and for more of the sync algorithm still need to be written.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mufaddal-rohawala (Member)

AWS CodeBuild CI Report for commit 344d26b

  • sagemaker-python-sdk-unit-tests: FAILED
  • sagemaker-python-sdk-local-mode-tests: FAILED
  • sagemaker-python-sdk-notebook-tests: SUCCEEDED
  • sagemaker-python-sdk-slow-tests: SUCCEEDED
  • sagemaker-python-sdk-pr: SUCCEEDED

Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

return files

@format.register
def _(self, file_input: JumpStartModelSpecs) -> List[FileInfo]:
Member

What is this?

Collaborator Author

It registers an implementation for the @singledispatchmethod function above. I wanted the .format function to accept one of two input types and perform a different action for each. It is essentially an if/else block, but dispatching on the input type was nicer.

self.s3_client = s3_client
self.studio_specs = studio_specs

@singledispatchmethod
Member

I haven't seen this annotation before, I assume it's lifted from aws s3 sync. Can we use a simpler implementation without these fancy annotations? Unless they're absolutely necessary, I feel like they make maintainability more difficult.

Contributor

ya, this is just a really fancy way of implementing an if/else block. The true right way of doing this is through polymorphism, where the incoming objects all implement a common interface (in this case, they would all define a method called .format() that you could call). If that requires a significant refactor, then if/else (or match/case in Python 3.10+) is probably the better way to go

Collaborator Author

Thanks for the input, I'll take a look at this. I wanted a single input field that could be one of two types. Realistically, I can have two optional input fields, assert that at least one is defined, and then branch my logic.

Contributor

I agree with Ben on this, I believe this could be a good practice long-term to avoid bloating our functions with optional fields. I'd argue that singledispatchmethod achieves polymorphism in a functional rather than OOP style. Instead of having a Factory that creates multiple classes that implement format(), Ben's implementation here cuts down on class boilerplate by method overloading format(input) directly with single-line decorators. IMO it's confusing because it's a new paradigm we're not used to yet, not because it's a bad way to implement it
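For readers who have not met the decorator under discussion, the following is a minimal sketch of `functools.singledispatchmethod` (standard library, Python 3.8+). `ModelSpecs`, `BucketLocation`, and this `FileGenerator` are hypothetical stand-ins, not the classes from the PR; the point is only that the implementation is selected by the type annotation of the first non-self argument.

```python
from dataclasses import dataclass
from functools import singledispatchmethod
from typing import List


@dataclass
class ModelSpecs:
    """Hypothetical stand-in for JumpStartModelSpecs."""

    model_id: str


@dataclass
class BucketLocation:
    """Hypothetical stand-in for S3ObjectLocation."""

    bucket: str
    key: str


class FileGenerator:
    @singledispatchmethod
    def format(self, file_input) -> List[str]:
        # Fallback used when no registered type matches the argument.
        raise NotImplementedError(f"Unsupported input type: {type(file_input).__name__}")

    @format.register
    def _(self, file_input: ModelSpecs) -> List[str]:
        # Chosen when the argument is a ModelSpecs instance.
        return [f"public/{file_input.model_id}"]

    @format.register
    def _(self, file_input: BucketLocation) -> List[str]:
        # Chosen when the argument is a BucketLocation instance.
        return [f"s3://{file_input.bucket}/{file_input.key}"]


if __name__ == "__main__":
    generator = FileGenerator()
    print(generator.format(ModelSpecs("example-model")))
    print(generator.format(BucketLocation("example-bucket", "models/")))
```

The if/else alternative raised in the thread would put `isinstance` checks inside a single `format` body; both approaches dispatch on type, and the difference is mainly where the branching lives.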

public_model_data_accessor = PublicModelDataAccessor(
region=self.region, model_specs=file_input, studio_specs=studio_specs
)
function_table = {
Member

Can we rename this to hub_content_dependency_to_accessor_dict or something to that effect? And can we move it into a common module?

Collaborator Author

I use the instantiated PublicModelDataAccessor, but I can likely store that as a constant in the class and use that to map to the function

src/sagemaker/jumpstart/curated_hub/accessors/fileinfo.py (outdated, resolved)


class FileSync:
    """Something."""
Member

indeed

src_files = file_generator.format(model_specs)
dest_files = file_generator.format(dest_location)

files_to_copy = list(FileSync(src_files, dest_files, dest_location).call())
Member

this is a weird pattern with list(FileSync(src_files, dest_files, dest_location).call())? Can we simplify this?

Collaborator Author

I'm removing the list in my next revision

Member

but even call() sounds weird, can we rename the function to something more descriptive? Maybe execute?

Comment on lines +300 to +302
hub_content_display_name="",
hub_content_description="",
hub_content_markdown="",
Member

can we set to None instead?

Collaborator Author

These are just placeholders

Member

I know, but empty string can fail validation later

Contributor

ack - but we should pass the whole content of the markdown file associated with the model here.

from sagemaker.jumpstart.types import JumpStartDataHolderType


class HubContentDocument_v2(JumpStartDataHolderType):
Member

consider making a separate module for hub content document schemas, as there may be many

Collaborator Author

True, this is a placeholder though while @jinyoung-lim works on the implementation

Contributor

Consider also renaming it to HubContentDocumentV2

contents = response.get("Contents", None)

if not contents:
    print("Nothing to download")
Contributor

we are fine with regular print statements?

Collaborator Author

Placeholder for now, I'll double check if we want to use logger (prob the case)

src/sagemaker/jumpstart/curated_hub/curated_hub.py (outdated, resolved)
pass

# 2. Invalid model version exists in Hub, pass
# This will only happen if something goes wrong in our metadata
Contributor

what if someone wants to downgrade a model version?

Collaborator Author

We don't delete models from the Hub for them. They will have to manually delete the model, and call sync with the pinned version

Contributor

ok, well how about if they want to just import an older version?

Contributor

we should consider at least logging the skip and the skip reason for customers

if matched_model.version < version:
    # Check minSDKVersion against current SDK version, emit log
    models_to_sync.append(model)

Contributor

If we accept the use case of someone wanting to downgrade a model version, I think at a high level this would simplify to models_to_sync = set(hub_models) - set(model_list), with the results fed into the executor.

Contributor

+1, it seems like a good candidate for set operations Ben. Right now, you're performing list search in a loop, which is a recipe for trouble.
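The set-based approach suggested in this thread can be sketched as follows. This is an illustration under assumptions, not the merged code: `ModelRef` and `models_to_sync` are hypothetical names, and the sketch assumes the intended result is "requested models not already present in the Hub", i.e. requested minus existing.

```python
from typing import Iterable, NamedTuple, Set


class ModelRef(NamedTuple):
    """Hypothetical hashable (model_id, version) pair."""

    model_id: str
    version: str


def models_to_sync(requested: Iterable[ModelRef], in_hub: Iterable[ModelRef]) -> Set[ModelRef]:
    """Requested models whose exact (model_id, version) is not already in the Hub."""
    # Set difference replaces a nested "search the list for each model" loop.
    return set(requested) - set(in_hub)


if __name__ == "__main__":
    requested = [ModelRef("model-a", "1.2.0"), ModelRef("model-b", "2.0.0")]
    in_hub = [ModelRef("model-a", "1.2.0"), ModelRef("model-b", "2.1.0")]
    # model-b 2.0.0 is selected even though the Hub holds a newer version,
    # which is the downgrade / older-version import case raised in the thread.
    print(models_to_sync(requested, in_hub))
```

Because membership is tested on the exact pair, an older version is treated like any other missing entry, so no version ordering is needed.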

src/sagemaker/jumpstart/curated_hub/curated_hub.py (outdated, resolved)
src/sagemaker/jumpstart/curated_hub/accessors/sync.py (outdated, resolved)
@mufaddal-rohawala (Member)

AWS CodeBuild CI Report for commit 374c638

  • sagemaker-python-sdk-unit-tests: FAILED
  • sagemaker-python-sdk-notebook-tests: SUCCEEDED
  • sagemaker-python-sdk-local-mode-tests: SUCCEEDED
  • sagemaker-python-sdk-slow-tests: SUCCEEDED
  • sagemaker-python-sdk-pr: SUCCEEDED

AWS CodeBuild CI Report for commit 67d8ec8

  • sagemaker-python-sdk-unit-tests: FAILED

AWS CodeBuild CI Report for commit 30c2b91

  • sagemaker-python-sdk-unit-tests: FAILED (reported twice)
  • sagemaker-python-sdk-local-mode-tests: SUCCEEDED
  • sagemaker-python-sdk-notebook-tests: SUCCEEDED
  • sagemaker-python-sdk-pr: SUCCEEDED
  • sagemaker-python-sdk-slow-tests: SUCCEEDED


codecov bot commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 81.15578%, with 75 lines in your changes missing coverage. Please review.

❗ No coverage uploaded for pull request base (master-jumpstart-curated-hub@352a5c1). Click here to learn what that means.

Files                                                    Patch %   Lines
src/sagemaker/jumpstart/curated_hub/curated_hub.py       75.00%    31 Missing ⚠️
...r/jumpstart/curated_hub/accessors/multipartcopy.py    36.95%    29 Missing ⚠️
.../jumpstart/curated_hub/accessors/file_generator.py    80.95%    8 Missing ⚠️
src/sagemaker/jumpstart/curated_hub/types.py             92.50%    3 Missing ⚠️
src/sagemaker/jumpstart/cache.py                         0.00%     2 Missing ⚠️
...sagemaker/jumpstart/curated_hub/sync/comparator.py    96.00%    1 Missing ⚠️
...rc/sagemaker/jumpstart/curated_hub/sync/request.py    98.21%    1 Missing ⚠️
Additional details and impacted files
@@                       Coverage Diff                       @@
##             master-jumpstart-curated-hub    #4475   +/-   ##
===============================================================
  Coverage                                ?   87.03%           
===============================================================
  Files                                   ?      396           
  Lines                                   ?    36477           
  Branches                                ?        0           
===============================================================
  Hits                                    ?    31749           
  Misses                                  ?     4728           
  Partials                                ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bencrabtree changed the title from "first pass at sync function with util classes" to "MultiPartCopy with Sync Algorithm" on Mar 6, 2024
@@ -33,26 +33,19 @@ def __init__(
self.s3_client = s3_client
self.studio_specs = studio_specs

@singledispatchmethod
def format(self, file_input) -> List[FileInfo]:
Member

do we need this?

Comment on lines 79 to 92
for dependency in HubContentDependencyType:
location = function_table[dependency]()
parameters = {"Bucket": location.bucket, "Prefix": location.key}
response = self.s3_client.head_object(**parameters)
key: str = location.key
size: bytes = response.get("ContentLength", None)
last_updated: str = response.get("LastModified", None)
dependency_type: HubContentDependencyType = dependency
files.append(FileInfo(key, size, last_updated, dependency_type))
location: S3ObjectLocation = public_model_data_accessor.get_s3_reference(dependency)

# Prefix
if location.key[-1] == "/":
parameters = {"Bucket": location.bucket, "Prefix": location.key}
response = self.s3_client.list_objects_v2(**parameters)
contents = response.get("Contents", None)
for s3_obj in contents:
key: str = s3_obj.get("Key")
size: bytes = s3_obj.get("Size", None)
last_modified: datetime = s3_obj.get("LastModified", None)
dependency_type: HubContentDependencyType = dependency
files.append(
Member

too much indentation here, can we breakup into helper functions?

@mufaddal-rohawala (Member)

AWS CodeBuild CI Report for commit ce73f62

  • sagemaker-python-sdk-notebook-tests: SUCCEEDED
  • sagemaker-python-sdk-slow-tests: FAILED
  • sagemaker-python-sdk-pr: SUCCEEDED
  • sagemaker-python-sdk-unit-tests: SUCCEEDED

Comment on lines +70 to +78
if location_type == "prefix":
    parameters = {"Bucket": location.bucket, "Prefix": location.key}
    response = s3_client.list_objects_v2(**parameters)
    contents = response.get("Contents")
    for s3_obj in contents:
        key = s3_obj.get("Key")
        size = s3_obj.get("Size")
        last_modified = s3_obj.get("LastModified")
        files.append(
Member

nonblocking: lot of indentation depth here, consider moving to helper function

Member

i also see some duplicated code
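One way to act on the indentation and duplication comments is to pull the S3 lookups into small helpers. The sketch below is hypothetical (the function names are not from the PR) and assumes a boto3-style client: `get_paginator("list_objects_v2")` for prefixes and `head_object` with `Bucket` and `Key` for single objects. It also guards against a missing "Contents" entry, which the snippet above would fail on when a prefix is empty.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_prefix_objects(s3_client: Any, bucket: str, prefix: str) -> Iterator[Dict[str, Any]]:
    """Yield every object under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        yield from page.get("Contents", [])


def describe_object(s3_client: Any, bucket: str, key: str) -> Dict[str, Any]:
    """Return Key/Size/LastModified for a single object via head_object."""
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return {
        "Key": key,
        "Size": response.get("ContentLength"),
        "LastModified": response.get("LastModified"),
    }


def list_files(s3_client: Any, bucket: str, key: str) -> List[Tuple[str, Any, Any]]:
    """Treat keys ending in "/" as prefixes and everything else as one object."""
    if key.endswith("/"):
        objects = list(iter_prefix_objects(s3_client, bucket, key))
    else:
        objects = [describe_object(s3_client, bucket, key)]
    # Single shared place that turns S3 metadata into (key, size, last_modified).
    return [(obj["Key"], obj.get("Size"), obj.get("LastModified")) for obj in objects]
```

With both branches normalised to the same dictionary shape, the tuple (or `FileInfo`) construction happens once instead of being repeated per branch.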

client=self.s3_client, config=transfer_config
)

def _copy_file(self, file: FileInfo, progress_cb):
Member

nit: add typing to progress_cb
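A possible shape for the requested annotation is a `Callable[[int], None]` alias. The sketch assumes the callback receives the number of bytes transferred since the previous call, which is how boto3 transfer callbacks behave; `ProgressCallback`, `ProgressTracker`, and `copy_file` are hypothetical names, and the copy itself is simulated.

```python
import threading
from typing import Callable

# Assumed contract: called with the bytes transferred since the last call.
ProgressCallback = Callable[[int], None]


class ProgressTracker:
    """Thread-safe byte counter that several copy threads can share."""

    def __init__(self, total_bytes: int) -> None:
        self.total_bytes = total_bytes
        self.transferred = 0
        self._lock = threading.Lock()

    def update(self, bytes_transferred: int) -> None:
        # The lock keeps concurrent increments from being lost.
        with self._lock:
            self.transferred += bytes_transferred


def copy_file(size: int, progress_cb: ProgressCallback, chunk_size: int = 4) -> None:
    """Simulate copying `size` bytes, reporting each chunk to the callback."""
    remaining = size
    while remaining > 0:
        step = min(chunk_size, remaining)
        progress_cb(step)
        remaining -= step


if __name__ == "__main__":
    tracker = ProgressTracker(total_bytes=30)
    threads = [threading.Thread(target=copy_file, args=(10, tracker.update)) for _ in range(3)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(f"{tracker.transferred}/{tracker.total_bytes} bytes")
```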

Sets up progress bar and kicks off each copy request.
"""
total_size = sum([file.size for file in self.files])
JUMPSTART_LOGGER.warning(
Member

I don't think this should be a warning. Can we modify JUMPSTART_LOGGER so that JUMPSTART_LOGGER.info("blah", stdout=True) gets printed? A warning carries a negative connotation.
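As a point of reference for this suggestion: the standard `logging.Logger.info` does not take a `stdout` keyword, so that exact call would need a custom wrapper. A simpler alternative, sketched here under the assumption that a dedicated logger is acceptable, is an INFO-level logger with a stream handler pointed at stdout. `build_stdout_logger` is a hypothetical helper, not part of the SDK.

```python
import logging
import sys
from typing import Optional, TextIO


def build_stdout_logger(name: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Return a logger whose INFO messages are written to stdout (or `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    progress_logger = build_stdout_logger("hub_sync_progress")
    progress_logger.info("Copying %d files (%d bytes)", 3, 4096)
```

This keeps progress messages at INFO severity while still making them visible to users who have not configured logging themselves.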

@property
def default_training_dataset_s3_reference(self):
    """Retrieves s3 reference for s3 directory containing model training datasets"""
    return S3ObjectLocation(self._get_bucket_name(), self.__get_training_dataset_prefix())
Member

why are there 2 underscores for self.__get_training_dataset_prefix?
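For context on this question: inside a class body, a name with two leading underscores (and no trailing pair) is subject to Python name mangling, so `self.__x` is rewritten to `self._ClassName__x`. The classes below are hypothetical and only illustrate the effect, including why a subclass cannot call such a method by its short name.

```python
class Accessor:
    def _single(self) -> str:
        # Single underscore: a convention only, no renaming.
        return "single"

    def __double(self) -> str:
        # Double underscore: stored on the class as _Accessor__double.
        return "double"

    def call_double(self) -> str:
        # Inside Accessor this resolves to self._Accessor__double().
        return self.__double()


class SubAccessor(Accessor):
    def try_double(self) -> str:
        # Inside SubAccessor this is mangled to self._SubAccessor__double(),
        # which does not exist, so the call raises AttributeError.
        return self.__double()


if __name__ == "__main__":
    accessor = SubAccessor()
    print(accessor.call_double())
    print(accessor._Accessor__double())
    try:
        accessor.try_double()
    except AttributeError as error:
        print(f"AttributeError: {error}")
```

A single leading underscore is usually enough to mark a helper as internal; the double form is mainly useful for avoiding accidental overrides in subclasses.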

Comment on lines +77 to +85

@property
def demo_notebook_s3_reference(self):
    """Retrieves s3 reference for model demo jupyter notebook"""
    framework = self.model_specs.get_framework()
    key = f"{framework}-notebooks/{self.model_specs.model_id}-inference.ipynb"
    return S3ObjectLocation(self._get_bucket_name(), key)

@property
Member

I'm worried we could change the S3 file organization and this would break

self._s3_client = self._get_s3_client()
self.hub_storage_location = self._generate_hub_storage_location(bucket_name)

def _get_s3_client(self) -> BaseClient:
Member

wait, shouldn't this come from self._sagemaker_session?

label=dest_location.key,
).execute()
else:
JUMPSTART_LOGGER.warning("Nothing to copy for %s v%s", model.model_id, model.version)
Member

questionable use of warning here

f"{JUMPSTART_HUB_MODEL_ID_TAG_PREFIX}:{model.model_id}",
f"{JUMPSTART_HUB_MODEL_VERSION_TAG_PREFIX}:{model.version}",
f"{FRAMEWORK_TAG_PREFIX}:{model_specs.get_framework()}",
f"{TASK_TAG_PREFIX}:TODO: pull from specs",
Member

remove?

from sagemaker.jumpstart.curated_hub.types import FileInfo


class BaseComparator:
Member

consider removing, it's not used now, only helpful if there's >1 comparator

Args:
region (str): Region for the S3 Client
sync_request (HubSyncRequest): sync request object containing
Contributor

nit: Seems like the docstring is out of date?

self.region = region
self.files = sync_request.files
self.dest_location = sync_request.destination
self.thread_num = thread_num
Contributor

q: This value seems to be only used in tqdm, but from seeing this field I would assume it would control the thread count for s3transfer. Was thread_count = 20 the intent or am I misreading?


# Model does not exist in Hub, sync
if not matched_model:
    models_to_sync.append(model)
Contributor

nit: continue?

if not matched_model:
models_to_sync.append(model)

if matched_model:
Contributor

nit: No need to check + indent since if not matched_model: exists above

sync that model.
"""
models_to_sync = []
for model in model_list:
Contributor

nit: Consider using Python's built-in filter() or a comprehension to remove models that shouldn't be synced

class HubContentDocument_v2(JumpStartDataHolderType):
"""Data class for HubContentDocument v2.0.0"""

SCHEMA_VERSION = "2.0.0"
Contributor

q: Would we make a new class for each schema version?

Contributor

q: For the actual implementation, consider using a builder() mechanism to gate what arguments can be passed in based off version

@liujiaorr liujiaorr merged commit 3d08909 into aws:master-jumpstart-curated-hub Mar 12, 2024
6 of 8 checks passed
bencrabtree added a commit to bencrabtree/sagemaker-python-sdk that referenced this pull request Mar 13, 2024
* first pass at sync function with util classes

* adding tests and update clases

* linting

* file generator class inheritance

* lint

* multipart copy and algorithm updates

* modularize sync

* reformatting folders

* testing for sync

* do not tolerate vulnerable

* remove prints

* handle multithreading progress bar

* update tests

* optimize function and add hub bucket prefix

* docstrings and linting
benieric added a commit that referenced this pull request Mar 15, 2024
* prepare release v2.210.0

* update development version to v2.210.1.dev0

* feat: Add new Triton DLC URIs (#4432)

* Add new Triton DLC URIs

* Update according to black and pylint

* feat: Support selective pipeline execution between function step and regular step (#4392)

* feat: Add AutoMLV2 support (#4461)

* Add AutoMLV2 support

* Improvements of the integration tests

---------

Co-authored-by: Anton Repushko <repuanto@amazon.com>

* feature: Add TensorFlow 2.14 image configs (#4446)

* fix: remove enable_network_isolation from the python doc (#4465)

Co-authored-by: Rohan Gujarathi <gujrohan@amazon.com>

* doc: Add doc for new feature processor APIs and classes (#4250)

* fix: properly close sagemaker config file after loading config (#4457)

Closes #4456

* feat: instance specific jumpstart host requirements (#4397)

* feat: instance specific jumpstart host requirements

* chore: add js support for copies resource requirement, enforce coupling with ResourceRequirements class

* fix: typing

* fix: pylint

* change: Bump Apache Airflow version to 2.8.2 (#4470)

* Update tox.ini

* Update test_requirements.txt

* fix: make sure gpus are found in local_gpu run (#4384)

* fix: make sure gpus are found in local_gpu run

* fix: black formatting

* fix: adjust unit test

* feat: pin dll version to support python3.11 to the sdk (#4472)

Co-authored-by: Ashwin Krishna <ashwikri@amazon.com>

* fix: Skip No Canvas regions for test_deploy_best_candidate (#4477)

* prepare release v2.211.0

* update development version to v2.211.1.dev0

* change: Enhance model builder selection logic to include model size (#4429)

* change: Enhance model builder selection logic to include model size

* Fix conflicts

* Address PR comments

* fix formatting

* fix formatting of test

* Fix token in tasks.json

* Increase coverage for tests

* fix formatting

* Fix requirements

* Import code instead of importing accelerate

* Fix formatting

* Setup dependencies

* change: Upgrade smp to version 2.2 (#4479)

* upgrading smp to version 2.2

* fixing linting issue

* fixing syntax error with multiline if statement

* upgrading smp to version 2.2

* fixing linting issue

* fixing syntax error with multiline if statement

* fixing formatting

---------

Co-authored-by: Andrew Tian <tinandr@amazon.com>

* feat: Update SM Python SDK for PT 2.2.0 SM DLC (#4481)

* update pt2.2 sm training dlc pysdk

* update pt2.2 sm inference dlc pysdk and region list

* fix: Create custom tarfile extractall util to fix backward compatibility issue (#4476)

* fix: Create custom tarfile extractall util to fix backward compatibility issue

* Address review comments

* fix logger.error statements

* prepare release v2.212.0

* update development version to v2.212.1.dev0

* change: Update tblib constraint (#4452)

* fix: make unit tests compatible with pytest-xdist (#4486)

* fix: make unit tests compatible with pytest-xdist

* fix failing test

* feature: Add overriding logic in ModelBuilder when task is provided (#4460)

* feat: Add Optional task to Model

* Revert "feat: Add Optional task to Model"

This reverts commit fd3e86b.

* Add override logic in ModelBuilder with task provided

* Adjusted formatting

* Add extra unit tests for invalid inputs

* Address PR comments

* Add more test inputs to integration test

* Add model_metadata field to ModelBuilder

* Update doc

* Update doc

* Adjust formatting

---------

Co-authored-by: Samrudhi Sharma <samruds@amazon.com>
Co-authored-by: Xiong Zeng <xionzeng@amazon.com>

* feature: Accept user-defined env variables for the entry-point (#4175)

* fix: Move sagemaker pysdk version check after bootstrap in remote job (#4487)

* change: enable github actions for PRs (#4489)

* change: enable github actions for PRs

* Update codebuild-ci.yml

* trigger on pull_request_target

* add source-version-override

* fix permission

* feature: Add ModelDataSource and SourceUri support for model package and while registering (#4492)

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>

* feat: support JumpStart proprietary models (#4467)

* feat: add proprietary manifest/specs parsing

add unittests for test_cache

small refactoring

address comments and more unittests

fix linting and fix more tests

fix: pylint

feat: JumpStartModel class for prop models

* remove unused imports and fix docstyle

* fix: remove unused args

* fix: remove unused args

* fix: more unused vars

* fix: slow tests

* fix: unittests

* added more tests to cover some lines

* remove estimator warn check

* chore: address comments re performance

* fix: address comments

* complete list experience and other fixes

* fix: pylint

* add doc utils and fix pylint

* fix: docstyle

* fix: doc

* fix: default payloads

* fix: doc and tags and enums

* fix: jumpstart doc

* rename to open_weights and fix filtering

* update filter name

* doc update

* fix: black

* rename to proprietary model and fix unittests

* address comments

* fix: docstyle and flake8

* address more comments and fix doc

* put back doc utils for future refactoring

* add prop model title in doc

* doc update

---------

Co-authored-by: liujiaor <128006184+liujiaorr@users.noreply.github.com>

* chore: emit warning when no instance specific gated training env var is available, and raise exception when accept_eula flag is not supplied (#4485)

* fix: raise exception when no instance specific gated training env var available

* chore: raise client exception if accept_eula flag is not set for gated models

* chore: address flake8 errors

* chore: emit warning when instance type is chosen with no gated training artifacts

* change: bump jinja2 to 3.1.3 in doc/requirements.txt (#4421) (#4423)

* change: bump jinja2 to 3.1.3 in doc/requirements.txt (#4421)

* change: bump jinja2 to 3.1.3 in doc/requirements.txt

* Update requirements.txt

* feature: TGI 1.4.0 (#4424)

* documentation: fix the ClarifyCheckStep documentation to mention PDP (#4259)

* documentation: fix the ClarifyCheckStep documentation to mention PDP support

* fix: break the lines to meet pylint requirement

---------

Co-authored-by: Shing Lyu <shinglyu@amazon.nl>

* documentation: Explain the ClarifyCheckStep and QualityCheckStep parameters (#4261)

* documentation: explain the ClarifyCheckStep and QualityCheckStep parameters

* fix: remove trailing space

---------

Co-authored-by: Shing Lyu <shinglyu@amazon.nl>

* feat: Telemetry metrics (#4414)

* Emit additional telemetry metrics

* Fix unit tests

* Emit endpoint failure to telemetry

* Address PR Comments

* Emit latency in telemetry

* Address PR Comments

* Addressed PR Comments

* Address PR Comments

* Fix tests

* Fix integ tests

---------

Co-authored-by: Jonathan Makunga <makung@amazon.com>
Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>

* documentation: change order of pipelines topics (#4427)

* prepare release v2.208.0

* update development version to v2.208.1.dev0

* feature: AutoGluon 1.0.0 image_uris update (#4426)

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
Co-authored-by: Jinyoung Lim <jj.lim418@gmail.com>
Co-authored-by: Shing Lyu <shing.lyu@gmail.com>
Co-authored-by: Shing Lyu <shinglyu@amazon.nl>
Co-authored-by: Jonathan Makunga <54963715+makungaj1@users.noreply.github.com>
Co-authored-by: Jonathan Makunga <makung@amazon.com>
Co-authored-by: stacicho <stacicho@amazon.com>
Co-authored-by: ci <ci>
Co-authored-by: tonyhu <tonyhoo@users.noreply.github.com>

* feat: add hub and hubcontent support in retrieval function for jumpstart model cache (#4438)

* feat: jsch jumpstart estimator support (#4439)

* Master jumpstart curated hub (#4464)

* add hub_arn support for accept_types, content_types, serializers, deserializers, and predictor (#4463)

* feature: JumpStart CuratedHub class creation and function definitions (#4448)

* MultiPartCopy with Sync Algorithm (#4475)

* first pass at sync function with util classes

* adding tests and update classes

* linting

* file generator class inheritance

* lint

* multipart copy and algorithm updates

* modularize sync

* reformatting folders

* testing for sync

* do not tolerate vulnerable

* remove prints

* handle multithreading progress bar

* update tests

* optimize function and add hub bucket prefix

* docstrings and linting

* rebase with master

* bad rebase

* trying to fix codecov

* uncomment codebuild-ci

---------

Co-authored-by: ci <ci>
Co-authored-by: Nikhil Kulkarni <knikhil29@gmail.com>
Co-authored-by: qidewenwhen <32910701+qidewenwhen@users.noreply.github.com>
Co-authored-by: Anton Repushko <repushko.a@gmail.com>
Co-authored-by: Anton Repushko <repuanto@amazon.com>
Co-authored-by: Sai Parthasarathy Miduthuri <54188298+saimidu@users.noreply.github.com>
Co-authored-by: Rohan Gujarathi <gujarathi.rohan@gmail.com>
Co-authored-by: Rohan Gujarathi <gujrohan@amazon.com>
Co-authored-by: cansun <80425164+can-sun@users.noreply.github.com>
Co-authored-by: Justin <justinm088@hotmail.com>
Co-authored-by: evakravi <69981223+evakravi@users.noreply.github.com>
Co-authored-by: Kalyani Nikure <110067132+knikure@users.noreply.github.com>
Co-authored-by: gv <gverkes@users.noreply.github.com>
Co-authored-by: akrishna1995 <38850354+akrishna1995@users.noreply.github.com>
Co-authored-by: Ashwin Krishna <ashwikri@amazon.com>
Co-authored-by: Samrudhi Sharma <154457034+samruds@users.noreply.github.com>
Co-authored-by: adtian2 <55163384+adtian2@users.noreply.github.com>
Co-authored-by: Andrew Tian <tinandr@amazon.com>
Co-authored-by: Sirut Buasai <73297481+sirutBuasai@users.noreply.github.com>
Co-authored-by: Danny Bushkanets <d.bushkanets@gmail.com>
Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
Co-authored-by: xiongz945 <54782408+xiongz945@users.noreply.github.com>
Co-authored-by: Samrudhi Sharma <samruds@amazon.com>
Co-authored-by: Xiong Zeng <xionzeng@amazon.com>
Co-authored-by: martinRenou <martin.renou@gmail.com>
Co-authored-by: mrudulmn <161017394+mrudulmn@users.noreply.github.com>
Co-authored-by: Haotian An <33510317+Captainia@users.noreply.github.com>
Co-authored-by: liujiaor <128006184+liujiaorr@users.noreply.github.com>
Co-authored-by: Jinyoung Lim <jj.lim418@gmail.com>
Co-authored-by: Shing Lyu <shing.lyu@gmail.com>
Co-authored-by: Shing Lyu <shinglyu@amazon.nl>
Co-authored-by: Jonathan Makunga <54963715+makungaj1@users.noreply.github.com>
Co-authored-by: Jonathan Makunga <makung@amazon.com>
Co-authored-by: stacicho <stacicho@amazon.com>
Co-authored-by: tonyhu <tonyhoo@users.noreply.github.com>
bencrabtree added a commit to bencrabtree/sagemaker-python-sdk that referenced this pull request Mar 18, 2024
* first pass at sync function with util classes

* adding tests and update classes

* linting

* file generator class inheritance

* lint

* multipart copy and algorithm updates

* modularize sync

* reformatting folders

* testing for sync

* do not tolerate vulnerable

* remove prints

* handle multithreading progress bar

* update tests

* optimize function and add hub bucket prefix

* docstrings and linting
benieric pushed a commit that referenced this pull request Mar 18, 2024
* fix: Move sagemaker pysdk version check after bootstrap in remote job (#4487)

* feat: support JumpStart proprietary models (#4467)

* feat: add proprietary manifest/specs parsing

add unittests for test_cache

small refactoring

address comments and more unittests

fix linting and fix more tests

fix: pylint

feat: JumpStartModel class for prop models

* remove unused imports and fix docstyle

* fix: remove unused args

* fix: remove unused args

* fix: more unused vars

* fix: slow tests

* fix: unittests

* added more tests to cover some lines

* remove estimator warn check

* chore: address comments re performance

* fix: address comments

* complete list experience and other fixes

* fix: pylint

* add doc utils and fix pylint

* fix: docstyle

* fix: doc

* fix: default payloads

* fix: doc and tags and enums

* fix: jumpstart doc

* rename to open_weights and fix filtering

* update filter name

* doc update

* fix: black

* rename to proprietary model and fix unittests

* address comments

* fix: docstyle and flake8

* address more comments and fix doc

* put back doc utils for future refactoring

* add prop model title in doc

* doc update

---------

Co-authored-by: liujiaor <128006184+liujiaorr@users.noreply.github.com>

* feat: add hub and hubcontent support in retrieval function for jumpstart model cache (#4438)

* feat: jsch jumpstart estimator support (#4439)

* Master jumpstart curated hub (#4464)

* add hub_arn support for accept_types, content_types, serializers, deserializers, and predictor (#4463)

* feature: JumpStart CuratedHub class creation and function definitions (#4448)

* MultiPartCopy with Sync Algorithm (#4475)

* first pass at sync function with util classes

* adding tests and update classes

* linting

* file generator class inheritance

* lint

* multipart copy and algorithm updates

* modularize sync

* reformatting folders

* testing for sync

* do not tolerate vulnerable

* remove prints

* handle multithreading progress bar

* update tests

* optimize function and add hub bucket prefix

* docstrings and linting

* rebase with master

* bad rebase

* support for gated and training unsupported

* merge with master-curated-jumpstart

* linting

* update types

* update

* update bootstrap

* fix codecov

---------

Co-authored-by: qidewenwhen <32910701+qidewenwhen@users.noreply.github.com>
Co-authored-by: Haotian An <33510317+Captainia@users.noreply.github.com>
Co-authored-by: liujiaor <128006184+liujiaorr@users.noreply.github.com>
Co-authored-by: Jinyoung Lim <jj.lim418@gmail.com>
bencrabtree added a commit to bencrabtree/sagemaker-python-sdk that referenced this pull request Mar 20, 2024
bencrabtree added a commit to bencrabtree/sagemaker-python-sdk that referenced this pull request Mar 21, 2024
bencrabtree added a commit to bencrabtree/sagemaker-python-sdk that referenced this pull request Mar 23, 2024

10 participants