-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PLA-699][external] Allow registration of multi-slotted read-write files from external storage #790
Conversation
@@ -189,7 +189,7 @@ def get_remote_dataset( | |||
parsed_dataset_identifier.team_slug = self.default_team | |||
|
|||
try: | |||
matching_datasets: List[RemoteDataset] = [ | |||
matching_datasets: List[RemoteDatasetV2] = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These type changes aren't functionally significant, but since we no longer have RemoteDatasetV1
they make some typechecker errors less likely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still some places (docstrings/types) where we use RemoteDataset
- should we change all of them 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, although I don't think it's a high priority change. I made the change here because I noticed Pylance type checking complaining, but didn't comb the rest of the repo
If you agree, I'd like to do that at some point down the line
darwin/dataset/remote_dataset_v2.py
Outdated
if not isinstance(storage_keys, dict) or not all( | ||
isinstance(v, list) and all(isinstance(i, str) for i in v) | ||
for v in storage_keys.values() | ||
): | ||
raise ValueError( | ||
"storage_keys must be a dictionary with keys as item names and values as lists of storage keys." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pydantic models could be very efficient here! - but i guess we only use pydantic in darwin-py 2.0 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good point. We do, but I don't see a problem with using models here
We could do something like:
class StorageKeysModel(BaseModel):
storage_keys: Dict[str, List[str]] = Field(..., description="A dictionary with keys as item names and values as lists of storage keys.")
and then:
try:
validated_input = StorageKeysModel(storage_keys=storage_keys)
except ValidationError as e:
raise ValueError(e)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, that'd work! 🙏
darwin/dataset/remote_dataset_v2.py
Outdated
# Do not register more than 500 items in a single request | ||
chunk_size = 500 | ||
chunked_items = ( | ||
items[i : i + chunk_size] for i in range(0, len(items), chunk_size) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've seen the chunking/batching function before somewhere - we can create a new fn and re-use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 🙏
Problem
Currently, the REST API is the only way to register externally stored items. The ability to register items using darwin-py is simpler in some scenarios and can prevent difficult-to-spot payload errors
Solution
Added user-facing methods to register multi-slotted read-write items from external storage
Changelog
Added ability to register multi-slotted read-write files from external storage in darwin-py