# [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517][External] Multi-file push #923

## Conversation
```python
# Need to add a `data` attribute for MultiFileItem? Where do we get `fps` from?
def serialize_v2(self):
    optional_properties = ["fps"]
```
`optional_properties` only contains 1 item for now, but this is to mirror how the `serialize_v2()` method works for `LocalFile`. If we add support for others such as `as_frames` or `extract_views` in future, this will be easy to extend.
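To make the pattern concrete, here is a minimal sketch of what such an extensible `serialize_v2()` could look like (the surrounding attribute names are assumptions, not the actual darwin-py implementation):

```python
def serialize_v2(self):
    # Mirror LocalFile.serialize_v2(): start from the required fields,
    # then attach optional properties only when they are set.
    optional_properties = ["fps"]
    slot = {"file_name": self.name, "slot_name": "0"}
    for prop in optional_properties:
        value = getattr(self, prop, None)
        if value is not None:
            slot[prop] = value
    return {"slots": [slot], "name": self.name}
```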
**darwin/dataset/upload_manager.py** (outdated)
```python
single_file_items = self.local_files
upload_payloads = []
if self.multi_file_items:
    upload_payloads.extend(
        [
            {
                "items": [file.serialize_v2() for file in file_chunk],
                "options": {"ignore_dicom_layout": True},
            }
            for file_chunk in chunk(self.multi_file_items, chunk_size)
        ]
    )
    local_files_for_multi_file_items = [
        file
        for multi_file_item in self.multi_file_items
        for file in multi_file_item.files
    ]
    single_file_items = [
        file
        for file in single_file_items
        if file not in local_files_for_multi_file_items
    ]

upload_payloads.extend(
    [
        {"items": [file.serialize_v2() for file in file_chunk]}
        for file_chunk in chunk(single_file_items, chunk_size)
    ]
)
```
We do it this way so that a single `push()` operation is capable of uploading both single-file items and multi-file items, in case we want to develop a more complex multi-file item upload feature in the future.
---
I think it is safe to default `ignore_dicom_layout` to true.
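For illustration, a single multi-file-item payload chunk would then take roughly this shape (values are placeholders inferred from the snippet above):

```python
payload = {
    "items": [
        {"name": "series_1", "slots": ["..."]},  # one serialized multi-file item
    ],
    # Per the comment above, ignore_dicom_layout defaults to True so DICOM
    # files are concatenated even when multiple volumes are detected.
    "options": {"ignore_dicom_layout": True},
}
```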
```python
filtered_files = [
    f for f in found_files if str(f) not in files_to_exclude_full_paths
]
if sort:
    return natsorted(filtered_files)
return filtered_files
```
We do this because, when uploading multi-file items, the order of the files is important. Natural sorting seems like the more appropriate approach to me.
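A quick illustration of the difference, using the `natsort` package already called above:

```python
from natsort import natsorted

files = ["slice_1.dcm", "slice_10.dcm", "slice_2.dcm"]

print(sorted(files))     # ['slice_1.dcm', 'slice_10.dcm', 'slice_2.dcm'] (lexicographic)
print(natsorted(files))  # ['slice_1.dcm', 'slice_2.dcm', 'slice_10.dcm'] (numeric-aware)
```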
**darwin/cli_functions.py** (outdated)
```python
    item.reason,
)
for slot in item.slots:
    if slot["reason"] != "ALREADY_EXISTS":
```
This check is different from what we do at line 802. Do we expect `reason` to always be defined and uppercase at this point?
---
From looking at the relevant BE section, yes: every blocked item will have a reason, and all the reasons are uppercase. However, I don't see a downside to introducing handling for potentially missing or lowercase reasons, so I can adjust this section to work as above.
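A minimal sketch of that defensive variant, assuming `slot` is a plain dict from the API response:

```python
# Treat a missing reason as empty and normalise case before comparing,
# so absent or lowercase reasons are handled too.
reason = (slot.get("reason") or "").upper()
if reason != "ALREADY_EXISTS":
    other_skipped_items.append(item)  # hypothetical follow-up handling
```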
**darwin/cli_functions.py** (outdated)
```python
other_skipped_items.append(item)
for slot in item.slots:
    if (slot["reason"] is not None) and (
        slot["reason"].upper() == "ALREADY_EXISTS"
```
I know this was pre-existing, but I fear strings such as `"ALREADY_EXISTS"` and `"UPLOAD_REQUEST"`: it's too easy to make a typo. I usually create a variable to contain such constants.
---
Good point. I'll move the reasons into constants based on their BE values.
---
Also, "UPLOAD_REQUEST"
doesn't appear to be a BE constant. Instead, it looks like an arbitrarily chosen string to represent the stage of the upload where the error occurred
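One way to centralise these strings, as suggested; the names here are illustrative, not necessarily what landed in the PR:

```python
# Reason returned by the BE for blocked items
BLOCKED_REASON_ALREADY_EXISTS = "ALREADY_EXISTS"

# dpy-side marker for the upload stage where an error occurred;
# not a BE constant, per the discussion above
UPLOAD_STAGE_UPLOAD_REQUEST = "UPLOAD_REQUEST"
```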
**darwin/dataset/remote_dataset_v2.py** (outdated)
```python
merge_incompatible_args = {
    "preserve_folders": preserve_folders,
    "as_frames": as_frames,
    "extract_views": extract_views,
}
```
Also here, IMO, there's a high risk of typos. I'd do:

```python
# Define constants for the keys
PRESERVE_FOLDERS_KEY = "preserve_folders"
AS_FRAMES_KEY = "as_frames"
EXTRACT_VIEWS_KEY = "extract_views"

# Use the constants in the dictionary
merge_incompatible_args = {
    PRESERVE_FOLDERS_KEY: preserve_folders,
    AS_FRAMES_KEY: as_frames,
    EXTRACT_VIEWS_KEY: extract_views,
}
```
**darwin/dataset/remote_dataset_v2.py** (outdated)
```diff
@@ -842,3 +862,136 @@ def register_multi_slotted(
             print(f" - {item}")
         print(f"Registration complete. Check your items in the dataset: {self.slug}")
         return results
+
+    def _find_files_to_upload_merging(
```
nit: `_find_files_to_upload_with_merging`. I think it reads a bit better. Looking at the function below, this could also be `_find_files_to_upload_as_multislot`?
---
Addressed in a comment below; there are better, more generic names I'll use.
**darwin/dataset/remote_dataset_v2.py** (outdated)
```
Finds files to upload as either:
- Multi-slotted items
- Multi-channel items
- Single-slotted items containing multiple `.dcm` files
```
Is this too specific? If we add an additional modality, we'll have to modify the doc as well. To be fair, not a big deal, but it's easy to forget to update the doc. We could say: "finds files to upload according to the `item_merge_mode`"?
**darwin/dataset/remote_dataset_v2.py** (outdated)
```python
    return local_files, multi_file_items


def _find_files_to_upload_no_merging(
```
`_find_files_to_upload_as_singleslot` reads a bit better. Usually I avoid defining something by negating something else.
---
How about naming the pair of functions:

- `_find_files_to_upload_as_single_file_items`
- `_find_files_to_upload_as_multi_file_items`

This is generic enough that we can add modalities in the future.
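A sketch of the proposed pair; the parameter lists are assumptions, and only the return shape of the merging variant (`local_files, multi_file_items`) is taken from the diff above:

```python
def _find_files_to_upload_as_multi_file_items(search_files, files_to_exclude, fps):
    """Group first-level directory contents into multi-file items."""
    ...


def _find_files_to_upload_as_single_file_items(search_files, files_to_exclude, fps):
    """Treat each found file as its own single-file item."""
    ...
```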
**darwin/dataset/upload_manager.py** (outdated)
```
Creates the layout to be used when uploading the files as a dataset item:
- For multi-slotted items: LayoutV2
- For series items: LayoutV2, but only with `.dcm` files
- For multi-channel items: LayoutV3
```
NB: `LayoutV3` can also be used to upload multi-slot items and series. Maybe it's easier to use a single layout version below?
---
If we want to maintain the ability to upload files with >16 slots, then unfortunately I think we have to stick to `LayoutV2`. Confirming with the channels engineers.
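For reference, the rough shape of the two layout versions being discussed; these literals are simplified illustrations, not the full schema:

```python
# LayoutV2: explicit list of slots, with no 16-slot ceiling
layout_v2 = {"version": 2, "type": "grid", "slots": ["0", "1", "2"]}

# LayoutV3: slots_grid, used below for multi-channel items
# (the BE currently caps a multi-channel item at 16 files)
layout_v3 = {"version": 3, "slots_grid": [[["red.tiff", "green.tiff"]]]}
```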
```python
raise ValueError(
    f"No multi-channel item can have more than 16 files. The following directory has {len(self.files)} files: {self.directory}"
)
```
Should we get this from the API response, so we don't have to update dpy when changing this on the BE?
---
Ideally, yes. The problem is: how do we get the limit from the API? Is there a call we can make to get the limit? My understanding is that there isn't a limit at the moment.
---
There isn't; basically, dpy tries to upload and gets an error from the BE. We could argue it's late in terms of UX, but I have a feeling the limitation will vary in the future, and I'm not sure it's a good use of our time to update dpy every time for such a scenario 🤔
---
I see 2 difficulties with this:

- If we get an upload error from the BE that we've exceeded the limit, how do we know what the limit is? Is the error guaranteed to include the limit?
- This section of the code, where we're creating each `MultiFileItem`, isn't where we perform the upload. We construct all the `MultiFileItem`s first, then move to upload afterward. This means that if one item exceeds the limit but others don't, we could end up with a partially completed upload at the point of error.
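In other words, validation happens at construction time, before any network call. A rough sketch of the flow, with hypothetical helper names:

```python
# Constructing every MultiFileItem up front means a >16-file directory
# raises here, before any upload has started...
multi_file_items = [
    MultiFileItem(directory, files)  # may raise ValueError on >16 files
    for directory, files in group_files_by_directory(found_files)
]

# ...whereas a BE-side limit would only surface per item during upload,
# potentially leaving earlier items already uploaded.
upload_payloads = build_payloads(multi_file_items)
```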
**darwin/dataset/upload_manager.py** (outdated)
"slots_grid": [[[file.local_path.name for file in self.files]]], | ||
} | ||
|
||
def serialize_v2(self): |
Does `v2` refer to the layout version? Because I also see the `CHANNELS` mode managed below, and that is only v3.
---
`v2` here refers to version 2 of Darwin JSON. It's leftover from when darwin-py used to support Darwin JSON 1.0 as well. The function can be renamed to `serialize_darwin_json_v2` or something similar?
---
Yeah that could work 💯
**darwin/dataset/upload_manager.py** (outdated)
f"Cannot match {item.full_path} from payload with files to upload" | ||
for slot in item.slots: | ||
upload_id = slot["upload_id"] | ||
slot_path = (Path(item.path) / Path((slot["file_name"]))).as_posix() |
Is `slot` a TypedDict? In that case we shouldn't be able to have a typo in `file_name`.
---
`slot` is part of a raw API response, so a typo shouldn't be possible, but I'm not sure I follow. Could you expand?
---
If you type `"file_nameS"`, would the linter tell you that it's not a valid value, or would we get an error only at runtime?
---
Changed. I've put in 2 classes for better type safety of API responses.
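For context, the kind of type safety being discussed, as a minimal sketch (the field set is assumed from the snippet above):

```python
from typing import TypedDict

class SlotPayload(TypedDict):
    upload_id: str
    file_name: str
    reason: str

def slot_file_name(slot: SlotPayload) -> str:
    return slot["file_name"]
    # slot["file_nameS"] would be flagged by mypy/pyright at check time,
    # instead of raising KeyError only at runtime.
```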
## Problem

Currently, `push` only supports single-slotted, single-file items. It should support multi-slotted, multi-channel, and DICOM series items.

## Solution

Support for multi-file items is added through the `item_merge_mode` argument of the `push` command. Details below:

`item_merge_mode: Enum["slot", "series", "channel"]`

It is available through both CLI- and SDK-initiated `push()` with this syntax:
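(The original syntax example did not survive extraction; below is a minimal SDK sketch consistent with the `dataset.push()` calls shown elsewhere in this PR, with a hypothetical dataset slug.)

```python
from darwin.client import Client

client = Client.local()
dataset = client.get_remote_dataset("my-team/my-dataset")  # hypothetical slug

# Upload each first-level directory as a single DICOM series item
dataset.push(files_to_upload=["data/series_1"], item_merge_mode="series")
```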
Because it is incompatible with `preserve_folders`, we raise an error if `preserve_folders` is set:

```python
dataset.push(files_to_upload=files_to_upload, item_merge_mode=item_merge_mode, preserve_folders=True)
```
Given this, we throw 1 non-blocking warning for each instance of:

- A `push()` that references a specific file rather than a directory, since these are ignored
- A directory that, after applying the `item_merge_mode` rules, results in an empty directory

## Behaviour of `item_merge_mode` options

Each file path passed to `push` must be a directory. The files inside the 1st level of that directory are treated as follows:

- Multi-slotted items (`"slot"`): each file is uploaded to its own slot of a single item
- Multi-channel items (`"channel"`): each file is uploaded as a channel of a single item, up to 16 files
- DICOM series (`"series"`): each `.dcm` file is concatenated into the same slot, using the `ignore_dicom_layout` option to ensure they are concatenated even if multiple volumes are detected
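Continuing the SDK sketch above, a worked illustration of the three modes on a hypothetical directory:

```python
# data/series_1/ contains: 001.dcm, 002.dcm, 010.dcm

# "slot":    one item with three slots, one file per slot
# "channel": one item with three channels (capped at 16 files)
# "series":  one item, one slot, the three .dcm files concatenated
#            in natural sort order (001, 002, 010)
dataset.push(files_to_upload=["data/series_1"], item_merge_mode="series")
```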
## Changelog

Added `push` support for multi-file items, specifically:

- Multi-slotted items
- Multi-channel items
- Concatenated `.dcm` files (DICOM series)