[DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517][External] Multi-file push #923

Merged
merged 13 commits into master from multi-file-push on Sep 26, 2024

Conversation

JBWilkie
Collaborator

@JBWilkie commented Sep 2, 2024

Problem

Currently, push only supports single-slotted, single-file items. It should support multi-slotted, multi-channel, and DICOM series items.

Solution

Support for multi-file items is added through the new item_merge_mode argument to push. Details below:

item_merge_mode: Enum["slots", "series", "channels"]

It is available through both CLI & SDK-initiated push() with this syntax:

# CLI
darwin dataset push /path/to/files --item-merge-mode [slots | series | channels]
# SDK
from darwin.client import Client

team_slug = "team_slug"
dataset_slug = "dataset_slug"
files_to_upload = ["path/to/dir_1", "path/to/dir_2", "...", "path/to/dir_n"]
item_merge_mode = "slots"  # one of "slots", "series", or "channels"

client = Client.local()
dataset = client.get_remote_dataset(dataset_identifier=f"{team_slug}/{dataset_slug}")
dataset.push(files_to_upload=files_to_upload, item_merge_mode=item_merge_mode)

Because item_merge_mode is incompatible with preserve_folders, we raise an error if both are set:
dataset.push(files_to_upload=files_to_upload, item_merge_mode=item_merge_mode, preserve_folders=True)
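
A condensed sketch of how such a guard can work, modelled on the merge_incompatible_args dict quoted later in this thread; the function name and error type here are illustrative, not the PR's exact code:

# Illustrative sketch only - names are hypothetical.
def _validate_merge_args(item_merge_mode, preserve_folders, as_frames, extract_views):
    if item_merge_mode is None:
        return
    merge_incompatible_args = {
        "preserve_folders": preserve_folders,
        "as_frames": as_frames,
        "extract_views": extract_views,
    }
    enabled = [name for name, value in merge_incompatible_args.items() if value]
    if enabled:
        raise TypeError(
            f"`item_merge_mode` is incompatible with: {', '.join(enabled)}"
        )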

Given this, we throw one non-blocking warning for each instance of:

  • A file path passed to push() that references a specific file rather than a directory, since these are ignored
  • A folder that, after being parsed according to the specified item_merge_mode rules, results in an empty directory

Behaviour of item_merge_mode options

Each file path passed to push must be a directory. The files inside the first level of that directory are treated as follows:

Multi-slotted items

  • Each file is assigned its own slot
  • Dataset items are named after the folder containing the files that result in the item
  • Slots are named as incrementing integers counting from 0
  • The layout shape is set according to the number of files, to ensure an optimal arrangement (see the sketch below)
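
A minimal sketch of how an optimal grid shape can be derived from the file count; this is an assumption about the approach, not the PR's exact code:

import math

def grid_shape(num_files: int) -> tuple:
    # Choose a near-square grid: columns = ceil(sqrt(n)), then enough rows to fit.
    cols = math.ceil(math.sqrt(num_files))
    rows = math.ceil(num_files / cols)
    return (cols, rows)

print(grid_shape(1))  # (1, 1)
print(grid_shape(6))  # (3, 2)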

Multi-channel items

  • Each file is assigned to a channel
  • Dataset items are named after the folder containing the files that result in the item
  • Each slot is named after the file in the slot
  • No item may have more than 16 channels; we throw an error if this occurs (see the example below)
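
For illustration, a hedged sketch of what a LayoutV3 payload for a three-channel item could look like, with each slot named after its file. Only the slots_grid key is confirmed by the diff later in this thread; the surrounding shape is an assumption:

# Hypothetical three-channel item built from a folder of three files.
layout = {
    "version": 3,
    "slots_grid": [[["red.tiff", "green.tiff", "blue.tiff"]]],
}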

DICOM series

  • Each .dcm file is concatenated into the same slot
  • Other file types are ignored
  • The slot is named after the folder containing the files that result in the item
  • Since we can assume users intend to concatenate input files, we apply the ignore_dicom_layout option to ensure they are concatenated, even if multiple volumes are detected (see the CLI example below)
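
For example, a series push from the CLI (the path is illustrative) looks like:

darwin dataset push /path/to/series_directories --item-merge-mode series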

Changelog

Added push support for multi-file items, specifically:

  • Multi-slotted items
  • Multi-channel items
  • DICOM series consisting of multiple concatenated .dcm files

wiz-inc-4ad3b29aa7 bot commented Sep 2, 2024

Wiz Scan Summary

Scan Module            Critical  High  Medium  Low  Info  Total
IaC Misconfigurations  0         0     0       0    0     0
Vulnerabilities        0         2     1       0    0     3
Sensitive Data         0         0     0       0    0     0
Secrets                0         0     0       0    0     0
Total                  0         2     1       0    0     3

View scan details in Wiz

@JBWilkie JBWilkie changed the title [DAR-3513][DAR-3514][DAR-3515][DAR-3516] Multi-file push [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517] Multi-file push Sep 2, 2024

# Need to add a `data` attribute for MultiFileItem? Where do we get `fps` from?
def serialize_v2(self):
    optional_properties = ["fps"]
Collaborator Author
@JBWilkie Sep 2, 2024

optional_properties only contains 1 item for now, but this is to mirror how the serialize_v2() method works for LocalFile. If we add support for other properties such as as_frames or extract_views in the future, this will be easy to extend
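
A self-contained sketch of the pattern being mirrored (an assumption based on the description above, not LocalFile's exact code): only optional keys that are actually set get copied into the payload, so extending support is a one-line change to the list:

def serialize_optional(data: dict) -> dict:
    # Hypothetical helper: copy across only the optional properties that are set.
    optional_properties = ["fps", "as_frames", "extract_views"]
    payload = {"file_name": data["file_name"]}
    for prop in optional_properties:
        if data.get(prop) is not None:
            payload[prop] = data[prop]
    return payload

print(serialize_optional({"file_name": "scan.dcm", "fps": 25}))
# {'file_name': 'scan.dcm', 'fps': 25}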

Comment on lines 528 to 555
single_file_items = self.local_files
upload_payloads = []
if self.multi_file_items:
    upload_payloads.extend(
        [
            {
                "items": [file.serialize_v2() for file in file_chunk],
                "options": {"ignore_dicom_layout": True},
            }
            for file_chunk in chunk(self.multi_file_items, chunk_size)
        ]
    )
    local_files_for_multi_file_items = [
        file
        for multi_file_item in self.multi_file_items
        for file in multi_file_item.files
    ]
    single_file_items = [
        file
        for file in single_file_items
        if file not in local_files_for_multi_file_items
    ]

upload_payloads.extend(
    [
        {"items": [file.serialize_v2() for file in file_chunk]}
        for file_chunk in chunk(single_file_items, chunk_size)
    ]
)
Collaborator Author

We do it this way so that a single push() operation is capable of uploading both single-file items and multi-file items, which leaves room to develop a more complex multi-file item upload feature in the future

Contributor

I think it is safe to default ignore_dicom_layout to true.

Comment on lines +261 to +266
filtered_files = [
    f for f in found_files if str(f) not in files_to_exclude_full_paths
]
if sort:
    return natsorted(filtered_files)
return filtered_files
Collaborator Author

We do this because when uploading multi-file items, the order of the files is important. Natural sorting seems like the most appropriate approach to me.
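
A quick demonstration of the difference, using the natsort package that natsorted comes from:

from natsort import natsorted

files = ["slice_10.dcm", "slice_2.dcm", "slice_1.dcm"]
print(sorted(files))     # ['slice_1.dcm', 'slice_10.dcm', 'slice_2.dcm'] - slices out of order
print(natsorted(files))  # ['slice_1.dcm', 'slice_2.dcm', 'slice_10.dcm']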

@JBWilkie JBWilkie force-pushed the multi-file-push branch 2 times, most recently from c69d1db to b4a52c0, on September 3, 2024 at 10:44
@JBWilkie JBWilkie changed the title [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517] Multi-file push [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517][External] Multi-file push Sep 3, 2024
item.reason,
)
for slot in item.slots:
    if slot["reason"] != "ALREADY_EXISTS":


This check is different from what we do at line 802. Do we expect reason to always be defined and uppercase at this point?

Collaborator Author

From looking at the relevant BE section, yes - every blocked item will have a reason, and all the reasons are uppercase. However, I don't see a downside to introducing handling for potentially missing or lowercase reasons, so I can adjust this section to function as above

other_skipped_items.append(item)
for slot in item.slots:
    if (slot["reason"] is not None) and (
        slot["reason"].upper() == "ALREADY_EXISTS"


I know this was pre-existing, but with strings such as "ALREADY_EXISTS" and "UPLOAD_REQUEST" it's too easy to make a typo. I usually create a variable to contain such constants.

Collaborator Author

Good point. I'll move reasons into a const based on their BE values
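
A sketch of that change; the constant and function names here are illustrative, not the PR's final code:

BLOCKED_REASON_ALREADY_EXISTS = "ALREADY_EXISTS"

def slot_already_exists(slot: dict) -> bool:
    # Tolerates missing or lowercase reasons, per the discussion above.
    reason = slot.get("reason")
    return reason is not None and reason.upper() == BLOCKED_REASON_ALREADY_EXISTS

print(slot_already_exists({"reason": "already_exists"}))  # True
print(slot_already_exists({}))                            # False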

Collaborator Author

Also, "UPLOAD_REQUEST" doesn't appear to be a BE constant. Instead, it looks like an arbitrarily chosen string to represent the stage of the upload where the error occurred

Section of PR where it was introduced

Comment on lines 225 to 230
merge_incompatible_args = {
    "preserve_folders": preserve_folders,
    "as_frames": as_frames,
    "extract_views": extract_views,
}


Also here, IMO, there's a high risk of a typo. I'd:

# Define constants for the keys
PRESERVE_FOLDERS_KEY = "preserve_folders"
AS_FRAMES_KEY = "as_frames"
EXTRACT_VIEWS_KEY = "extract_views"

# Use the constants in the dictionary
merge_incompatible_args = {
    PRESERVE_FOLDERS_KEY: preserve_folders,
    AS_FRAMES_KEY: as_frames,
    EXTRACT_VIEWS_KEY: extract_views,
}

@@ -842,3 +862,136 @@ def register_multi_slotted(
print(f" - {item}")
print(f"Reistration complete. Check your items in the dataset: {self.slug}")
return results


def _find_files_to_upload_merging(


nit: _find_files_to_upload_with_merging. I think it reads a bit better.
Looking at the function below, this could also be _find_files_to_upload_as_multislot?

Collaborator Author

Addressed in the comment below; there are better, more generic names I'll use

Comment on lines 874 to 877
Finds files to upload as either:
- Multi-slotted items
- Multi-channel items
- Single-slotted items containing multiple `.dcm` files


Is this too specific? If we add an additional modality we'll have to modify the doc as well. To be fair it's not a big deal, but it's easy to forget to update the doc. We could say:
find files to upload according to the item_merge_mode?

return local_files, multi_file_items


def _find_files_to_upload_no_merging(


_find_files_to_upload_as_singleslot reads a bit better. Usually I avoid defining something by negating something else.

Collaborator Author

How about naming the pair of functions:

  • _find_files_to_upload_as_single_file_items
  • _find_files_to_upload_as_multi_file_items

This is generic enough that we can add modalities in the future

Creates the layout to be used when uploading the files as a dataset item:
- For multi-slotted items: LayoutV2
- For series items: LayoutV2, but only with `.dcm` files
- For multi-channel items: LayoutV3


NB: LayoutV3 can also be used to upload multi-slot items and series. Maybe it's easier to use a single layout version below?

Collaborator Author

If we want to maintain the ability to upload files with >16 slots, then unfortunately I think we have to stick to LayoutV2. Confirming with the channels engineers
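
For context, a hedged comparison of the two layout shapes under discussion. Only the slots_grid key is confirmed by this diff; the other field names are assumptions:

# LayoutV2-style grid, which can address more than 16 slots.
layout_v2 = {"version": 2, "type": "grid", "slots": [str(i) for i in range(20)]}

# LayoutV3-style grid; for multi-channel items the innermost list is capped at 16.
layout_v3 = {"version": 3, "slots_grid": [[["red.tiff", "green.tiff", "blue.tiff"]]]}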

Comment on lines +269 to +271
raise ValueError(
    f"No multi-channel item can have more than 16 files. The following directory has {len(self.files)} files: {self.directory}"
)


Should we get this from the API response? So we don't have to update dpy when changing this on the BE?

Collaborator Author
@JBWilkie Sep 10, 2024

Ideally, yes. The problem is: How do we get the limit from the API? Is there a call we can make to get the limit?

My understanding is that there isn't a limit at the moment


There isn't; basically dpy tries to upload and gets an error from the BE. We could argue it's late in terms of UX, but I have the feeling the limitation will vary in the future and I'm not sure it's a good use of our time to update dpy every time for such a scenario 🤔

Collaborator Author

I see 2 difficulties with this:

  • If we get an upload error from the BE that we've exceeded the limit, how do we know what the limit is? Is the error guaranteed to include the limit?
  • This section of the code, where we're creating each MultiFileItem, isn't where we perform the upload. We construct all the MultiFileItems first, then move to upload afterward. This means that if one item exceeds the limit but others don't, we could end up with a partially completed upload at the point of error

"slots_grid": [[[file.local_path.name for file in self.files]]],
}

def serialize_v2(self):


Does v2 refer to the layout version? Because I also see the CHANNELS mode managed below, and that is only v3

Collaborator Author

v2 here refers to version 2 of Darwin JSON - it's leftover from when darwin-py used to support Darwin JSON 1.0 as well. The function could be renamed to serialize_darwin_json_v2 or something similar?


Yeah that could work 💯

f"Cannot match {item.full_path} from payload with files to upload"
for slot in item.slots:
upload_id = slot["upload_id"]
slot_path = (Path(item.path) / Path((slot["file_name"]))).as_posix()


Is slot a TypedDict? In that case we shouldn't be able to have a typo in file_name

Collaborator Author

slot is part of a raw API response, so a typo shouldn't be possible - but I'm not sure I follow. Could you expand?


If you type "file_nameS", would the linter tell you that it's not a valid value, or would we get an error only at runtime?

Collaborator Author

Changed - I've put in 2 classes for better type safety of API responses
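
A sketch of the idea (class and field names are illustrative, not necessarily the two classes added in the PR): with a TypedDict, a static type checker flags a mistyped key such as slot["file_nameS"] before runtime:

from typing import Optional, TypedDict

class SlotPayload(TypedDict):
    file_name: str
    upload_id: str
    reason: Optional[str]

def slot_file_name(slot: SlotPayload) -> str:
    return slot["file_name"]  # slot["file_nameS"] would be rejected by mypy/pyright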

@JBWilkie JBWilkie merged commit 59de0b3 into master Sep 26, 2024
20 checks passed