[DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517][External] Multi-file push #923

Merged
merged 13 commits into master from multi-file-push on Sep 26, 2024

Conversation

JBWilkie
Collaborator

@JBWilkie commented Sep 2, 2024

Problem

Currently, push only supports single-slotted, single-file items. It should support multi-slotted, multi-channel, and DICOM series items.

Solution

Support for multi-file items is added through the new item_merge_mode argument to push. Details below:

item_merge_mode: Enum["slots", "series", "channels"]

It is available through both CLI & SDK-initiated push() with this syntax:

# CLI
darwin dataset push /path/to/files --item-merge-mode [slots | series | channels]
# SDK
from darwin.client import Client

team_slug = "team_slug"
dataset_slug = "dataset_slug"
files_to_upload = ["path/to/dir_1", "path/to/dir_2", "...", "path/to/dir_n"]
item_merge_mode = "slots"  # one of "slots", "series", or "channels"

client = Client.local()
dataset = client.get_remote_dataset(dataset_identifier=f"{team_slug}/{dataset_slug}")
dataset.push(files_to_upload=files_to_upload, item_merge_mode=item_merge_mode)

Because item_merge_mode is incompatible with preserve_folders, we raise an error if both are set:
dataset.push(files_to_upload=files_to_upload, item_merge_mode=item_merge_mode, preserve_folders=True)
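
A condensed sketch of how such a guard can work, modelled on the merge_incompatible_args dict quoted later in this thread; the function name and error type here are illustrative, not the PR's exact code:

# Illustrative sketch only - names are hypothetical.
def _validate_merge_args(item_merge_mode, preserve_folders, as_frames, extract_views):
    if item_merge_mode is None:
        return
    merge_incompatible_args = {
        "preserve_folders": preserve_folders,
        "as_frames": as_frames,
        "extract_views": extract_views,
    }
    enabled = [name for name, value in merge_incompatible_args.items() if value]
    if enabled:
        raise TypeError(
            f"`item_merge_mode` is incompatible with: {', '.join(enabled)}"
        )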

Given this, we throw one non-blocking warning for each instance of:

  • A file path passed to push() that references a specific file rather than a directory, since these are ignored
  • A folder that, after being parsed according to the specified item_merge_mode rules, results in an empty directory

Behaviour of item_merge_mode options

Each file path passed to push must be a directory. The files inside the first level of that directory are treated as follows:

Multi-slotted items

  • Each file is assigned its own slot
  • Dataset items are named after the folder containing the files that result in the item
  • Slots are named as incrementing integers counting from 0
  • The layout shape is set according to the number of files, to ensure an optimal arrangement (see the sketch below)
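
A minimal sketch of how an optimal grid shape can be derived from the file count; this is an assumption about the approach, not the PR's exact code:

import math

def grid_shape(num_files: int) -> tuple:
    # Choose a near-square grid: columns = ceil(sqrt(n)), then enough rows to fit.
    cols = math.ceil(math.sqrt(num_files))
    rows = math.ceil(num_files / cols)
    return (cols, rows)

print(grid_shape(1))  # (1, 1)
print(grid_shape(6))  # (3, 2)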

Multi-channel items

  • Each file is assigned to a channel
  • Dataset items are named after the folder containing the files that result in the item
  • Each slot is named after the file in the slot
  • No item may have more than 16 channels; we throw an error if this occurs (see the example below)
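
For illustration, a hedged sketch of what a LayoutV3 payload for a three-channel item could look like, with each slot named after its file. Only the slots_grid key is confirmed by the diff later in this thread; the surrounding shape is an assumption:

# Hypothetical three-channel item built from a folder of three files.
layout = {
    "version": 3,
    "slots_grid": [[["red.tiff", "green.tiff", "blue.tiff"]]],
}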

DICOM series

  • Each .dcm file is concatenated into the same slot
  • Other file types are ignored
  • The slot is named after the folder containing the files that result in the item
  • Since we can assume users intend to concatenate input files, we apply the ignore_dicom_layout option to ensure they are concatenated, even if multiple volumes are detected (see the CLI example below)
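
For example, a series push from the CLI (the path is illustrative) looks like:

darwin dataset push /path/to/series_directories --item-merge-mode series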

Changelog

Added push support for multi-file items, specifically:

  • Multi-slotted items
  • Multi-channel items
  • DICOM series consisting of multiple concatenated .dcm files

wiz-inc-4ad3b29aa7 bot commented Sep 2, 2024

Wiz Scan Summary

Scan Module            Critical  High  Medium  Low  Info  Total
IaC Misconfigurations  0         0     0       0    0     0
Vulnerabilities        0         2     1       0    0     3
Sensitive Data         0         0     0       0    0     0
Secrets                0         0     0       0    0     0
Total                  0         2     1       0    0     3

View scan details in Wiz

@JBWilkie JBWilkie changed the title [DAR-3513][DAR-3514][DAR-3515][DAR-3516] Multi-file push [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517] Multi-file push Sep 2, 2024

# Need to add a `data` attribute for MultiFileItem? Where do we get `fps` from?
def serialize_v2(self):
    optional_properties = ["fps"]
Collaborator Author
@JBWilkie Sep 2, 2024

optional_properties only contains 1 item for now, but this is to mirror how the serialize_v2() method works for LocalFile. If we add support for other properties such as as_frames or extract_views in the future, this will be easy to extend
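
A self-contained sketch of the pattern being mirrored (an assumption based on the description above, not LocalFile's exact code): only optional keys that are actually set get copied into the payload, so extending support is a one-line change to the list:

def serialize_optional(data: dict) -> dict:
    # Hypothetical helper: copy across only the optional properties that are set.
    optional_properties = ["fps", "as_frames", "extract_views"]
    payload = {"file_name": data["file_name"]}
    for prop in optional_properties:
        if data.get(prop) is not None:
            payload[prop] = data[prop]
    return payload

print(serialize_optional({"file_name": "scan.dcm", "fps": 25}))
# {'file_name': 'scan.dcm', 'fps': 25}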

Comment on lines 528 to 555
single_file_items = self.local_files
upload_payloads = []
if self.multi_file_items:
    upload_payloads.extend(
        [
            {
                "items": [file.serialize_v2() for file in file_chunk],
                "options": {"ignore_dicom_layout": True},
            }
            for file_chunk in chunk(self.multi_file_items, chunk_size)
        ]
    )
    local_files_for_multi_file_items = [
        file
        for multi_file_item in self.multi_file_items
        for file in multi_file_item.files
    ]
    single_file_items = [
        file
        for file in single_file_items
        if file not in local_files_for_multi_file_items
    ]

upload_payloads.extend(
    [
        {"items": [file.serialize_v2() for file in file_chunk]}
        for file_chunk in chunk(single_file_items, chunk_size)
    ]
)
Collaborator Author

We do it this way so that a single push() operation is capable of uploading both single-file items and multi-file items, which leaves room to develop a more complex multi-file item upload feature in the future

Contributor

I think it is safe to default ignore_dicom_layout to true.

Comment on lines +261 to +266
filtered_files = [
    f for f in found_files if str(f) not in files_to_exclude_full_paths
]
if sort:
    return natsorted(filtered_files)
return filtered_files
Collaborator Author

We do this because when uploading multi-file items, the order of the files is important. Natural sorting seems like the most appropriate approach to me.
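
A quick demonstration of the difference, using the natsort package that natsorted comes from:

from natsort import natsorted

files = ["slice_10.dcm", "slice_2.dcm", "slice_1.dcm"]
print(sorted(files))     # ['slice_1.dcm', 'slice_10.dcm', 'slice_2.dcm'] - slices out of order
print(natsorted(files))  # ['slice_1.dcm', 'slice_2.dcm', 'slice_10.dcm']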

@JBWilkie JBWilkie force-pushed the multi-file-push branch 2 times, most recently from c69d1db to b4a52c0, on September 3, 2024 at 10:44
@JBWilkie JBWilkie changed the title [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517] Multi-file push [DAR-3513][DAR-3514][DAR-3515][DAR-3516][DAR-3517][External] Multi-file push Sep 3, 2024
item.reason,
)
for slot in item.slots:
    if slot["reason"] != "ALREADY_EXISTS":


This check is different from what we do at line 802. Do we expect reason to always be defined and uppercase at this point?

Collaborator Author

From looking at the relevant BE section, yes - every blocked item will have a reason, and all the reasons are uppercase. However, I don't see a downside to introducing handling for potentially missing or lowercase reasons, so I can adjust this section to function as above

other_skipped_items.append(item)
for slot in item.slots:
    if (slot["reason"] is not None) and (
        slot["reason"].upper() == "ALREADY_EXISTS"


I know this was pre-existing, but with strings such as "ALREADY_EXISTS" and "UPLOAD_REQUEST" it's too easy to make a typo. I usually create a variable to contain such constants.

Collaborator Author

Good point. I'll move reasons into a const based on their BE values
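
A sketch of that change; the constant and function names here are illustrative, not the PR's final code:

BLOCKED_REASON_ALREADY_EXISTS = "ALREADY_EXISTS"

def slot_already_exists(slot: dict) -> bool:
    # Tolerates missing or lowercase reasons, per the discussion above.
    reason = slot.get("reason")
    return reason is not None and reason.upper() == BLOCKED_REASON_ALREADY_EXISTS

print(slot_already_exists({"reason": "already_exists"}))  # True
print(slot_already_exists({}))                            # False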

Collaborator Author

Also, "UPLOAD_REQUEST" doesn't appear to be a BE constant. Instead, it looks like an arbitrarily chosen string to represent the stage of the upload where the error occurred

Section of PR where it was introduced

Comment on lines 225 to 230
merge_incompatible_args = {
    "preserve_folders": preserve_folders,
    "as_frames": as_frames,
    "extract_views": extract_views,
}


Also here, IMO, there's a high risk of a typo. I'd:

# Define constants for the keys
PRESERVE_FOLDERS_KEY = "preserve_folders"
AS_FRAMES_KEY = "as_frames"
EXTRACT_VIEWS_KEY = "extract_views"

# Use the constants in the dictionary
merge_incompatible_args = {
    PRESERVE_FOLDERS_KEY: preserve_folders,
    AS_FRAMES_KEY: as_frames,
    EXTRACT_VIEWS_KEY: extract_views,
}

@@ -842,3 +862,136 @@ def register_multi_slotted(
print(f" - {item}")
print(f"Reistration complete. Check your items in the dataset: {self.slug}")
return results


def _find_files_to_upload_merging(


nit: _find_files_to_upload_with_merging. I think it reads a bit better.
Looking at the function below, this could also be _find_files_to_upload_as_multislot?

Collaborator Author

Addressed in the comment below; there are better, more generic names I'll use

Comment on lines 874 to 877
Finds files to upload as either:
- Multi-slotted items
- Multi-channel items
- Single-slotted items containing multiple `.dcm` files


Is this too specific? If we add an additional modality we'll have to modify the doc as well. To be fair it's not a big deal, but it's easy to forget to update the doc. We could say:
find files to upload according to the item_merge_mode?

return local_files, multi_file_items


def _find_files_to_upload_no_merging(


_find_files_to_upload_as_singleslot reads a bit better. Usually I avoid defining something by negating something else.

Collaborator Author

How about naming the pair of functions:

  • _find_files_to_upload_as_single_file_items
  • _find_files_to_upload_as_multi_file_items

This is generic enough that we can add modalities in the future

Creates the layout to be used when uploading the files as a dataset item:
- For multi-slotted items: LayoutV2
- For series items: LayoutV2, but only with `.dcm` files
- For multi-channel items: LayoutV3


NB: LayoutV3 can also be used to upload multi-slot items and series. Maybe it's easier to use a single layout version below?

Collaborator Author

If we want to maintain the ability to upload files with >16 slots, then unfortunately I think we have to stick to LayoutV2. Confirming with the channels engineers
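
For context, a hedged comparison of the two layout shapes under discussion. Only the slots_grid key is confirmed by this diff; the other field names are assumptions:

# LayoutV2-style grid, which can address more than 16 slots.
layout_v2 = {"version": 2, "type": "grid", "slots": [str(i) for i in range(20)]}

# LayoutV3-style grid; for multi-channel items the innermost list is capped at 16.
layout_v3 = {"version": 3, "slots_grid": [[["red.tiff", "green.tiff", "blue.tiff"]]]}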

Comment on lines +269 to +271
raise ValueError(
    f"No multi-channel item can have more than 16 files. The following directory has {len(self.files)} files: {self.directory}"
)


Should we get this from the API response? So we don't have to update dpy when changing this on the BE?

Collaborator Author
@JBWilkie Sep 10, 2024

Ideally, yes. The problem is: How do we get the limit from the API? Is there a call we can make to get the limit?

My understanding is that there isn't a limit at the moment


There isn't; basically dpy tries to upload and gets an error from the BE. We could argue it's late in terms of UX, but I have the feeling the limitation will vary in the future and I'm not sure it's a good use of our time to update dpy every time for such a scenario 🤔

Collaborator Author

I see 2 difficulties with this:

  • If we get an upload error from the BE that we've exceeded the limit, how do we know what the limit is? Is the error guaranteed to include the limit?
  • This section of the code, where we're creating each MultiFileItem, isn't where we perform the upload. We construct all the MultiFileItems first, then move to upload afterward. This means that if one item exceeds the limit but others don't, we could end up with a partially completed upload at the point of error

"slots_grid": [[[file.local_path.name for file in self.files]]],
}

def serialize_v2(self):


Does v2 refer to the layout version? Because I also see the CHANNELS mode managed below, and that is only v3

Collaborator Author

v2 here refers to version 2 of Darwin JSON - it's leftover from when darwin-py used to support Darwin JSON 1.0 as well. The function could be renamed to serialize_darwin_json_v2 or something similar?


Yeah that could work 💯

f"Cannot match {item.full_path} from payload with files to upload"
for slot in item.slots:
upload_id = slot["upload_id"]
slot_path = (Path(item.path) / Path((slot["file_name"]))).as_posix()


Is slot a TypedDict? In that case we shouldn't be able to have a typo in file_name

Collaborator Author

slot is part of a raw API response, so a typo shouldn't be possible - but I'm not sure I follow. Could you expand?


If you type "file_nameS", would the linter tell you that it's not a valid value, or would we get an error only at runtime?

Collaborator Author

Changed - I've put in 2 classes for better type safety of API responses
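
A sketch of the idea (class and field names are illustrative, not necessarily the two classes added in the PR): with a TypedDict, a static type checker flags a mistyped key such as slot["file_nameS"] before runtime:

from typing import Optional, TypedDict

class SlotPayload(TypedDict):
    file_name: str
    upload_id: str
    reason: Optional[str]

def slot_file_name(slot: SlotPayload) -> str:
    return slot["file_name"]  # slot["file_nameS"] would be rejected by mypy/pyright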

@JBWilkie JBWilkie merged commit 59de0b3 into master Sep 26, 2024
20 checks passed