Add barcode logic to CytoSnake's CLI #46

axiomcura · 2023-05-04T16:13:36Z

About this PR

This PR adds logic to CytoSnake for handling user-provided CLI inputs. Specifically, this update introduces barcode logic.

By default, a barcode is not required as an input to run CytoSnake; however, there are some exceptions when barcodes are needed.

If a user provides a dataset that has been generated from multiple experiments, then multiple plate maps are associated with the generated data. This will require a barcode file in order for CytoSnake to know which plate dataset is associated with which experiment.

What's new?

A new module, input_guards.py, was created. It will handle all the CLI logic.
New function: check_init_parameter_inputs(), which takes user-provided parameters and checks for discrepancies.

Implementation

Barcode Logic Design. The image above presents a diagram of how the barcode logic is handled within CytoSnake. In this example, we have 6 plate datasets that are separated into two groups. Each group represents an experiment that was conducted to generate the data. In addition, there is a metadata folder in which each platemap file is associated with one group of plate data. A) Demonstrates a user providing both plate data groups and a metadata file, but init mode fails to complete, causing the "raise barcode error" step to light up. B) Demonstrates a successful init run where the user provides all the necessary inputs, causing the "conduct init" step to light up.

What dictates the barcode logic is the number of plate maps found within the metadata folder. If CytoSnake sees that there is more than 1 plate map, then it requires a barcode. Therefore, users must provide barcodes if multiple plate maps are present.
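The rule described above can be sketched roughly as follows. This is a minimal illustration, not CytoSnake's actual implementation: the function name `is_barcode_required` and the assumption that plate maps are CSV files stored directly in one folder are hypothetical.

```python
import tempfile
from pathlib import Path


def is_barcode_required(platemap_dir: Path) -> bool:
    """Return True when more than one plate map file is found.

    Assumption: plate maps are CSV files stored directly in `platemap_dir`.
    """
    n_platemaps = sum(1 for path in platemap_dir.glob("*.csv") if path.is_file())
    return n_platemaps > 1


# Usage: build a throwaway metadata folder and add plate maps one at a time
with tempfile.TemporaryDirectory() as tmp:
    metadata = Path(tmp)
    (metadata / "platemap_1.csv").touch()
    single = is_barcode_required(metadata)    # one plate map: no barcode needed
    (metadata / "platemap_2.csv").touch()
    multiple = is_barcode_required(metadata)  # two plate maps: barcode needed

print(single, multiple)
```

Keeping the check as a pure function of the folder contents makes it easy to test separately from the CLI.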

Additional Notes

Changes in workflow

Update on workflows:

This should have been taken care of in New Workflow: cp_process_singlecells ! #37, where workflow configs were created for cp_process_singlecells. However, a workflow config was not created for cp_process, so this PR contains a new workflow config for cp_process.

d33bs

Nice work! I left a few comments and overall thought things looked good! Please don't hesitate to let me know if you have any questions.

One additional thought:

I love the activity diagram you provided in the PR description! In reading it I wondered: why are barcodes only required for multiple experiments (and not for single experiments)? For example, if two experiments need to be compared but they are processed individually, would we run into issues when attempting to compare things later on?
Similarly: are there ever scenarios where we don't have the barcode file but need to run analyses on the experiments? Here especially I'm thinking about previously gathered data where one may no longer have access to all data, or perhaps the data is stored in an unrecognizable format. In these scenarios could you simulate the barcode file's data (providing notation that it's simulated) to help facilitate the work involved with this PR?

d33bs · 2023-05-04T17:28:28Z

configs/wf_configs/cp_process.yaml

+    compression_options:
+      method: "gzip"
+      mtime: 1


Seeing how patterns in the configs occur (like compression_options), consider making use of global variable references (if possible) within these files. These could be referenced within the individual fields to ensure consistency and reduce maintenance in the instances where change is required.

Hmmm. It seems like you can assign variables within the configurations and reuse them (learned something new today).

Here's an example below:

    # test.yaml file
    compression_options: &DEFAULT_COMPRESSION
      method: "gzip"
      mtime: 1

    first_value: *DEFAULT_COMPRESSION

Here's the code used to read the test YAML file:

    import yaml

    with open("./test.yaml", mode="r") as f:
        data = yaml.safe_load(f)

    print(data)

And here's the output:

{'compression_options': {'method': 'gzip', 'mtime': 1}, 'first_value': {'method': 'gzip', 'mtime': 1}}

So it seems like you can; however, I could see some issues with this in terms of readability. Maybe I am naïve, but I have not seen a config file that uses global variables. This raises the question of whether using config variables is a common practice. For example, do we expect the majority of users to know what these variables are and how they are used within the config file? (I do not know the right answer to this; maybe you, @MattsonCam, and/or @gwaybio might know.)

I can see this working perfectly in the private configs that exist within the .cytosnake directory for development purposes, but I am quite "iffy" about how users will react to variables in config files.

Great findings! One way you could consider addressing understandability here would be with comments (for ex: # this line creates a global config variable used later as *VARIABLE_NAME). I understand reasons why you might consider not using this method and defer to you on what's best.

cytosnake/cli/cmd.py

workflows/workflow/cp_process.smk

cytosnake/guards/input_guards.py

d33bs · 2023-05-04T17:39:29Z

cytosnake/guards/input_guards.py

+    if not is_barcode_required:
+        BarcodeRequiredError("Barcode is required, multiple platemaps found")


Double checking: is this block checking whether the barcode file is required or whether it is missing? If it's checking for whether it's missing, consider using "missing" (or similar) in the variable and object names for clarity.

It is checking if the barcode is missing.

Are you suggesting something like this?:

    if is_missing_barcode:
        BarcodeRequiredError("Barcode is missing")

I have done some changes with the naming, let me know if it works.

Thank you for the clarification! In addition to the message, I'm wondering if the exception name itself also should reflect what is "exceptional" or "erroneous". For example, BarcodeMissingError or similar. (this might require a change to the exception itself).

Ah good point. I'll apply those changes.
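A minimal sketch of the naming discussed in this thread, under stated assumptions: `BarcodeMissingError` is the name proposed above, while the helper `check_barcode` and its parameters are hypothetical. Note that the snippets quoted in this thread construct the exception without `raise`; the sketch assumes raising was the intent.

```python
from pathlib import Path
from typing import Optional


class BarcodeMissingError(Exception):
    """Raised when multiple plate maps are found but no barcode file was given."""


def check_barcode(n_platemaps: int, barcode: Optional[Path] = None) -> None:
    """Raise BarcodeMissingError when a barcode is needed but absent."""
    # A barcode is only needed when more than one plate map is present
    is_missing_barcode = n_platemaps > 1 and barcode is None
    if is_missing_barcode:
        # Without `raise`, the exception object would be created and discarded
        raise BarcodeMissingError("Barcode is missing, multiple platemaps found")
```

With this naming, both the variable and the exception describe the erroneous state (a missing barcode) rather than the requirement.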

.pre-commit-config.yaml

configs/wf_configs/cp_process.yaml

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

axiomcura · 2023-05-08T16:03:15Z

@d33bs Hopefully I have addressed all your comments. Also, thanks for the great questions!

why are barcodes only required for multiple experiments (and not for single experiments)? For example, if two experiments need to be compared but they are processed individually, would we run into issues when attempting to compare things later on?

I will use this repo to explain my current understanding.

From what I understand, the barcode file provides an assay-to-platename pairing. The assay plates are sqlite files (in this case) from cytominer-database. In the barcode file, each assay is associated with a specific plate map (Plate_Map_Name column), which contains metadata that includes well position, perturbations, etc. Looking at the barcode structure, there are some plate names that repeat 3 times (first 3, middle 3, and last 3), indicating that 3 experiments were conducted in triplicate. (Different plate map name = separate experiment)

Technically, there is no need to have a barcode for a single experiment because all plates will share the same external factors (assuming that more than one plate was used in the experiment). Barcodes are only required when separate experiments (3 in this example) were conducted across multiple plates. The barcode then identifies which plates were involved in which experiment, so the correct metadata is mapped to those plates during downstream analysis.
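To make the pairing concrete, here is a small sketch of grouping assay plates by plate map. The `Plate_Map_Name` column comes from the description above; the `Assay_Plate_Barcode` column name, the helper `group_plates_by_platemap`, and the table contents are made up for illustration and are not taken from the referenced repo.

```python
import csv
import io
from collections import defaultdict

# Made-up barcode contents, for illustration only
EXAMPLE_BARCODES = """\
Assay_Plate_Barcode,Plate_Map_Name
plate_1,platemap_A
plate_2,platemap_A
plate_3,platemap_A
plate_4,platemap_B
plate_5,platemap_B
plate_6,platemap_B
"""


def group_plates_by_platemap(barcode_file) -> dict:
    """Group assay plates by the plate map (experiment) they belong to."""
    groups = defaultdict(list)
    for row in csv.DictReader(barcode_file):
        groups[row["Plate_Map_Name"]].append(row["Assay_Plate_Barcode"])
    return dict(groups)


groups = group_plates_by_platemap(io.StringIO(EXAMPLE_BARCODES))
for platemap, plates in groups.items():
    print(platemap, plates)
```

Each key corresponds to one experiment, so the lookup tells the annotation step which plate map to pair with each assay plate.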

would we run into issues when attempting to compare things later on?
Since the metadata (platemaps) can be incorporated within the single-cell / aggregate morphological profiles, you can stratify them based on experiments. The merging of the metadata into the morphology profiles is conducted by using pycytominer's annotate, which requires both the profile and platemap as inputs.

However, one needs to map the correct assay to the associated platemap, which CytoSnake does when annotating multiple plate datasets (assays).

Similarly: are there ever scenarios where we don't have the barcode file but need to run analyses on the experiments? Here especially I'm thinking about previously gathered data where one may no longer have access to all data, or perhaps the data is stored in an unrecognizable format. In these scenarios could you simulate the barcode file's data (providing notation that it's simulated) to help facilitate the work involved with this PR?

The barcode file only provides information that distinguishes which plates came from which experiment. Assuming that the data you are describing came from 3 separate experiments and no barcodes were provided, a potential solution is to contact the person who generated the dataset and ask which plates came from which experiment.

However, in the scenario where no plate maps were provided, we would not know which external treatments were applied to the cells or which experiments used which treatments/cell lines. Therefore, the barcode data would be difficult to simulate due to the lack of important metadata such as treatments, well positions, and cell lines used.

d33bs

Great work! Excited to see the testing additions! I left a few comments and suggestions throughout this review, please don't hesitate to let me know if you have a question about anything.

Also very much appreciate your answers to the barcode details! Your response seems like great documentation material potentially - would it make sense to store that information for later reference in this project?

d33bs · 2023-05-08T19:31:13Z

configs/wf_configs/cp_process.yaml

+    compression_options:
+      method: "gzip"
+      mtime: 1


Great findings! One way you could consider addressing understandability here would be with comments (for ex: # this line creates a global config variable used later as *VARIABLE_NAME). I understand reasons why you might consider not using this method and defer to you on what's best.

cytosnake/tests/functional/test_cli.py

d33bs · 2023-05-08T19:40:52Z

cytosnake/tests/functional/test_cli.py

+    """Used to clean up directories in every single test run"""
+
+    def __init__(self, tmp_path):
+        self.tmp_path = tmp_path


Consider the use of tempfile here to assist with the creation of temporary files or directories.
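As a rough illustration of the tempfile suggestion, the sketch below creates scratch inputs inside `tempfile.TemporaryDirectory`, which removes the directory automatically when the block exits. The helper name `make_scratch_inputs` and the file names are hypothetical and not part of the PR.

```python
import tempfile
from pathlib import Path


def make_scratch_inputs():
    """Create dummy inputs in a temporary directory that cleans itself up."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / "metadata").mkdir()
        (tmp_path / "plate_1.sqlite").touch()
        created = sorted(p.name for p in tmp_path.iterdir())
    # The directory has been deleted by the time the `with` block exits
    return created, tmp_path.exists()


created, still_exists = make_scratch_inputs()
print(created, still_exists)
```

Because cleanup is tied to the context manager, no separate teardown class is needed for this pattern.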

cytosnake/tests/functional/test_cli.py

d33bs · 2023-05-08T19:59:56Z

workflows/scripts/annotate.py

@@ -62,7 +62,7 @@ def annotate_cells(
    barcode_platemap_df = pd.read_csv(barcodes_path)

    logging.info("Searching plate map name")
-    plate = Path(aggregated_data).name.split("_")[0]
+    Path(aggregated_data).name.split("_")[0]


Double checking: what will this line do - is it still necessary? If not, consider removing it.

sorry, this was a mistake! thanks for pointing this out

workflows/workflow/cp_process.smk

d33bs · 2023-05-08T20:16:08Z

cytosnake/tests/functional/test_cli.py

+    """
+
+    # get files to transfer
+    dataset_dir = pathlib.Path(f"./datasets/{testing_data_dir}").resolve(strict=True)


Pytest is sometimes used to run the entirety of the "tests" directory without being located within the relative structure of sub-folders. That said, consider making sure this line (and possibly others) are able to run without changes or document the expectations of this test file here (or alternatively within the docstring for the module).

Ah I see. Thanks for pointing this out. I double-checked and added some documentation regarding this!
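One way to make such a path independent of the directory pytest is launched from is to resolve it relative to the test module itself. The sketch below assumes a `datasets` folder that sits next to the test file; that layout and the helper name `dataset_dir_from` are hypothetical.

```python
from pathlib import Path


def dataset_dir_from(test_file: str, testing_data_dir: str) -> Path:
    """Build the dataset path relative to the test module, not the current directory."""
    # Assumption: a `datasets` folder lives alongside the test module
    return Path(test_file).resolve().parent / "datasets" / testing_data_dir


# Inside a test module this would typically be called as:
#   dataset_dir = dataset_dir_from(__file__, testing_data_dir)
```

The result is an absolute path, so it stays the same whether pytest is run from the repository root or from within the tests folder.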

d33bs · 2023-05-08T20:29:21Z

cytosnake/tests/functional/test_cli.py

+    # change directory to tmpdir
+    os.chdir(tmp_path)
+
+    # execute CytoSnake
+    cmd = "cytosnake init -d *.sqlite -m metadata"
+    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=False)
+    raised_error = get_raised_error(proc.stderr)
+
+    # leave testing dir
+    os.chdir(test_module)


If possible, consider sending the subdirectory information directly to CytoSnake, avoiding the need to change directory both into and out of the tmpdir. This might be my own misunderstanding, so please feel free to ignore if this isn't possible or useful in the context of this test.

Alternatively, seeing how this pattern repeats, it may be useful to create a context manager for remembering to move back to a directory after changes have occurred. See this SO reference for one example.

Thanks for the resource. This will be considered in #49
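A minimal sketch of the context manager idea mentioned above; the name `working_directory` is hypothetical. Python 3.11 and later also ship `contextlib.chdir`, which serves the same purpose.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it on exit."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Runs even if the body raises, so the test never leaves cwd changed
        os.chdir(previous)
```

A test could then wrap the `subprocess.run` call in `with working_directory(tmp_path):` instead of pairing two `os.chdir` calls by hand.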

d33bs · 2023-05-08T20:45:28Z

cytosnake/guards/input_guards.py

+    if not is_barcode_required:
+        BarcodeRequiredError("Barcode is required, multiple platemaps found")


Thank you for the clarification! In addition to the message, I'm wondering if the exception name itself also should reflect what is "exceptional" or "erroneous". For example, BarcodeMissingError or similar. (this might require a change to the exception itself).

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

axiomcura · 2023-05-09T19:27:17Z

I have applied all the changes. Merging now. If more work needs to be done, please feel free to re-open this PR.

axiomcura and others added 6 commits March 14, 2023 13:44

fixed minor pathing bugs

74f9959

Merge branch 'WayScience:main' into main

fbd9ffa

Merge branch 'WayScience:main' into main

32b9f18

Merge branch 'WayScience:main' into main

6a76c3c

update workflow structure, added cp_process config

8be9079

added barcode logic

8a6bb6a

axiomcura changed the title ~~Add barcode logic~~ Add barcode logic to CytoSnake's CLI May 4, 2023

axiomcura marked this pull request as ready for review May 4, 2023 17:11

axiomcura requested a review from d33bs May 4, 2023 17:14

d33bs approved these changes May 4, 2023

axiomcura self-assigned this May 4, 2023

axiomcura and others added 13 commits May 4, 2023 19:10

Update cytosnake/cli/cmd.py

00d9da6

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

Update cytosnake/guards/input_guards.py

e5c42a7

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

Update cytosnake/guards/input_guards.py

5fd8b92

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

pre-commit update

c3d8f41

applied new versions of pre-commit workflow

63b8de8

created stubbed functional tests

450f521

added dummy data for tests

1dbbbe0

added testing framework

5ba8421

applied ruff automatic fixing

a3f6331

fixed barcode logic bug

c0edc18

barcode logic tests

585e90d

added workflow documentation

e8c135e

update workflow documentation

68bd1b4

d33bs approved these changes May 8, 2023

Update cytosnake/tests/functional/test_cli.py

34c0fbc

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

axiomcura mentioned this pull request May 9, 2023

Implement PyTest fixtures for teardown code #48

Open

applied comments

22c9f5e

axiomcura merged commit 274665e into WayScience:main May 9, 2023

axiomcura deleted the add-barcode-logic branch May 16, 2023 12:27
