Add barcode logic to CytoSnake's CLI #46

Merged
merged 21 commits into from
May 9, 2023

Conversation

@axiomcura axiomcura commented May 4, 2023

About this PR

This PR adds logic to CytoSnake for handling user-provided CLI inputs. Specifically, this update introduces barcode logic.

By default, a barcode is not required as an input to run CytoSnake; however, there are some exceptions when barcodes are needed.

If a user provides a dataset that has been generated from multiple experiments, then multiple plate maps are associated with the generated data. This will require a barcode file in order for CytoSnake to know which plate dataset is associated with which experiment.

What's new?

  • A new module, input_guard.py, was created to handle all the CLI logic.
  • New function: check_init_parameter_inputs(), which takes user-provided parameters and checks for discrepancies.

Implementation

Screenshot 2023-05-03 at 10 02 41 PM

Barcode Logic Design. The image above presents a diagram of how the barcode logic is handled within CytoSnake. In this example, we have 6 plate datasets that are separated into two groups. Each group represents an experiment that was conducted to generate the data. In addition, there is a metadata folder, where each platemap file is associated with one group of plate data. A) Demonstrates a user providing both plate data groups and a metadata file, but the init mode fails to complete, causing the "raise barcode error" step to light up. B) Demonstrates a successful init run where the user provides all the necessary inputs, causing the "conduct init" step to light up.

What dictates the barcode logic is the number of plate maps found within the metadata folder. If CytoSnake sees more than 1 plate map, it requires a barcode. Therefore, users must provide barcodes if multiple plate maps are present.
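
As a rough illustration of the rule above (not the code in this PR), the check could be sketched like this. The function name, the `*.csv` pattern, and the assumption that plate maps sit directly inside one folder are all hypothetical; the real layout in CytoSnake may differ.

```python
from pathlib import Path


def is_barcode_required(platemap_dir: Path, pattern: str = "*.csv") -> bool:
    """Return True when more than one plate map file is found.

    Assumption: plate maps are CSV files stored directly inside
    `platemap_dir`. This is a hypothetical layout used for illustration.
    """
    # count only regular files that match the plate map pattern
    n_platemaps = sum(1 for p in platemap_dir.glob(pattern) if p.is_file())
    # a single plate map means a single experiment, so no barcode is needed
    return n_platemaps > 1
```

With one plate map the function returns False, and with two or more it returns True, which is the point where a missing barcode would become an error.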

Additional notes

Changes in workflow

Update on workflows:

  • This should have been taken care of in New Workflow: cp_process_singlecells ! #37, where workflow configs were created for cp_process_singlecells. However, no workflow config was created for cp_process, so this PR contains a new workflow config for cp_process.

@axiomcura axiomcura changed the title Add barcode logic Add barcode logic to CytoSnake's CLI May 4, 2023
@axiomcura axiomcura marked this pull request as ready for review May 4, 2023 17:11
@axiomcura axiomcura requested a review from d33bs May 4, 2023 17:14
@d33bs d33bs left a comment

Nice work! I left a few comments and overall thought things looked good! Please don't hesitate to let me know if you have any questions.

One additional thought:

  • I love the activity diagram you provided in the PR description! In reading it I wondered: why are barcodes only required for multiple experiments (and not for single experiments)? For example, if two experiments need to be compared but they are processed individually, would we run into issues when attempting to compare things later on?
  • Similarly: are there ever scenarios where we don't have the barcode file but need to run analyses on the experiments? Here especially I'm thinking about previously gathered data where one may no longer have access to all data, or perhaps the data is stored in an unrecognizable format. In these scenarios could you simulate the barcode file's data (providing notation that it's simulated) to help facilitate the work involved with this PR?

Comment on lines +43 to +45
compression_options:
  method: "gzip"
  mtime: 1
Member

Seeing how patterns in the configs occur (like compression_options), consider making use of global variable references (if possible) within these files. These could be referenced within the individual fields to ensure consistency and reduce maintenance in the instances where change is required.

Member Author

Hmmm. It seems like you can assign variables within the configurations and allow re-using: (learned something new today)

Here's an example below:

# test.yaml file
compression_options: &DEFAULT_COMPRESSION
  method: "gzip"
  mtime: 1

first_value: *DEFAULT_COMPRESSION

Here's the code used to read the test YAML file:

import yaml
with open("./test.yaml", mode="r") as f:
    data = yaml.safe_load(f)
print(data)

And here's the output:

{'compression_options': {'method': 'gzip', 'mtime': 1},
 'first_value': {'method': 'gzip', 'mtime': 1}}

So it seems like you can; however, I could see some issues with this in terms of readability. Maybe I am naïve, but I have not seen a config file that uses global variables. This raises the question of whether using config variables is a common practice. For example, do we expect the majority of users to know what these variables are and how they are used within the config file? (I do not know the right answer to this; maybe you, @MattsonCam, and/or @gwaybio might know.)

I can see this working perfectly in the private configs that exist within the .cytosnake directory for development purposes, but I am quite "iffy" about how users will react to variables in config files.

Member

Great findings! One way you could consider addressing understandability here would be with comments (for ex: # this line creates a global config variable used later as *VARIABLE_NAME). I understand reasons why you might consider not using this method and defer to you on what's best.

cytosnake/cli/cmd.py (outdated, resolved)
cytosnake/cli/cmd.py (outdated, resolved)
workflows/workflow/cp_process.smk (outdated, resolved)
cytosnake/guards/input_guards.py (outdated, resolved)
cytosnake/guards/input_guards.py (outdated, resolved)
Comment on lines 64 to 65
if not is_barcode_required:
    BarcodeRequiredError("Barcode is required, multiple platemaps found")
Member

Double checking: is this block checking whether the barcode file is required or whether it is missing? If it's checking for whether it's missing, consider using "missing" (or similar) in the variable and object names for clarity.

Member Author

It is checking if the barcode is missing.

Are you suggesting something like this?:

if is_missing_barcode:
    BarcodeRequiredError("Barcode is missing")

Member Author

I have done some changes with the naming, let me know if it works.

Member

Thank you for the clarification! In addition to the message, I'm wondering if the exception name itself also should reflect what is "exceptional" or "erroneous". For example, BarcodeMissingError or similar. (this might require a change to the exception itself).

Member Author

Ah good point. I'll apply those changes.
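
For reference, here is a minimal sketch of the naming discussed in this thread. It assumes a hypothetical helper, check_barcode, and a plain Exception subclass; the actual class hierarchy and function signature in CytoSnake are not shown in this conversation. Note that the exception has to be raised with `raise`; constructing it on its own line has no effect.

```python
from typing import Optional


class BarcodeMissingError(Exception):
    """Raised when a barcode file is needed but was not provided."""


def check_barcode(n_platemaps: int, barcode: Optional[str]) -> None:
    """Hypothetical guard: fail when multiple plate maps exist without a barcode."""
    # the variable name states the erroneous condition, as suggested in review
    is_barcode_missing = n_platemaps > 1 and barcode is None
    if is_barcode_missing:
        # `raise` is required; instantiating the exception alone does nothing
        raise BarcodeMissingError(
            "Barcode is missing: multiple platemaps found, a barcode file is required"
        )
```

Naming both the flag and the exception after the failure condition keeps the `if` statement and the traceback telling the same story.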

.pre-commit-config.yaml (outdated, resolved)
configs/wf_configs/cp_process.yaml (resolved)
@axiomcura axiomcura self-assigned this May 4, 2023
@axiomcura
Member Author

@d33bs Hopefully I have attended to all your comments. Also, thanks for the great questions!

why are barcodes only required for multiple experiments (and not for single experiments)? For example, if two experiments need to be compared but they are processed individually, would we run into issues when attempting to compare things later on?

I will use this repo to explain my current understanding.

From what I understand, the barcode file provides an assay-to-platename pairing, where the assay plates are sqlite files (in this case) from cytominer-database. In the barcode file we see that each assay is associated with a specific plate map (Plate_Map_Name column), which contains metadata information that includes well position, perturbations, etc. Looking at the barcode structure, there are some plate names that repeat 3 times (first 3, middle 3, and last 3), indicating that 3 experiments were conducted in triplicate. (Different plate map name = separate experiment)

Technically, there is no need to have a barcode for a single experiment because all plates share the same external factors (assuming that more than one plate was used in the experiment). Barcodes are only required when separate experiments (3 in this example) were conducted on multiple plates. In that case, the barcode identifies which plates were involved in which experiment, so the correct metadata is mapped to those plates during downstream analysis.

would we run into issues when attempting to compare things later on?
Since the metadata (platemaps) can be incorporated within the single-cell / aggregate morphological profiles, you can stratify them based on experiments. The metadata is merged into the morphology profiles using pycytominer's annotate, which requires both the profile and the platemap as inputs.

However, one needs to map the correct assay to its associated platemap, which CytoSnake does when annotating multiple plate datasets (assays).
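
To make the assay-to-platemap pairing concrete, here is a small sketch that groups plates by plate map name using only the standard library. The Plate_Map_Name column comes from the explanation above; the Assay_Plate_Barcode column name and the rows in EXAMPLE_BARCODE are assumptions made up for illustration, not data from the repo.

```python
import csv
import io
from collections import defaultdict

# hypothetical barcode file contents, invented for illustration only
EXAMPLE_BARCODE = """Assay_Plate_Barcode,Plate_Map_Name
plate_a,platemap_1
plate_b,platemap_1
plate_c,platemap_2
"""


def group_plates_by_platemap(barcode_csv: str) -> dict:
    """Map each plate map name to the list of assay plates that use it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(barcode_csv)):
        # plates sharing a plate map name belong to the same experiment
        groups[row["Plate_Map_Name"]].append(row["Assay_Plate_Barcode"])
    return dict(groups)
```

Each key in the returned dictionary corresponds to one experiment, so the matching plate map can be looked up for every plate before annotation.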

Similarly: are there ever scenarios where we don't have the barcode file but need to run analyses on the experiments? Here especially I'm thinking about previously gathered data where one may no longer have access to all data, or perhaps the data is stored in an unrecognizable format. In these scenarios could you simulate the barcode file's data (providing notation that it's simulated) to help facilitate the work involved with this PR?

The barcodes only provide information that distinguishes which plates came from which experiment. Assuming that the data you are talking about came from 3 separate experiments and no barcodes were provided, a potential solution is to contact the person who generated the dataset and ask which plates came from which experiment.

However, in a scenario where no plate maps were provided, we will not know what types of external treatments were added to the cells or which experiments contained which treatments/cell lines. Therefore, it will be difficult to simulate due to the lack of important metadata like treatments, well positions, and cell lines used.

@d33bs d33bs left a comment

Great work! Excited to see the testing additions! I left a few comments and suggestions throughout this review, please don't hesitate to let me know if you have a question about anything.

Also very much appreciate your answers to the barcode details! Your response seems like great documentation material potentially - would it make sense to store that information for later reference in this project?

cytosnake/tests/functional/test_cli.py (resolved)
cytosnake/tests/functional/test_cli.py (outdated, resolved)
"""Used to clean up directories in every single test run"""

def __init__(self, tmp_path):
    self.tmp_path = tmp_path
Member

Consider the use of tempfile here to assist with the creation of temporary files or directories.
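
A minimal sketch of the tempfile suggestion, assuming the test only needs a scratch directory that is cleaned up automatically. The function name and the plate1.sqlite file name are hypothetical and are not taken from the test module.

```python
import tempfile
from pathlib import Path


def run_in_scratch_dir() -> bool:
    """Create files in a temporary directory and confirm it is removed afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        # stand-in for copying test datasets into the scratch directory
        (scratch / "plate1.sqlite").touch()
        assert (scratch / "plate1.sqlite").exists()
    # leaving the `with` block deletes the directory and its contents
    return not scratch.exists()
```

Because cleanup happens when the `with` block exits, a separate clean-up class may not be needed for this case.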

cytosnake/tests/functional/test_cli.py (outdated, resolved)
@@ -62,7 +62,7 @@ def annotate_cells(
     barcode_platemap_df = pd.read_csv(barcodes_path)

     logging.info("Searching plate map name")
-    plate = Path(aggregated_data).name.split("_")[0]
+    Path(aggregated_data).name.split("_")[0]
Member

Double checking: what will this line do - is it still necessary? If not, consider removing it.

Member Author

sorry, this was a mistake! thanks for pointing this out

workflows/workflow/cp_process.smk Show resolved Hide resolved
"""

# get files to transfer
dataset_dir = pathlib.Path(f"./datasets/{testing_data_dir}").resolve(strict=True)
Member

Pytest is sometimes used to run the entirety of the "tests" directory without being located within the relative structure of sub-folders. That said, consider making sure this line (and possibly others) are able to run without changes or document the expectations of this test file here (or alternatively within the docstring for the module).

Member Author

Ah I see. Thanks for pointing this out. I double-checked and added some documentation regarding this!
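
One way to make the dataset path independent of where pytest is launched is to resolve it relative to the test module rather than the current working directory. This is only a sketch: resolve_dataset_dir is a hypothetical helper, and it assumes the datasets folder sits next to the test file, which may not match the repo layout.

```python
from pathlib import Path


def resolve_dataset_dir(test_file: str, dataset_name: str) -> Path:
    """Locate a dataset folder relative to the test module, not the cwd.

    Assumption: a `datasets` directory lives beside the test file.
    """
    test_dir = Path(test_file).resolve().parent
    # strict=True raises FileNotFoundError if the dataset folder is absent
    return (test_dir / "datasets" / dataset_name).resolve(strict=True)
```

Inside a test module this would be called as resolve_dataset_dir(__file__, testing_data_dir), so the same path is found whether pytest runs from the repo root or from the tests folder.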

Comment on lines +199 to +208
# change directory to tmpdir
os.chdir(tmp_path)

# execute CytoSnake
cmd = "cytosnake init -d *.sqlite -m metadata"
proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=False)
raised_error = get_raised_error(proc.stderr)

# leave testing dir
os.chdir(test_module)
Member

If possible, consider sending the subdirectory information directly to CytoSnake, avoiding the need to change directory both into and out of the tmpdir. This might be my own misunderstanding, so please feel free to ignore if this isn't possible or useful in the context of this test.

Alternatively, seeing how this pattern repeats, it may be useful to create a context manager for remembering to move back to a directory after changes have occurred. See this SO reference for one example.

Member Author

Thanks for the resource. This will be considered in #49
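
For later reference, the context manager idea from this thread can be sketched as follows. The name working_directory is hypothetical and this is not the code planned for #49. On Python 3.11 and later, contextlib.chdir from the standard library does the same job.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, then always change back."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # runs even if the body raises, so the test never leaves cwd changed
        os.chdir(previous)
```

A test would then wrap the subprocess call in `with working_directory(tmp_path):` instead of pairing two os.chdir calls by hand.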

Co-authored-by: Dave Bunten <ekgto445@gmail.com>
@axiomcura
Member Author

I have applied all the changes. Merging now. If more work needs to be done, please feel free to re-open this PR.

@axiomcura axiomcura merged commit 274665e into WayScience:main May 9, 2023
@axiomcura axiomcura deleted the add-barcode-logic branch May 16, 2023 12:27