-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add barcode logic to CytoSnake's CLI #46
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I left a few comments and overall thought things looked good! Please don't hesitate to let me know if you have any questions.
One additional thought:
- I love the activity diagram you provided in the PR description! In reading it I wondered: why are barcodes only required for multiple experiments (and not for single experiments)? For example, if two experiments need to be compared but they are processed individually, would we run into issues when attempting to compare things later on?
- Similarly: are there ever scenarios where we don't have the barcode file but need to run analyses on the experiments? Here especially I'm thinking about previously gathered data where one may no longer have access to all data, or perhaps the data is stored in an unrecognizable format. In these scenarios could you simulate the barcode file's data (providing notation that it's simulated) to help facilitate the work involved with this PR?
compression_options: | ||
method: "gzip" | ||
mtime: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing how patterns in the configs occur (like compression_options
), consider making use of global variable references (if possible) within these files. These could be referenced within the individual fields to ensure consistency and reduce maintenance in the instances where change is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. It seems like you can assign variables within the configurations and allow re-using: (learned something new today)
Here's an example below:
# test.yaml file
compression_options: &DEFAULT_COMPRESSION
method: "gzip"
mtime: 1
first_value: *DEFAULT_COMPRESSION
here's the code that is used to read the test yaml file:
import yaml
with open("./test.yaml", mode="r") as f:
data = yaml.safe_load(f)
print(data)
And here's the output:
{'compression_options': {'method': 'gzip', 'mtime': 1},
'first_value': {'method': 'gzip', 'mtime': 1}}
So it seems like you can but, however, I could see some issues with this in terms of readability. Maybe I am naïve, but I have not seen a config files that uses global variables within them. This begs the question if using config variables is a common trend? For example, do we expect the majority of users to know what these variables are and how it is used within the config file? (I do not know the right answer to this, maybe you, @MattsonCam and/or @gwaybio? Might know this)
I can see this working perfectly in the private configs that exist within the .cytosnake
directory for development purpose, but I am quite "iffy" about how users will react to variables in config files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great findings! One way you could consider addressing understandability here would be with comments (for ex: # this line creates a global config variable used later as *VARIABLE_NAME
). I understand reasons why you might consider not using this method and defer to you on what's best.
cytosnake/guards/input_guards.py
Outdated
if not is_barcode_required: | ||
BarcodeRequiredError("Barcode is required, multiple platemaps found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checking: is this block checking whether the barcode file is required or whether it is missing? If it's checking for whether it's missing, consider using "missing" (or similar) in the variable and object names for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is checking if the barcode is missing.
Are you suggesting something like this?:
if is_missing_barcode:
BarcodeRequiredError("Barcode is missing")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done some changes with the naming, let me know if it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarification! In addition to the message, I'm wondering if the exception name itself also should reflect what is "exceptional" or "erroneous". For example, BarcodeMissingError
or similar. (this might require a change to the exception itself).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point. I'll apply those changes.
Co-authored-by: Dave Bunten <ekgto445@gmail.com>
Co-authored-by: Dave Bunten <ekgto445@gmail.com>
Co-authored-by: Dave Bunten <ekgto445@gmail.com>
@d33bs Hopefully I have attended all your comments. Also, thanks for the great questions!
I will be used this repo to explain my current understanding. From what I understand, the barcode file provides an Technically, there is no need to have a barcode for a single experiment because it will contain the same external factors among all plates (assuming that more than one plate was used in the experiment). The only time when barcodes are required is if 3 separate experiments were conducted on multiple plates. Therefore, the barcode will help find which plates have been involved with which experiment, thus mapping the correct metadata to those plates when conducting downstream analysis.
However, one needs to map the correct assay with the associated platemap, which
The barcodes only provides information that distinguishes which plates came from which experiment. Assuming that the data you are talking about came from 3 separate experiments and no barcodes were provided. A potential solution is that we contact the person who generated this dataset and asks which plates came from which experiment. However, with the scenario, if no plate maps were provided, then we will not know what types of external treatments were added to the cell and which experiments contained the types of treatments/cell lines used. Therefore, it will be difficult to simulate due to the lack of important metadata data like treatments, well positions, and cell lines used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Excited to see the testing additions! I left a few comments and suggestions throughout this review, please don't hesitate to let me know if you have a question about anything.
Also very much appreciate your answers to the barcode details! Your response seems like great documentation material potentially - would it make sense to store that information for later reference in this project?
compression_options: | ||
method: "gzip" | ||
mtime: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great findings! One way you could consider addressing understandability here would be with comments (for ex: # this line creates a global config variable used later as *VARIABLE_NAME
). I understand reasons why you might consider not using this method and defer to you on what's best.
"""Used to clean up directories in every single test run""" | ||
|
||
def __init__(self, tmp_path): | ||
self.tmp_path = tmp_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the use of tempfile
here to assist with the creation of temporary files or directories.
@@ -62,7 +62,7 @@ def annotate_cells( | |||
barcode_platemap_df = pd.read_csv(barcodes_path) | |||
|
|||
logging.info("Searching plate map name") | |||
plate = Path(aggregated_data).name.split("_")[0] | |||
Path(aggregated_data).name.split("_")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checking: what will this line do - is it still necessary? If not, consider removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this was a mistake! thanks for pointing this out
""" | ||
|
||
# get files to transfer | ||
dataset_dir = pathlib.Path(f"./datasets/{testing_data_dir}").resolve(strict=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pytest is sometimes used to run the entirety of the "tests" directory without being located within the relative structure of sub-folders. That said, consider making sure this line (and possibly others) are able to run without changes or document the expectations of this test file here (or alternatively within the docstring for the module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Thanks for pointing this out. I doubled checked and added some documentation in regards to this!
# change directory to tmpdir | ||
os.chdir(tmp_path) | ||
|
||
# execute CytoSnake | ||
cmd = "cytosnake init -d *.sqlite -m metadata" | ||
proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=False) | ||
raised_error = get_raised_error(proc.stderr) | ||
|
||
# leave testing dir | ||
os.chdir(test_module) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, consider sending the subdirectory information directly to CytoSnake, avoiding the need to change directory both into and out of the tmpdir. This might be my own misunderstanding, so please feel free to ignore if this isn't possible or useful in the context of this test.
Alternatively, seeing how this pattern repeats, it may be useful to create a context manager for remembering to move back to a directory after changes have occurred. See this SO reference for one example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the resource. This will be considered in #49
cytosnake/guards/input_guards.py
Outdated
if not is_barcode_required: | ||
BarcodeRequiredError("Barcode is required, multiple platemaps found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarification! In addition to the message, I'm wondering if the exception name itself also should reflect what is "exceptional" or "erroneous". For example, BarcodeMissingError
or similar. (this might require a change to the exception itself).
Co-authored-by: Dave Bunten <ekgto445@gmail.com>
I have applied all the changes. Merging now. If there is more work need to be done, please feel free to re-open this PR. |
About this PR
This PR adds
CytoSnake
to have logic when handling CLI user based inputs. Specifically, this update introducesbarcode logic
,By default, a barcode is not required as an input to run
CytoSnake
; however, there are some exceptions when barcodes are needed.If a user provides a dataset that has been generated from multiple experiments, then multiple plate maps are associated with the generated data. This will require a barcode file in order for
CytoSnake
to know which plate dataset is associated with which experiment.What's new?
input_guard.py
was created. This will handle all the CLI logiccheck_init_parameter_inputs()
where it takes user based parameters and checks for discrepancies.Implementation
What dictates the barcode logic is the number of plate maps found within the
metadata
data folder. If CytoSnake see's that there is more than 1 plate map, then it requires a barcode. Therefore, users must provide barcodes if multiple plate maps are present.additional Notes
Changes in workflow
Update on workflows:
cp_process_singlecells
! #37, where workflow configs were created forcp_process_singlecells
; however, workflow config was not created forcp_process
hence this PR contains a new workflow config forcp_process