-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample metadata to unfiltered objects #400
Add sample metadata to unfiltered objects #400
Conversation
Checks keep failing because of the following error, any ideas @jashapiro?
|
bin/generate_unfiltered_sce.R
Outdated
sample_metadata_list <- purrr::map(sample_ids, | ||
\(sample){ | ||
single_sample_df <- sample_metadata_df |> | ||
dplyr::filter(scpca_sample_id %in% sample) | ||
list( | ||
age = single_sample_df$age, | ||
sex = single_sample_df$sex, | ||
diagnosis = single_sample_df$sex, | ||
subdiagnosis = single_sample_df$subdiagnosis, | ||
tissue_location = single_sample_df$tissue_location, | ||
disease_timing = single_sample_df$disease_timing, | ||
organism = single_sample_df$organism, | ||
development_stage_ontology_term_id = single_sample_df$development_stage_ontology_term_id, | ||
sex_ontology_term_id = single_sample_df$sex_ontology_term_id, | ||
organism_ontology_id = single_sample_df$organism_ontology_id, | ||
self_reported_ethnicity_ontology_term_id = single_sample_df$self_reported_ethnicity_ontology_term_id, | ||
disease_ontology_term_id = single_sample_df$disease_ontology_term_id, | ||
tissue_ontology_term_id = single_sample_df$tissue_ontology_term_id | ||
) | ||
}) |> | ||
purrr::set_names(sample_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to just keep this as a data frame? Then we could just do:
sample_metadata <- sample_metadata_df |>
dplyr::filter(scpca_sample_id %in% sample_ids)
If you want to, you could add row names to make selection later a bit easier, or go all the way and turn it into a DataFrame
for consistency with other elements in SCE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't done this originally because I was worried about having mostly one row dataframes. But you're right that it might make things easier down the line rather than pulling out into a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the level of complexity for access either way is about the same, so we may as well make the creation easier! Yes, 1 row data frames are strange... but this makes combining them for merging SCE objects pretty easy and logical. (Should that addition be part of this PR? I know we are not using the merging yet, but we might as well make it work with the new data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 1 row data frames are strange... but this makes combining them for merging SCE objects pretty easy and logical. (Should that addition be part of this PR? I know we are not using the merging yet, but we might as well make it work with the new data)
What addition are you referring to? Merging SCE objects in the integration workflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"addition" is the wrong word, but yes, accounting for the new metadata in the integration workflow. But I see that is mostly a function in scpcaTools
, so we will need to account for it there. Which makes me wonder if the code for adding sample data table to metadata should be there as well? Probably add both in a single Issue/PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay right now I changed this to add in the data frame in this script. I'll file an issue regarding making sure the merging step accounts for the sample metadata. I don't know how I feel about adding in the sample metadata as part of a function in scpcaTools. Do you mean adding another argument to read_alevin
that is the sample_metadata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would just add a add_sample_metadata()
function that does much of the same work as here. The idea would really be just to make the location & format of this data canonical within the scpcaTools SCE objects.
modules/sce-processing.nf
Outdated
@@ -198,7 +200,7 @@ workflow generate_sce { | |||
sce_ch = quant_channel | |||
.map{it.toList() + [file(it[0].mito_file), file(it[0].ref_gtf)]} | |||
|
|||
make_unfiltered_sce(sce_ch) | |||
make_unfiltered_sce(sce_ch, params.sample_metafile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found the cause of the error you are seeing. When I looked locally I got an earlier error:
DEBUG nextflow.Session - Session aborted -- Cause: Not a valid path value: 'test/stub-sample-metadata.tsv'
which pointed me here:
make_unfiltered_sce(sce_ch, params.sample_metafile) | |
make_unfiltered_sce(sce_ch, file(params.sample_metafile)) |
Which is to say you need to pass in the file object, not just the string. Tested locally, and it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting I made this change in d824098 as well as a change to the GHA to make sure the logs always get uploaded for easier debugging next time!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
The `cat` step in GHA wasn't running, so logs weren't uploading on failure
…mple-metadata-to-unfiltered-sce
I updated this to reflect the changes in AlexsLemonade/scpcaTools#213. The one thing to note is that I had to change the column for sample id to reflect the requirements of the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but I think there are a few places I think we may want/need to make some adjustments.
The first is in a comment below about the way we pass the sample info to the sub-workflow. I think we can pass the file in directly as an argument to the workflow, rather than passing the param
down. I think you could do something like update the workflow in sce-processing.nf
as:
workflow generate_sce {
// generate rds files for RNA-only samples
take:
quant_channel, sample_metafile
main:
sce_ch = quant_channel
.map{it.toList() + [file(it[0].mito_file), file(it[0].ref_gtf)]}
make_unfiltered_sce(sce_ch, sample_metafile)
...
And then change main.nf
to include something more like:
rna_sce_ch = generate_sce(rna_quant_ch, file(params.sample_metafile))
Whether we do that or not, we do need to also update the generate_merged_sce
workflow to also incorporate the sample metadata.
main.nf
Outdated
@@ -36,7 +36,7 @@ include { map_quant_feature } from './modules/af-features.nf' addParams(cell_bar | |||
include { bulk_quant_rna } from './modules/bulk-salmon.nf' | |||
include { genetic_demux_vireo } from './modules/genetic-demux.nf' addParams(cell_barcodes: cell_barcodes, bulk_techs: bulk_techs) | |||
include { spaceranger_quant } from './modules/spaceranger.nf' | |||
include { generate_sce; generate_merged_sce; cellhash_demux_sce; genetic_demux_sce; post_process_sce} from './modules/sce-processing.nf' | |||
include { generate_sce; generate_merged_sce; cellhash_demux_sce; genetic_demux_sce; post_process_sce} from './modules/sce-processing.nf' addParams(sample_metafile: params.sample_metafile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this works, I don't love it. I think I would rather have the generate_sce
workflow take the sample_metafile
as a second input and pass things through that way.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
I updated the workflow accordingly to pass the
I also incorporated the sample metadata into the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I couldn't resist one more little tweak. I don't think it needs to be incorporated... up to you.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Just noting that the changes in here work, but cause an issue once we get to the conversion to AnnData. I'm planning on addressing #366 shortly so will file a PR stacked on this branch with those changes. Once everything is put back together and working we can merge it all at once. |
Co-authored-by: Stephanie <stephanie.spielman@gmail.com>
…a-to-coldata Sample metadata to cell-level metadata for AnnData
Closes #395
This PR adds sample level metadata to the unfiltered sce objects which in turn adds the metadata to all objects produced by the workflow. A sample metadata file is now an added param, similar to the run level metadata. This is then directly read into the script that generates the unfiltered SCE object.
After the sample metadata is read in and filtered to the specified sample, a list of all relevant sample-level metadata is created and added to the metadata of the SCE object. Generally, this is pretty straightforward, but I was concerned about this working for all types of libraries, including multiplexed libraries. I added a single
sample_metadata
list to the overall SCE metadata to deal with this. Within that list is a list of all samples, where each entry contains the metadata for a single sample. This means that to access the sample-level metadata for a sample, you would use:metadata(sce)$sample_metadata[[sample_id]]
.An alternative to this would be to collapse all the values for all samples in the library to a comma-separated list for each piece of metadata, e.g.,
age: 5,6,7,8
. The problem with that is it relies on the assumption that the order of the sample ids is the same as the order of the contents of the metadata list.I think I favor the first option slightly more because there will always be a named list with the metadata contents for every sample, regardless of library type. But if we are not concerned about the sample metadata for multiplexed libraries, then I would be more inclined to go with the second option. Very curious what others think here!