Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample metadata to unfiltered objects #400

Merged
merged 22 commits into from
Aug 30, 2023

Conversation

allyhawkins
Copy link
Member

Closes #395

This PR adds sample level metadata to the unfiltered sce objects which in turn adds the metadata to all objects produced by the workflow. A sample metadata file is now an added param, similar to the run level metadata. This is then directly read into the script that generates the unfiltered SCE object.

After the sample metadata is read in and filtered to the specified sample, a list of all relevant sample-level metadata is created and added to the metadata of the SCE object. Generally, this is pretty straightforward, but I was concerned about this working for all types of libraries, including multiplexed libraries. I added a single sample_metadata list to the overall SCE metadata to deal with this. Within that list is a list of all samples, where each entry contains the metadata for a single sample. This means that to access the sample-level metadata for a sample, you would use: metadata(sce)$sample_metadata[[sample_id]].

An alternative to this would be to collapse all the values for all samples in the library to a comma-separated list for each piece of metadata, e.g., age: 5,6,7,8. The problem with that is it relies on the assumption that the order of the sample ids is the same as the order of the contents of the metadata list.

I think I favor the first option slightly more because there will always be a named list with the metadata contents for every sample, regardless of library type. But if we are not concerned about the sample metadata for multiplexed libraries, then I would be more inclined to go with the second option. Very curious what others think here!

@allyhawkins
Copy link
Member Author

Checks keep failing because of the following error, any ideas @jashapiro?

[d7/ad9d01] Submitted process > genetic_demux_vireo:starsolo_map:index_bam (1)
WARN: Killing pending tasks (2)
Join mismatch for the following entries: key=STUBL10 values=

Comment on lines 144 to 164
sample_metadata_list <- purrr::map(sample_ids,
\(sample){
single_sample_df <- sample_metadata_df |>
dplyr::filter(scpca_sample_id %in% sample)
list(
age = single_sample_df$age,
sex = single_sample_df$sex,
diagnosis = single_sample_df$sex,
subdiagnosis = single_sample_df$subdiagnosis,
tissue_location = single_sample_df$tissue_location,
disease_timing = single_sample_df$disease_timing,
organism = single_sample_df$organism,
development_stage_ontology_term_id = single_sample_df$development_stage_ontology_term_id,
sex_ontology_term_id = single_sample_df$sex_ontology_term_id,
organism_ontology_id = single_sample_df$organism_ontology_id,
self_reported_ethnicity_ontology_term_id = single_sample_df$self_reported_ethnicity_ontology_term_id,
disease_ontology_term_id = single_sample_df$disease_ontology_term_id,
tissue_ontology_term_id = single_sample_df$tissue_ontology_term_id
)
}) |>
purrr::set_names(sample_ids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to just keep this as a data frame? Then we could just do:

sample_metadata <- sample_metadata_df  |>
  dplyr::filter(scpca_sample_id %in% sample_ids)

If you want to, you could add row names to make selection later a bit easier, or go all the way and turn it into a DataFrame for consistency with other elements in SCE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't done this originally because I was worried about having mostly one row dataframes. But you're right that it might make things easier down the line rather than pulling out into a list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the level of complexity for access either way is about the same, so we may as well make the creation easier! Yes, 1 row data frames are strange... but this makes combining them for merging SCE objects pretty easy and logical. (Should that addition be part of this PR? I know we are not using the merging yet, but we might as well make it work with the new data)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 1 row data frames are strange... but this makes combining them for merging SCE objects pretty easy and logical. (Should that addition be part of this PR? I know we are not using the merging yet, but we might as well make it work with the new data)

What addition are you referring to? Merging SCE objects in the integration workflow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"addition" is the wrong word, but yes, accounting for the new metadata in the integration workflow. But I see that is mostly a function in scpcaTools, so we will need to account for it there. Which makes me wonder if the code for adding sample data table to metadata should be there as well? Probably add both in a single Issue/PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay right now I changed this to add in the data frame in this script. I'll file an issue regarding making sure the merging step accounts for the sample metadata. I don't know how I feel about adding in the sample metadata as part of a function in scpcaTools. Do you mean adding another argument to read_alevin that is the sample_metadata?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would just add a add_sample_metadata() function that does much of the same work as here. The idea would really be just to make the location & format of this data canonical within the scpcaTools SCE objects.

@@ -198,7 +200,7 @@ workflow generate_sce {
sce_ch = quant_channel
.map{it.toList() + [file(it[0].mito_file), file(it[0].ref_gtf)]}

make_unfiltered_sce(sce_ch)
make_unfiltered_sce(sce_ch, params.sample_metafile)
Copy link
Member

@jashapiro jashapiro Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found the cause of the error you are seeing. When I looked locally I got an earlier error:

DEBUG nextflow.Session - Session aborted -- Cause: Not a valid path value: 'test/stub-sample-metadata.tsv'

which pointed me here:

Suggested change
make_unfiltered_sce(sce_ch, params.sample_metafile)
make_unfiltered_sce(sce_ch, file(params.sample_metafile))

Which is to say you need to pass in the file object, not just the string. Tested locally, and it works.

Copy link
Member

@jashapiro jashapiro Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting I made this change in d824098 as well as a change to the GHA to make sure the logs always get uploaded for easier debugging next time!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

jashapiro and others added 2 commits August 11, 2023 10:07
The `cat` step in GHA  wasn't running, so logs weren't uploading on failure
@allyhawkins
Copy link
Member Author

allyhawkins commented Aug 22, 2023

I updated this to reflect the changes in AlexsLemonade/scpcaTools#213. The one thing to note is that I had to change the column for sample id to reflect the requirements of the new add_sample_metadata() function. Do we want to keep the column renaming here? Or update AlexsLemonade/scpcaTools#213 to use scpca_sample_id as the column name @jashapiro?

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I think there are a few places I think we may want/need to make some adjustments.

The first is in a comment below about the way we pass the sample info to the sub-workflow. I think we can pass the file in directly as an argument to the workflow, rather than passing the param down. I think you could do something like update the workflow in sce-processing.nf as:

workflow generate_sce {
  // generate rds files for RNA-only samples
  take: 
    quant_channel, sample_metafile
  main:
    sce_ch = quant_channel
      .map{it.toList() + [file(it[0].mito_file), file(it[0].ref_gtf)]}

    make_unfiltered_sce(sce_ch, sample_metafile)
...

And then change main.nf to include something more like:

rna_sce_ch = generate_sce(rna_quant_ch, file(params.sample_metafile))

Whether we do that or not, we do need to also update the generate_merged_sce workflow to also incorporate the sample metadata.

bin/generate_unfiltered_sce.R Outdated Show resolved Hide resolved
main.nf Outdated
@@ -36,7 +36,7 @@ include { map_quant_feature } from './modules/af-features.nf' addParams(cell_bar
include { bulk_quant_rna } from './modules/bulk-salmon.nf'
include { genetic_demux_vireo } from './modules/genetic-demux.nf' addParams(cell_barcodes: cell_barcodes, bulk_techs: bulk_techs)
include { spaceranger_quant } from './modules/spaceranger.nf'
include { generate_sce; generate_merged_sce; cellhash_demux_sce; genetic_demux_sce; post_process_sce} from './modules/sce-processing.nf'
include { generate_sce; generate_merged_sce; cellhash_demux_sce; genetic_demux_sce; post_process_sce} from './modules/sce-processing.nf' addParams(sample_metafile: params.sample_metafile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this works, I don't love it. I think I would rather have the generate_sce workflow take the sample_metafile as a second input and pass things through that way.

allyhawkins and others added 2 commits August 22, 2023 14:54
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins
Copy link
Member Author

While this works, I don't love it. I think I would rather have the generate_sce workflow take the sample_metafile as a second input and pass things through that way.

I updated the workflow accordingly to pass the sample_metafile as a second input to the generate_sce workflow.

Whether we do that or not, we do need to also update the generate_merged_sce workflow to also incorporate the sample metadata.

I also incorporated the sample metadata into the generate_merged_sce workflow.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I couldn't resist one more little tweak. I don't think it needs to be incorporated... up to you.

main.nf Outdated Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins
Copy link
Member Author

Just noting that the changes in here work, but cause an issue once we get to the conversion to AnnData. I'm planning on addressing #366 shortly so will file a PR stacked on this branch with those changes. Once everything is put back together and working we can merge it all at once.

@allyhawkins allyhawkins merged commit 5199e66 into development Aug 30, 2023
2 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/sample-metadata-to-unfiltered-sce branch August 30, 2023 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants