Add sample metadata to unfiltered objects #400

allyhawkins · 2023-08-10T20:48:09Z

Closes #395

This PR adds sample level metadata to the unfiltered sce objects which in turn adds the metadata to all objects produced by the workflow. A sample metadata file is now an added param, similar to the run level metadata. This is then directly read into the script that generates the unfiltered SCE object.

After the sample metadata is read in and filtered to the specified sample, a list of all relevant sample-level metadata is created and added to the metadata of the SCE object. Generally, this is pretty straightforward, but I was concerned about this working for all types of libraries, including multiplexed libraries. I added a single sample_metadata list to the overall SCE metadata to deal with this. Within that list is a list of all samples, where each entry contains the metadata for a single sample. This means that to access the sample-level metadata for a sample, you would use: metadata(sce)$sample_metadata[[sample_id]].

An alternative to this would be to collapse all the values for all samples in the library to a comma-separated list for each piece of metadata, e.g., age: 5,6,7,8. The problem with that is it relies on the assumption that the order of the sample ids is the same as the order of the contents of the metadata list.

I think I favor the first option slightly more because there will always be a named list with the metadata contents for every sample, regardless of library type. But if we are not concerned about the sample metadata for multiplexed libraries, then I would be more inclined to go with the second option. Very curious what others think here!

This reverts commit b12ef46.

allyhawkins · 2023-08-10T21:22:43Z

Checks keep failing because of the following error, any ideas @jashapiro?

[d7/ad9d01] Submitted process > genetic_demux_vireo:starsolo_map:index_bam (1)
WARN: Killing pending tasks (2)
Join mismatch for the following entries: key=STUBL10 values=

jashapiro · 2023-08-10T21:45:35Z

bin/generate_unfiltered_sce.R

+sample_metadata_list <- purrr::map(sample_ids,
+                                   \(sample){
+                                     single_sample_df <- sample_metadata_df |>
+                                       dplyr::filter(scpca_sample_id %in% sample)
+                                     list(
+                                       age = single_sample_df$age,
+                                       sex = single_sample_df$sex,
+                                       diagnosis = single_sample_df$sex,
+                                       subdiagnosis = single_sample_df$subdiagnosis,
+                                       tissue_location = single_sample_df$tissue_location,
+                                       disease_timing = single_sample_df$disease_timing,
+                                       organism = single_sample_df$organism,
+                                       development_stage_ontology_term_id = single_sample_df$development_stage_ontology_term_id,
+                                       sex_ontology_term_id = single_sample_df$sex_ontology_term_id,
+                                       organism_ontology_id = single_sample_df$organism_ontology_id,
+                                       self_reported_ethnicity_ontology_term_id = single_sample_df$self_reported_ethnicity_ontology_term_id,
+                                       disease_ontology_term_id = single_sample_df$disease_ontology_term_id,
+                                       tissue_ontology_term_id = single_sample_df$tissue_ontology_term_id
+                                     )
+                                   }) |>
+  purrr::set_names(sample_ids)


Any reason not to just keep this as a data frame? Then we could just do:

sample_metadata <- sample_metadata_df |> dplyr::filter(scpca_sample_id %in% sample_ids)

If you want to, you could add row names to make selection later a bit easier, or go all the way and turn it into a DataFrame for consistency with other elements in SCE.

I hadn't done this originally because I was worried about having mostly one row dataframes. But you're right that it might make things easier down the line rather than pulling out into a list.

I think the level of complexity for access either way is about the same, so we may as well make the creation easier! Yes, 1 row data frames are strange... but this makes combining them for merging SCE objects pretty easy and logical. (Should that addition be part of this PR? I know we are not using the merging yet, but we might as well make it work with the new data)

Yes, 1 row data frames are strange... but this makes combining them for merging SCE objects pretty easy and logical. (Should that addition be part of this PR? I know we are not using the merging yet, but we might as well make it work with the new data)

What addition are you referring to? Merging SCE objects in the integration workflow?

"addition" is the wrong word, but yes, accounting for the new metadata in the integration workflow. But I see that is mostly a function in scpcaTools, so we will need to account for it there. Which makes me wonder if the code for adding sample data table to metadata should be there as well? Probably add both in a single Issue/PR.

Okay right now I changed this to add in the data frame in this script. I'll file an issue regarding making sure the merging step accounts for the sample metadata. I don't know how I feel about adding in the sample metadata as part of a function in scpcaTools. Do you mean adding another argument to read_alevin that is the sample_metadata?

I think I would just add a add_sample_metadata() function that does much of the same work as here. The idea would really be just to make the location & format of this data canonical within the scpcaTools SCE objects.

jashapiro · 2023-08-10T21:59:50Z

modules/sce-processing.nf

@@ -198,7 +200,7 @@ workflow generate_sce {
    sce_ch = quant_channel
      .map{it.toList() + [file(it[0].mito_file), file(it[0].ref_gtf)]}

-    make_unfiltered_sce(sce_ch)
+    make_unfiltered_sce(sce_ch, params.sample_metafile)


I think I found the cause of the error you are seeing. When I looked locally I got an earlier error:

DEBUG nextflow.Session - Session aborted -- Cause: Not a valid path value: 'test/stub-sample-metadata.tsv'

which pointed me here:

Suggested change

make_unfiltered_sce(sce_ch, params.sample_metafile)

make_unfiltered_sce(sce_ch, file(params.sample_metafile))

Which is to say you need to pass in the file object, not just the string. Tested locally, and it works.

Just noting I made this change in d824098 as well as a change to the GHA to make sure the logs always get uploaded for easier debugging next time!

The `cat` step in GHA wasn't running, so logs weren't uploading on failure

…mple-metadata-to-unfiltered-sce

allyhawkins · 2023-08-22T17:47:59Z

I updated this to reflect the changes in AlexsLemonade/scpcaTools#213. The one thing to note is that I had to change the column for sample id to reflect the requirements of the new add_sample_metadata() function. Do we want to keep the column renaming here? Or update AlexsLemonade/scpcaTools#213 to use scpca_sample_id as the column name @jashapiro?

jashapiro

This looks good, but I think there are a few places I think we may want/need to make some adjustments.

The first is in a comment below about the way we pass the sample info to the sub-workflow. I think we can pass the file in directly as an argument to the workflow, rather than passing the param down. I think you could do something like update the workflow in sce-processing.nf as:

workflow generate_sce {
  // generate rds files for RNA-only samples
  take: 
    quant_channel, sample_metafile
  main:
    sce_ch = quant_channel
      .map{it.toList() + [file(it[0].mito_file), file(it[0].ref_gtf)]}

    make_unfiltered_sce(sce_ch, sample_metafile)
...

And then change main.nf to include something more like:

rna_sce_ch = generate_sce(rna_quant_ch, file(params.sample_metafile))

Whether we do that or not, we do need to also update the generate_merged_sce workflow to also incorporate the sample metadata.

bin/generate_unfiltered_sce.R

jashapiro · 2023-08-22T18:13:15Z

main.nf

@@ -36,7 +36,7 @@ include { map_quant_feature } from './modules/af-features.nf' addParams(cell_bar
 include { bulk_quant_rna } from './modules/bulk-salmon.nf'
 include { genetic_demux_vireo } from './modules/genetic-demux.nf' addParams(cell_barcodes: cell_barcodes, bulk_techs: bulk_techs)
 include { spaceranger_quant } from './modules/spaceranger.nf'
-include { generate_sce; generate_merged_sce; cellhash_demux_sce; genetic_demux_sce; post_process_sce} from './modules/sce-processing.nf'
+include { generate_sce; generate_merged_sce; cellhash_demux_sce; genetic_demux_sce; post_process_sce} from './modules/sce-processing.nf' addParams(sample_metafile: params.sample_metafile)


While this works, I don't love it. I think I would rather have the generate_sce workflow take the sample_metafile as a second input and pass things through that way.

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins · 2023-08-22T20:38:02Z

While this works, I don't love it. I think I would rather have the generate_sce workflow take the sample_metafile as a second input and pass things through that way.

I updated the workflow accordingly to pass the sample_metafile as a second input to the generate_sce workflow.

Whether we do that or not, we do need to also update the generate_merged_sce workflow to also incorporate the sample metadata.

I also incorporated the sample metadata into the generate_merged_sce workflow.

jashapiro

LGTM, but I couldn't resist one more little tweak. I don't think it needs to be incorporated... up to you.

main.nf

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins · 2023-08-23T16:59:38Z

Just noting that the changes in here work, but cause an issue once we get to the conversion to AnnData. I'm planning on addressing #366 shortly so will file a PR stacked on this branch with those changes. Once everything is put back together and working we can merge it all at once.

Co-authored-by: Stephanie <stephanie.spielman@gmail.com>

…a-to-coldata Sample metadata to cell-level metadata for AnnData

…ered-sce

allyhawkins added 3 commits August 10, 2023 15:33

add sample metadata to unfiltered objects

c11e1a9

no collapsing

43a4526

update stub tests with sample metadata

23647a2

allyhawkins mentioned this pull request Aug 10, 2023

Document sample metadata as input #401

Closed

allyhawkins added 3 commits August 10, 2023 16:08

stubl12 and stubl13 to sample metadata

471faba

remove an extra 0

b12ef46

Revert "remove an extra 0"

3fd79ca

This reverts commit b12ef46.

allyhawkins requested a review from jashapiro August 10, 2023 21:22

jashapiro reviewed Aug 10, 2023

View reviewed changes

jashapiro and others added 2 commits August 11, 2023 10:07

Add file() and fix log uploads

d824098

The `cat` step in GHA wasn't running, so logs weren't uploading on failure

add sample metadata as dataframe

33d3912

allyhawkins mentioned this pull request Aug 11, 2023

Account for sample metadata in integration workflow #403

Closed

allyhawkins added 2 commits August 22, 2023 12:40

Merge remote-tracking branch 'origin/development' into allyhawkins/sa…

4ee8ff2

…mple-metadata-to-unfiltered-sce

use add_sample_metadata function

72f5aca

jashapiro reviewed Aug 22, 2023

View reviewed changes

allyhawkins and others added 2 commits August 22, 2023 14:54

Apply suggestions from code review

ba14443

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

remove passing param to workflow

ebf6ea4

allyhawkins requested a review from jashapiro August 22, 2023 20:38

jashapiro approved these changes Aug 22, 2023

View reviewed changes

main.nf Outdated Show resolved Hide resolved

main.nf Outdated Show resolved Hide resolved

main.nf Outdated Show resolved Hide resolved

Apply suggestions from code review

b1c2e5e

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins added 2 commits August 24, 2023 15:11

rename library id column in sample metadata

e43c8b9

incorporate sample metadata to coldata

22b6422

allyhawkins mentioned this pull request Aug 24, 2023

Sample metadata to cell-level metadata for AnnData #414

Merged

allyhawkins added 2 commits August 28, 2023 16:52

reformat sample metadata before adding it

e2d3137

account for cite seq and skip anndata creation for cell hashing

4e5e541

allyhawkins and others added 5 commits August 29, 2023 12:51

make sure library id is present in alt experiments

4bef14c

Apply suggestions from code review

8b18aab

Co-authored-by: Stephanie <stephanie.spielman@gmail.com>

check names of metadata

5d8b746

Merge pull request #414 from AlexsLemonade/allyhawkins/sample-metadat…

47224a1

…a-to-coldata Sample metadata to cell-level metadata for AnnData

Merge branch 'development' into allyhawkins/sample-metadata-to-unfilt…

213a837

…ered-sce

allyhawkins merged commit 5199e66 into development Aug 30, 2023
2 checks passed

allyhawkins deleted the allyhawkins/sample-metadata-to-unfiltered-sce branch August 30, 2023 15:51

allyhawkins mentioned this pull request Sep 19, 2023

Add sample metadata to unfiltered SCE objects #395

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add sample metadata to unfiltered objects #400

Add sample metadata to unfiltered objects #400

allyhawkins commented Aug 10, 2023

allyhawkins commented Aug 10, 2023

jashapiro Aug 10, 2023

allyhawkins Aug 11, 2023

jashapiro Aug 11, 2023

allyhawkins Aug 11, 2023

jashapiro Aug 11, 2023

allyhawkins Aug 11, 2023

jashapiro Aug 11, 2023

jashapiro Aug 10, 2023 •

edited

Loading

jashapiro Aug 11, 2023 •

edited

Loading

allyhawkins Aug 11, 2023

allyhawkins commented Aug 22, 2023 •

edited

Loading

jashapiro left a comment

jashapiro Aug 22, 2023

allyhawkins commented Aug 22, 2023

jashapiro left a comment

allyhawkins commented Aug 23, 2023

	make_unfiltered_sce(sce_ch, params.sample_metafile)
	make_unfiltered_sce(sce_ch, file(params.sample_metafile))

Add sample metadata to unfiltered objects #400

Add sample metadata to unfiltered objects #400

Conversation

allyhawkins commented Aug 10, 2023

allyhawkins commented Aug 10, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro Aug 10, 2023 • edited Loading

Choose a reason for hiding this comment

jashapiro Aug 11, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

allyhawkins commented Aug 22, 2023 • edited Loading

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

allyhawkins commented Aug 22, 2023

jashapiro left a comment

Choose a reason for hiding this comment

allyhawkins commented Aug 23, 2023

jashapiro Aug 10, 2023 •

edited

Loading

jashapiro Aug 11, 2023 •

edited

Loading

allyhawkins commented Aug 22, 2023 •

edited

Loading