This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: PCAWG WGS Brain samples to run through SNV caller pipeline #551

Closed
cansavvy opened this issue Feb 20, 2020 · 12 comments
Labels
proposed analysis snv Related to or requires SNV data

Comments

@cansavvy
Collaborator

cansavvy commented Feb 20, 2020

What are the scientific goals of the analysis?

Following Grobner et al, 2019 we want to compare tumor mutation burden in our pediatric cohort with adult brain tumors.

This is a continuation of the goals of #257 and #481, which were originally to be pursued with TCGA data. However, upon running the TCGA data through the pipelines, we encountered problems that we believe may be due to its dated WXS target regions, short reads, or shallower read depth. These results are documented in two draft PRs: #548 and #521

Here's a summary report:
TCGAvsPBTAconsensus.pdf

What methods do you plan to use to accomplish the scientific goals?

In our video chat meeting, we discussed switching the adult brain tumor comparison data to the recently published PCAWG data.
This data has WGS samples, and is much more recent, which we hope will minimize the liftover and target region comparison issues we've been having between PBTA and TCGA data.

What input data are required for this analysis?

I'm posting this TSV file with the list of files that I believe we will want for this analysis:
pcawg_brain_wgs_samples.tsv.zip

I believe we would want the BAM files listed in this file to be run through Lancet, Strelka2, and Mutect2 in the same manner as the PBTA data.

How I obtained this file list:

This data is on ICGC's repositories
I searched for all WGS, PCAWG study, brain samples that have BAM files for both blood and solid primary tumor

Query used on the ICGC data portal to get this (portal query syntax, not standard SQL):

select(*),in(file.experimentalStrategy,'WGS'),in(file.fileFormat,'BAM'),in(file.primarySite,'Brain'),in(file.specimenType,'Primary tumour - solid tissue','Normal - blood derived'),in(file.donorStudy,'PCAWG'),in(file.id,'ES:fd7d16a5-c002-4b68-985c-b44b548a732e'),sort(-ssmAffectedGenes)

This link will also get you to this list: https://icgc.org/4ov
I exported this table as TSV and then removed the mini bam files.
These mini files appear to be copies of the regular-size BAMs.
I filtered those out with:

grep -vwE "mini" repository_1582233032.tsv > pcawg_brain_wgs_samples.tsv
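
For anyone who would rather do this filtering step in a script, here is a minimal Python sketch of the same idea: drop every line that contains "mini" as a whole word. The example rows and column names are hypothetical stand-ins, not the real contents of the exported TSV.

```python
import re

# Matches "mini" only as a whole word, similar in spirit to grep's -w flag.
MINI_WORD = re.compile(r"\bmini\b")


def drop_mini_lines(lines):
    """Return only the lines that do not contain the whole word 'mini'."""
    return [line for line in lines if not MINI_WORD.search(line)]


# Hypothetical rows standing in for the exported repository TSV.
example = [
    "file_name\tformat",
    "donor1.tumor.bam\tBAM",
    "donor1.tumor.mini.bam\tBAM",
]
kept = drop_mini_lines(example)
```

As with `grep -w`, a name such as `sample_mini.bam` would not be dropped, because an underscore counts as a word character.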

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

Whoever is going to be running the samples through the caller should probably answer this question.

Who will complete the analysis (please add a GitHub handle here if relevant)?

??

What relevant scientific literature relates to this analysis?

Grobner et al, 2019
PCAWG 2020 paper

@cansavvy cansavvy added proposed analysis snv Related to or requires SNV data labels Feb 20, 2020
@cansavvy
Collaborator Author

cansavvy commented Feb 20, 2020

@jharenza
Collaborator

@yuankunzhu

@yuankunzhu
Collaborator

yuankunzhu commented Feb 23, 2020

@cansavvy @jharenza, the requested query contains data hosted on both EGA and PDC. For now, we only have access to PDC, which holds 110 BAMs from 60 donors (do we know why the total number is not 120, btw?). We can start downloading and looking at those first, and we probably need someone to submit the EGA request. Do we know those subjects' ages at diagnosis/sequencing? We might want to exclude any pediatric samples from the adult TMB calculation.
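
One way to look into the 110-BAMs-from-60-donors question is to list the donors that do not have both a tumor and a normal BAM in the accessible set. Below is a rough Python sketch of that check. The column names (`ICGC Donor`, `Specimen Type`) and the donor IDs are assumptions made for illustration and would need to be matched to the actual TSV header; the two specimen type strings are taken from the query above.

```python
from collections import defaultdict

# Specimen types as they appear in the portal query.
TUMOR = "Primary tumour - solid tissue"
NORMAL = "Normal - blood derived"


def donors_missing_pair(rows, donor_col="ICGC Donor", type_col="Specimen Type"):
    """Return sorted donor IDs that lack a tumor BAM, a normal BAM, or both.

    `rows` is an iterable of dicts, e.g. from csv.DictReader(f, delimiter="\t").
    The default column names are assumptions, not confirmed from the export.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[donor_col]].add(row[type_col])
    return sorted(d for d, types in seen.items() if not {TUMOR, NORMAL} <= types)


# Hypothetical example: DO1 has a complete pair, DO2 only has a tumor BAM.
rows = [
    {"ICGC Donor": "DO1", "Specimen Type": TUMOR},
    {"ICGC Donor": "DO1", "Specimen Type": NORMAL},
    {"ICGC Donor": "DO2", "Specimen Type": TUMOR},
]
unpaired = donors_missing_pair(rows)
```

Any donors this returns would be candidates for having their other BAM hosted elsewhere or missing from the export, which would need to be confirmed against the portal.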

Also, on the other hand, we had previously downloaded and processed 100 WGS BAMs (50 T/N pairs) 84 PDC hosted WGS BAMs (42 T/N pairs) from ICGC-PCAWG with the query at https://icgc.org/ZFF. Three subjects overlap between the requested list and ours. @tkoganti is looking at those data's VAF and SNV classes.
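
The overlap between the two donor lists can be recomputed with a simple set intersection. This is only a sketch; the donor IDs below are made up, and in practice they would be read from the two exported file lists.

```python
def overlapping_donors(requested, previous):
    """Return the sorted donor IDs present in both lists."""
    return sorted(set(requested) & set(previous))


# Hypothetical donor IDs for illustration only.
requested_donors = ["DO1", "DO2", "DO3"]
previous_donors = ["DO3", "DO4", "DO1"]
shared = overlapping_donors(requested_donors, previous_donors)
```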

@jharenza
Collaborator

@yuankunzhu do you have the breakdown of cancer types for the 110 BAMs from 60 donors?

@cansavvy @jaclyn-taroni @cgreene - I can make a request for this data, but I am currently held up with our contracts office in approving an ICGC DACO for another project and don't have a clear idea of how long this will take. It looks like no one at CHOP has ICGC access, and the office told me they wanted to make some agreement modifications. In the meantime, should we just plan to use Mutect2/Strelka2 for these comparisons with the TCGA data we have access to, and/or add more samples from TCGA if we do not get a good cohort of brain samples from PCAWG?

@yuankunzhu
Collaborator

@jharenza I can't find the detailed cancer types for those samples. The only thing I can find from the query is their originating projects. Looks like they include TCGA-LGG and GBM?

@jharenza
Collaborator

jharenza commented Mar 4, 2020

As an update on this, I am still working with CHOP legal to get this access request documentation approved before I can go back to ICGC to submit the final application. I should know more Thursday.

@yuankunzhu - did you mention that we lost data access to these files?

@yuankunzhu
Collaborator

yuankunzhu commented Mar 4, 2020

@jharenza, we still have those data in the bucket. We just need the DevOps team to renew our S3 access credentials so that we can access them on Cavatica.

@jvlilly

jvlilly commented Mar 5, 2020

@stefankies can you work with allison on this^^

@yuankunzhu
Collaborator

@jharenza @cansavvy @jvlilly @tkoganti, quick update on this: we got the data bucket access renewed and mounted it on Cavatica, so it is ready.

@jharenza
Collaborator

@yuankunzhu - were you able to process any of this data? In the meantime, CHOP legal was working on this agreement as of 5/19. I just sent a follow-up.

@jharenza
Collaborator

jharenza commented Jul 7, 2020

As an update, CHOP has approved this agreement and it was sent to ICGC on July 3 for final approval. They will respond within 15 business days.

@jharenza
Collaborator

jharenza commented Apr 1, 2021

closing, as we still have not gotten access to these data

@jharenza jharenza closed this as completed Apr 1, 2021