Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Update CNV segment to gene mapping: support both formats, use GTF, etc. #253

Merged
merged 14 commits into from
Nov 9, 2019

Conversation

jaclyn-taroni
Copy link
Member

@jaclyn-taroni jaclyn-taroni commented Nov 8, 2019

Purpose/implementation

Here I'm making a number of updates to the focal CN file preparation:

  • Adding a notebook that includes ploidy and status information in the CNVkit file ( analyses/focal-cn-file-preparation/00-add-ploidy-cnvkit.Rmd)
  • Updating analyses/focal-cn-file-preparation/01-prepare-cn-file.R
    • Use either ControlFreeC output from the data download or the CNVkit output from the above notebook
    • Use the GTF annotation file to make a GenomicRanges object
    • Merge using exons rather than genes
    • Map from Ensembl gene identifiers (result of exons change) to gene symbols and cytobands

Issue

Related to #186 and to #217

Directions for reviewers

  • The sections for finding overlaps and mapping to different identifiers of the updated 01-prepare-cn-file.R should be looked at carefully.
  • It's possible the mapping to gene symbols should use the information in the GTF, rather than the org.Hs.eg.db package, but this was very easy to implement. Thoughts?
  • Are we okay with the new tabular format (see below)?

Results

No results yet, but I did make some scientific decisions worth noting here regarding defining gain and loss broadly in the CNVkit output : a) if copy number is less than ploidy, something is marked as a loss b) if copy number is greater than ploidy, something is marked as gain

In analyses/focal-cn-file-preparation/01-prepare-cn-file.R, I change any gain where copy number is greater than (ploidy * 2) to be labeled as an amplification.

The output file format is now:

biospecimen_id	status	copy_number	ploidy	gene_symbol	cytoband	ensembl
BS_007JTNB8	gain	3	2	DDX11L1	1p36.33	ENSG00000223972
BS_007JTNB8	gain	3	2	SH2D5	1p36.12	ENSG00000189410
BS_007JTNB8	gain	3	2	CSF3R	1p34.3	ENSG00000119535
BS_007JTNB8	gain	3	2	CC2D1B	1p32.3	ENSG00000154222
BS_007JTNB8	gain	3	2	FCGR1CP	1q21.1	ENSG00000265531

which is a bit different from what was discussed on #186

Docker and continuous integration

This has already been addressed in earlier pull requests.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one question about whether we need to filter chromosomes with the change to exons and a documentation note.
(Also, note that I updated the PR to fix gain/loss status info there)

Comment on lines 168 to 169
chroms <- paste0("chr", 1:22)
chrom_filter <- list(tx_chrom = chroms)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With GenomicFeatures::exons(), this should not be necessary, as no exon is mapped to multiple/alternate chromosomes. I don't know if any of the calls fall on non-canonical chromosomes, but we might not want to exclude them at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was it might be more efficient if we drop anything outside this filter, but I have no evidence whatsoever to suggest that I am right about that. I will take it out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - this step in CI takes much longer (granted I made other changes), but I'm thinking of implementing a filter for CI only per https://github.com/AlexsLemonade/OpenPBTA-analysis#passing-variables-only-in-ci.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close the loop - that wasn't the issue and I was looking at the wrong branch 🙃 I will leave in the filtering changes though

@jaclyn-taroni
Copy link
Member Author

New format is:

biospecimen_id	status	copy_number	ploidy	ensembl	gene_symbol	cytoband
BS_007JTNB8	gain	3	2	ENSG00000223972	DDX11L1	1p36.33
BS_007JTNB8	gain	3	2	ENSG00000189410	SH2D5	1p36.12
BS_007JTNB8	gain	3	2	ENSG00000119535	CSF3R	1p34.3
BS_007JTNB8	gain	3	2	ENSG00000154222	CC2D1B	1p32.3
BS_007JTNB8	gain	3	2	ENSG00000265531	FCGR1CP	1q21.1

@jaclyn-taroni jaclyn-taroni merged commit 713d2b8 into AlexsLemonade:master Nov 9, 2019
@jaclyn-taroni jaclyn-taroni deleted the 186-both-formats branch November 9, 2019 20:44
@cbethell cbethell mentioned this pull request Nov 14, 2019
8 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants