Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Addition of R script to compare CNV caller output #142

Merged
merged 20 commits into from
Oct 22, 2019

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Oct 2, 2019

Purpose/implementation

The purpose of this draft PR is to produce a copy number plot showing the recurrently amplified/deleted regions detected by CNVkit and Control-FREEC.

The initial plan was to implement the TAGCNA workflow, but I encountered some issues when attempting to format the data to fit their pipeline.

The GenVisR package was used as a temporary alternative to TAGCNA and the plots produced via this package include:
analyses/cnv-comparison/plots/compare_cnv_output_frequency.pdf
analyses/cnv-comparison/plots/compare_cnv_output_proportion.pdf

The ggplot2 package was used to produce the remaining plots found in the plots directory of this module.

Issue

This PR addresses issue #8

Directions for reviewers

Per [the comment made] on the relevant issue #8, I attempted to implement the TAGCNA workflow. However, it was unsuccessful due to the specificity of the input format it expects.

I am still looking into the best way to produce figures like those in the TAGCNA documentation, but felt it necessary to show what has been produced thus far.

  1. Given the purpose of this analysis, I chose to write it in a R script. I used the sample-distribution-analysis module as a guide in this sense. That being said, would anyone argue for the need of separate scripts and a shell script to sequentially execute said scripts?

  2. I also began to include optparse functions and decided to file a draft PR of the script as is so that I can receive a second opinion on whether or not command line options are necessary for this module (again, I used the sample-distribution-analysis module as a reference).

Results

This script produces five plots as follows:
analyses/cnv-comparison/plots/compare_cnv_output_barplot_aberration.pdf
analyses/cnv-comparison/plots/compare_cnv_output_barplot_histology.pdf
analyses/cnv-comparison/plots/compare_cnv_output_boxplot.pdf
analyses/cnv-comparison/plots/compare_cnv_output_frequency.pdf
analyses/cnv-comparison/plots/compare_cnv_output_proportion.pdf

Docker and continuous integration

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

PR Checklist

  • Run a linter (not done as yet)
  • Set the seed (if applicable)
  • Comments and/or documentation up to date
  • Double check your paths

cbethell and others added 2 commits October 2, 2019 09:11
- Plots were made using GenVisR and ggplot2
- Addition of custom functions script (sourced in plotting script)
@cbethell cbethell changed the title Addition of R script to compare CNV caller output In Progress: Addition of R script to compare CNV caller output Oct 2, 2019
- added a `read_in_cnv` function
- removed `set.seed()` as it seemed unnecessary
`set.seed` no longer seems to be necessary in this script
@cbethell cbethell marked this pull request as ready for review October 7, 2019 23:52
@cbethell cbethell changed the title In Progress: Addition of R script to compare CNV caller output Addition of R script to compare CNV caller output Oct 7, 2019
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Very-well documented so it was pretty easy to know what was going on. I just have a few suggestions for drying things up and changing your object naming to be more specific. Let me know if you have questions about any of my comments.

Also my take on your two questions:

  1. Given the purpose of this analysis, I chose to write it in a R script. I used the sample-distribution-analysis module as a guide in this sense. That being said, would anyone argue for the need of separate scripts and a shell script to sequentially execute said scripts?
  2. I also began to include optparse functions and decided to file a draft PR of the script as is so that I can receive a second opinion on whether or not command line options are necessary for this module (again, I used the sample-distribution-analysis module as a reference).

I could see either adding more arguments to your functions and then running this from a bash script with some optparse options OR, because this consists of so many plots, making 01-cnv-comparison-plotting into an R Notebook so the plots could be seen all in one spot. I'd say one or the other. We can see what @jashapiro thinks.

# Given the file path of the CNV data, read in the data.
#
# Arg:
# input_directory: file path of input directory
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to designate this separately than the file.path? Can you just make it one file.path to the data? I found this confusing for a second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I do see how this can be confusing. I am sure this was better suited for my original, more complicated script, but making it one file.path to the data does seem to make the most sense now.

# dataframe: the data frame containing the input data

# Read in cnv data
dataframe <-
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for sure CNV data, can you name this dataframe something that reflects that like cnv_df or etc?

read.table(gzfile(file.path(input_directory, file_path)), header = TRUE)

# Rename the columns
dataframe <-
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dataframe <-
colnames(dataframe) <- c("chrom", "loc.start", "loc.end", "ID", "num.mark", "seg.mean")

IDK if this is really all that better but at least you don't have to say dataframe twice. Could also use dplyr::rename.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better from a memory standpoint. No need to copy the data frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now realize that my comment above this line was misleading as this was meant to rearrange the columns, rather than simply rename them. That being said, the function colnames won't be feasible here. I am thinking of using dplyr::select as the replacement here unless either of you may have a better solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, do dplyr::select/rename

# a column with the values 0 and 1 to represent loss and gain, respectively.
#
# Args:
# dataframe: data.frame containing CNV caller output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to have the same comment here as above. dataframe doesn't tell us too much about what the object is supposed to be.

#
# Rscript analyses/cnv-comparison/util/cnv-comparison-functions.R

read_in_cnv <- function(input_directory, file_path){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this being its own function rather than just using these steps as is?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying it shouldn't be a function necessarily, just wanna know the motivation. Perhaps as I read the rest of this PR it will make more sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this its own function due to the repetitive nature of the original code I had reading in the original and subset CNV files. In other words, this was my attempt to DRY it up a bit (which I believe can now be achieved when I implement your suggestion of putting the datasets in a list and using lapply to execute this function and subsequent functions).

#
# Return:
# barplot: barplot depicting the proportion of aberrations detected across
# chromosomes, annoated by the broad_histology variable in the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# chromosomes, annoated by the broad_histology variable in the
# chromosomes, annotated by the `broad_histology` variable in the

metadata <-
readr::read_tsv(file.path(input_directory, "pbta-histologies.tsv"))

# Read in cnvkit data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since, you've made your functions, you could put all your datasets in a list and then call the function from there so you only have to say it once.

cnv_data <- lapply(<LIST_OF_FILE_PATHS>, read_in_cv)

Now for the rest of your functions, you would only have to run it in one line:

cnv_filtered <- lapply(cnv_data, filter_segmean)

Let me know if this makes sense, if not I can expound on this idea.

# Filter the controlfreec subset data for cutoff in preparation for plotting
controlfreec_subset <- filter_segmean(controlfreec_subset)

#### Plot frequency and proportion using GenVisR -------------------------------
Copy link
Collaborator

@cansavvy cansavvy Oct 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do the same lapply strategy here. But then twice, once for proportion and once for frequency.

# Run `plot_cnFreq` for cnvkit (has to be run with subset because
# original dataset is too large for function)
cnvkit_prop_plot <-
plot_cnFreq(cnvkit_format, "proportion", "CNVkit proportion")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are always planning on running both proportion and frequency, you could make an argument in your function that runs both automatically so you don't have to repeat yourself.


##### Plot barplots using ggplot2 ----------------------------------------------

# Run `plot_histology_barplot` on cnvkit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section could also be lapply-ed and made shorter.

@cbethell
Copy link
Contributor Author

@jashapiro I'll wait on your review before changing this script into a R notebook or running it from a bash script with optparse options. I have committed changes that address all of the other comments @cansav09 made.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Biggest comment is about fixing the order of the chromosomes by making that column a factor after reading. Just make sure you don't accidentally recode them!


# Rearrange the columns
cnv_df <- cnv_df %>%
dplyr::select("chrom", "loc.start", "loc.end", "ID", "num.mark", "seg.mean")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point (I think) you should also recode the chromosome into a factor, with logical order. You want the levels to be:
paste0("chr", c(1:22, "X", "Y"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will do.


}

plot_cn_freq <-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this function? IT really just calls the GenVisR function, so it seems redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also see your point here, however, the motivation behind making this a function was to keep the calling of functions in the 01-cnv-comparison-plotting.R script uniformed and the number of lines of code to a minimum.

That being said, would your recommendation here be to call the GenVisR function in the 01 script nonetheless?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be my rec, yes. You should still be able to use the lapply, though you may need to add named arguments, depending on the order that GenVisR::cnFreq expects the additional options.

cbethell and others added 2 commits October 17, 2019 11:51
- removed the `plot_cnfreq` function from the functions script and implemented `GenVisR::cnFreq` function in the `01` script
- reordered the chromosome levels
dplyr::select("chrom", "loc.start", "loc.end", "ID", "num.mark", "seg.mean")

# Reorder chromosome levels
levels(cnv_df$chrom) <- c(paste0("chr", c(1:22, "X", "Y")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above doesn't actually convert to a factor, but makes it a kind of strange hybrid character with a levels attribute:
dplyr::mutate(chrom = factor(chrom, levels = paste0("chr", c(1:22, "X", "Y"))
should work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the before and after plots, you can see that the order didn't actually change but the labels did. Which would be bad...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I thought it was strange when I initially tried to convert to a factor, what R claimed to already be a factor. Thanks, I'll give that a shot now.

Copy link
Member

@jashapiro jashapiro Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooh... add stringsAsFactors=FALSE to your read.table (or use readr...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also made that realization a minute ago. That'll also be in this upcoming commit 👍

- included `stringsAsFactors = FALSE` argument to `read.table` function

if (!("cowplot" %in% installed.packages())) {
install.packages("BiocManager")
BiocManager::install("cowplot")
Copy link
Member

@jashapiro jashapiro Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to install this from cran?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I cannot pinpoint my initial motivation here so I can change this to install from cran.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good now, though I am very much confused by the results: the two callers are kind of shockingly different, and I find it very strange that ControlFREEC is calling more on the smaller chromosomes. That result makes me somewhat suspicious about what is going on, and I would like to hear more about the data there.

There is one major change that I think would be good for the ggplot-based plots, and that is to combine the two data frames with an extra column designating whether the sample is from CNVkit vs ControlFREEC, then using faceting to split the two, rather than relying on cowplot. This would have the advantage of putting the two plots on the same scale. You can do this pretty easily with dplyr::bind_rows() using the .id option to specify the new column name.

I would also like to see the results for the histology plot separated by aberration type, especially since the two methods have such different proportions of gain/loss calls.

On what I think was one of the original questions, I do think this script might probably be better off converted to a notebook, to give some space to comment a bit on these results, rather than just producing the raw plots.

meta_joined <- filtered_cnv_df %>%
dplyr::inner_join(metadata, by = c("sample" = "Kids_First_Biospecimen_ID"))

# Create barplot where the y-axis represents the size of aberration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second look, I am pretty confused by these plots. What is actually being plotted here? I had thought it was the number of each type in each chromosome, but the stat = "identity" is confusing me. Would the default stat = "count" make more sense?

On some investigation, it looks like these ultimately produce the same plots as the ones that I was expecting, but if you leave off the y aesthetic, and use the default geom_bar() (no stat) you get a much smaller file size and better y axis labels.

# Create barplot where the y-axis represents the size of aberration
barplot <- ggplot2::ggplot(filtered_cnv_df,
ggplot2::aes(x = chromosome,
y = as.factor(aberration),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier comment about the y axis/stat

fill = broad_histology)) +
ggplot2::geom_bar(stat = "identity") +
ggplot2::theme_bw() +
ggplot2::theme(axis.text.y = ggplot2::element_blank())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add y axis label?

@jharenza
Copy link
Collaborator

This looks pretty good now, though I am very much confused by the results: the two callers are kind of shockingly different, and I find it very strange that ControlFREEC is calling more on the smaller chromosomes. That result makes me somewhat suspicious about what is going on, and I would like to hear more about the data there.

There is one major change that I think would be good for the ggplot-based plots, and that is to combine the two data frames with an extra column designating whether the sample is from CNVkit vs ControlFREEC, then using faceting to split the two, rather than relying on cowplot. This would have the advantage of putting the two plots on the same scale. You can do this pretty easily with dplyr::bind_rows() using the .id option to specify the new column name.

I would also like to see the results for the histology plot separated by aberration type, especially since the two methods have such different proportions of gain/loss calls.

On what I think was one of the original questions, I do think this script might probably be better off converted to a notebook, to give some space to comment a bit on these results, rather than just producing the raw plots.

@cbethell and @jashapiro - would love to see this in a notebook. Of note, we are still benchmarking (nearing the end) the CN algorithms (and have a caveat on this in the README). We had a few samples show genome-wide gain and after tweaking parameters and adding options for BAF adjustment, we have much better results. Of additional note, I thought I mentioned this, but forgive me if not, I think we would ideally use results from #128 for this ticket, though I think @cgreene and I thought this ticket could still be worked on to get the code ready for a later plug and chug. We have not created a PR for #128 yet because we realized we had to use different CN files from what we had released (containing predicted gain/loss info without the need to threshold LRR) in order to create consensus (these data will be added, hopefully in #146), but the early data is only showing ~50% consensus. We are now working on integrating Manta as a 3rd algorithm and doing a 2/3 consensus and want to toy with degree of overlap based on size of CNV, rather than 2/2. Long story short, I can help you review the results by histology if in a notebook based on what we know about certain diseases and CNV expectations!

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Oct 18, 2019

Of additional note, I thought I mentioned this, but forgive me if not, I think we would ideally use results from #128 for this ticket, though I think @cgreene and I thought this ticket could still be worked on to get the code ready for a later plug and chug.

From #8 (comment):

I imagine the output would still be a SEG file, such that you can start with any of the SEGs we provide.

This appears to have the functionality to read in files in that format: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/142/files#diff-fd9d53007cdfd11cf0ae23e3eede7eabR12

Long story short, I can help you review the results by histology if in a notebook based on what we know about certain diseases and CNV expectations!

What are you looking for in a notebook @jharenza? Do you want to see the plot here: cnv-comparison/plots/compare_cnv_output_barplot_histology.pdf or are you looking for additional analyses beyond:

I would also like to see the results for the histology plot separated by aberration type, especially since the two methods have such different proportions of gain/loss calls.

My inclination is to, once checked for correctness, get this merged and iterate on it. @jashapiro thoughts?

- combined data.frames and used faceting to plot ggplots
@jharenza
Copy link
Collaborator

@jaclyn-taroni ahh thanks for the link, i didn't readily see the plot - the notebook idea was just to see everything gathered together for quick visuals. My question for this analysis - is this more of a QC or do we want to turn this into a manuscript figure? If the latter, I think having this plotted in GISTIC-type format for the different histologies or as an IGV-type figure, grouped by histology, would be most effective for showing recapitulation of previous findings.

@cbethell
Copy link
Contributor Author

@jaclyn-taroni ahh thanks for the link, i didn't readily see the plot - the notebook idea was just to see everything gathered together for quick visuals. My question for this analysis - is this more of a QC or do we want to turn this into a manuscript figure? If the latter, I think having this plotted in GISTIC-type format for the different histologies or as an IGV-type figure, grouped by histology, would be most effective for showing recapitulation of previous findings.

@jharenza I can work on plotting this as an IGV-type figure. My initial attempt to apply GISTIC here was unsuccessful due to compatibility issues with my OS. However, I can give it a more thorough attempt or explore other avenues to produce this figure. I will mention you on this issue for further review once produced.

@jharenza
Copy link
Collaborator

OK - I can run GISTIC quickly on the v5 data and potentially add this to the data release, will send you the files so you can get started. I also noticed maftools updated some of their CN plotting abilities, so producing those per histology could be nice, too.

@cbethell
Copy link
Contributor Author

@jashapiro I addressed the comments you made above. Does this look ready to merge now? The changes involving GISTIC can be made and filed in a separate PR.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One little simplifying tweak. Changing to separate gain and loss really highlights how strange the control-Freec gain calls are.

Comment on lines 68 to 69
file.path(input_directory, "pbta-cnv-cnvkit.seg.gz"),
file.path(input_directory, "pbta-cnv-controlfreec.seg.gz")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name here to save work later.

Suggested change
file.path(input_directory, "pbta-cnv-cnvkit.seg.gz"),
file.path(input_directory, "pbta-cnv-controlfreec.seg.gz")
cnvkit = file.path(input_directory, "pbta-cnv-cnvkit.seg.gz"),
controlfreec = file.path(input_directory, "pbta-cnv-controlfreec.seg.gz")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use <- assignment though please 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not assignment, naming!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope I'm wrong!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎢

Comment on lines 93 to 94
combined_cnv_filtered$cnv_caller <- gsub("1", "cnvkit", combined_cnv_filtered$cnv_caller)
combined_cnv_filtered$cnv_caller <- gsub("2", "controlfreec", combined_cnv_filtered$cnv_caller)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the named list above, this should be correct automatically, so you should be able to delete these two lines.

fill = broad_histology)) +
ggplot2::geom_bar() +
ggplot2::theme_bw() +
ggplot2::facet_grid(cnv_caller~aberration) +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ggplot2::facet_grid(cnv_caller~aberration) +
ggplot2::facet_grid(cnv_caller ~ aberration) +

fill = aberration)) +
ggplot2::geom_bar() +
ggplot2::theme_bw() +
ggplot2::facet_grid(cnv_caller~aberration) +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ggplot2::facet_grid(cnv_caller~aberration) +
ggplot2::facet_grid(cnv_caller ~ aberration) +

- fixed spacing around `~` operator 
- fixed line spacing in `.Rmd` file
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! (Except for ControlFreec...)

@cbethell
Copy link
Contributor Author

Looks good! (Except for ControlFreec...)

Thank you and agreed!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants