-
Notifications
You must be signed in to change notification settings - Fork 67
Addition of R script to compare CNV caller output #142
Addition of R script to compare CNV caller output #142
Conversation
- Plots were made using GenVisR and ggplot2 - Addition of custom functions script (sourced in plotting script)
- added a `read_in_cnv` function - removed `set.seed()` as it seemed unnecessary
…v-comparison-plotting
`set.seed` no longer seems to be necessary in this script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Very-well documented so it was pretty easy to know what was going on. I just have a few suggestions for drying things up and changing your object naming to be more specific. Let me know if you have questions about any of my comments.
Also my take on your two questions:
- Given the purpose of this analysis, I chose to write it in a R script. I used the sample-distribution-analysis module as a guide in this sense. That being said, would anyone argue for the need of separate scripts and a shell script to sequentially execute said scripts?
- I also began to include optparse functions and decided to file a draft PR of the script as is so that I can receive a second opinion on whether or not command line options are necessary for this module (again, I used the sample-distribution-analysis module as a reference).
I could see either adding more arguments to your functions and then running this from a bash script with some optparse
options OR, because this consists of so many plots, making 01-cnv-comparison-plotting
into an R Notebook so the plots could be seen all in one spot. I'd say one or the other. We can see what @jashapiro thinks.
# Given the file path of the CNV data, read in the data. | ||
# | ||
# Arg: | ||
# input_directory: file path of input directory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to designate this separately than the file.path
? Can you just make it one file.path
to the data? I found this confusing for a second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I do see how this can be confusing. I am sure this was better suited for my original, more complicated script, but making it one file.path
to the data does seem to make the most sense now.
# dataframe: the data frame containing the input data | ||
|
||
# Read in cnv data | ||
dataframe <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is for sure CNV data, can you name this dataframe
something that reflects that like cnv_df
or etc?
read.table(gzfile(file.path(input_directory, file_path)), header = TRUE) | ||
|
||
# Rename the columns | ||
dataframe <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataframe <- | |
colnames(dataframe) <- c("chrom", "loc.start", "loc.end", "ID", "num.mark", "seg.mean") |
IDK if this is really all that better but at least you don't have to say dataframe
twice. Could also use dplyr::rename
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better from a memory standpoint. No need to copy the data frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now realize that my comment above this line was misleading as this was meant to rearrange the columns, rather than simply rename them. That being said, the function colnames
won't be feasible here. I am thinking of using dplyr::select
as the replacement here unless either of you may have a better solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, do dplyr::select/rename
# a column with the values 0 and 1 to represent loss and gain, respectively. | ||
# | ||
# Args: | ||
# dataframe: data.frame containing CNV caller output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to have the same comment here as above. dataframe
doesn't tell us too much about what the object is supposed to be.
# | ||
# Rscript analyses/cnv-comparison/util/cnv-comparison-functions.R | ||
|
||
read_in_cnv <- function(input_directory, file_path){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation for this being its own function rather than just using these steps as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not saying it shouldn't be a function necessarily, just wanna know the motivation. Perhaps as I read the rest of this PR it will make more sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this its own function due to the repetitive nature of the original code I had reading in the original and subset CNV files. In other words, this was my attempt to DRY it up a bit (which I believe can now be achieved when I implement your suggestion of putting the datasets in a list and using lapply to execute this function and subsequent functions).
# | ||
# Return: | ||
# barplot: barplot depicting the proportion of aberrations detected across | ||
# chromosomes, annoated by the broad_histology variable in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# chromosomes, annoated by the broad_histology variable in the | |
# chromosomes, annotated by the `broad_histology` variable in the |
metadata <- | ||
readr::read_tsv(file.path(input_directory, "pbta-histologies.tsv")) | ||
|
||
# Read in cnvkit data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since, you've made your functions, you could put all your datasets in a list and then call the function from there so you only have to say it once.
cnv_data <- lapply(<LIST_OF_FILE_PATHS>, read_in_cv)
Now for the rest of your functions, you would only have to run it in one line:
cnv_filtered <- lapply(cnv_data, filter_segmean)
Let me know if this makes sense, if not I can expound on this idea.
# Filter the controlfreec subset data for cutoff in preparation for plotting | ||
controlfreec_subset <- filter_segmean(controlfreec_subset) | ||
|
||
#### Plot frequency and proportion using GenVisR ------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do the same lapply strategy here. But then twice, once for proportion
and once for frequency
.
# Run `plot_cnFreq` for cnvkit (has to be run with subset because | ||
# original dataset is too large for function) | ||
cnvkit_prop_plot <- | ||
plot_cnFreq(cnvkit_format, "proportion", "CNVkit proportion") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are always planning on running both proportion and frequency, you could make an argument in your function that runs both automatically so you don't have to repeat yourself.
|
||
##### Plot barplots using ggplot2 ---------------------------------------------- | ||
|
||
# Run `plot_histology_barplot` on cnvkit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section could also be lapply
-ed and made shorter.
…into cnv-comparison-plotting
@jashapiro I'll wait on your review before changing this script into a R notebook or running it from a bash script with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Biggest comment is about fixing the order of the chromosomes by making that column a factor after reading. Just make sure you don't accidentally recode them!
|
||
# Rearrange the columns | ||
cnv_df <- cnv_df %>% | ||
dplyr::select("chrom", "loc.start", "loc.end", "ID", "num.mark", "seg.mean") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point (I think) you should also recode the chromosome into a factor, with logical order. You want the levels to be:
paste0("chr", c(1:22, "X", "Y"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will do.
|
||
} | ||
|
||
plot_cn_freq <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this function? IT really just calls the GenVisR function, so it seems redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also see your point here, however, the motivation behind making this a function was to keep the calling of functions in the 01-cnv-comparison-plotting.R
script uniformed and the number of lines of code to a minimum.
That being said, would your recommendation here be to call the GenVisR
function in the 01
script nonetheless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be my rec, yes. You should still be able to use the lapply, though you may need to add named arguments, depending on the order that GenVisR::cnFreq
expects the additional options.
- removed the `plot_cnfreq` function from the functions script and implemented `GenVisR::cnFreq` function in the `01` script - reordered the chromosome levels
dplyr::select("chrom", "loc.start", "loc.end", "ID", "num.mark", "seg.mean") | ||
|
||
# Reorder chromosome levels | ||
levels(cnv_df$chrom) <- c(paste0("chr", c(1:22, "X", "Y"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above doesn't actually convert to a factor, but makes it a kind of strange hybrid character with a levels attribute:
dplyr::mutate(chrom = factor(chrom, levels = paste0("chr", c(1:22, "X", "Y"))
should work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the before and after plots, you can see that the order didn't actually change but the labels did. Which would be bad...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I thought it was strange when I initially tried to convert to a factor, what R claimed to already be a factor. Thanks, I'll give that a shot now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooh... add stringsAsFactors=FALSE
to your read.table (or use readr...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also made that realization a minute ago. That'll also be in this upcoming commit 👍
- included `stringsAsFactors = FALSE` argument to `read.table` function
|
||
if (!("cowplot" %in% installed.packages())) { | ||
install.packages("BiocManager") | ||
BiocManager::install("cowplot") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to install this from cran?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I cannot pinpoint my initial motivation here so I can change this to install from cran.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good now, though I am very much confused by the results: the two callers are kind of shockingly different, and I find it very strange that ControlFREEC is calling more on the smaller chromosomes. That result makes me somewhat suspicious about what is going on, and I would like to hear more about the data there.
There is one major change that I think would be good for the ggplot-based plots, and that is to combine the two data frames with an extra column designating whether the sample is from CNVkit vs ControlFREEC, then using faceting to split the two, rather than relying on cowplot. This would have the advantage of putting the two plots on the same scale. You can do this pretty easily with dplyr::bind_rows()
using the .id
option to specify the new column name.
I would also like to see the results for the histology plot separated by aberration type, especially since the two methods have such different proportions of gain/loss calls.
On what I think was one of the original questions, I do think this script might probably be better off converted to a notebook, to give some space to comment a bit on these results, rather than just producing the raw plots.
meta_joined <- filtered_cnv_df %>% | ||
dplyr::inner_join(metadata, by = c("sample" = "Kids_First_Biospecimen_ID")) | ||
|
||
# Create barplot where the y-axis represents the size of aberration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second look, I am pretty confused by these plots. What is actually being plotted here? I had thought it was the number of each type in each chromosome, but the stat = "identity"
is confusing me. Would the default stat = "count"
make more sense?
On some investigation, it looks like these ultimately produce the same plots as the ones that I was expecting, but if you leave off the y
aesthetic, and use the default geom_bar()
(no stat) you get a much smaller file size and better y axis labels.
# Create barplot where the y-axis represents the size of aberration | ||
barplot <- ggplot2::ggplot(filtered_cnv_df, | ||
ggplot2::aes(x = chromosome, | ||
y = as.factor(aberration), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier comment about the y axis/stat
fill = broad_histology)) + | ||
ggplot2::geom_bar(stat = "identity") + | ||
ggplot2::theme_bw() + | ||
ggplot2::theme(axis.text.y = ggplot2::element_blank()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add y axis label?
@cbethell and @jashapiro - would love to see this in a notebook. Of note, we are still benchmarking (nearing the end) the CN algorithms (and have a caveat on this in the README). We had a few samples show genome-wide gain and after tweaking parameters and adding options for BAF adjustment, we have much better results. Of additional note, I thought I mentioned this, but forgive me if not, I think we would ideally use results from #128 for this ticket, though I think @cgreene and I thought this ticket could still be worked on to get the code ready for a later plug and chug. We have not created a PR for #128 yet because we realized we had to use different CN files from what we had released (containing predicted gain/loss info without the need to threshold LRR) in order to create consensus (these data will be added, hopefully in #146), but the early data is only showing ~50% consensus. We are now working on integrating Manta as a 3rd algorithm and doing a 2/3 consensus and want to toy with degree of overlap based on size of CNV, rather than 2/2. Long story short, I can help you review the results by histology if in a notebook based on what we know about certain diseases and CNV expectations! |
From #8 (comment):
This appears to have the functionality to read in files in that format: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/142/files#diff-fd9d53007cdfd11cf0ae23e3eede7eabR12
What are you looking for in a notebook @jharenza? Do you want to see the plot here:
My inclination is to, once checked for correctness, get this merged and iterate on it. @jashapiro thoughts? |
…into cnv-comparison-plotting
…/OpenPBTA-analysis into cnv-comparison-plotting
- combined data.frames and used faceting to plot ggplots
@jaclyn-taroni ahh thanks for the link, i didn't readily see the plot - the notebook idea was just to see everything gathered together for quick visuals. My question for this analysis - is this more of a QC or do we want to turn this into a manuscript figure? If the latter, I think having this plotted in GISTIC-type format for the different histologies or as an IGV-type figure, grouped by histology, would be most effective for showing recapitulation of previous findings. |
@jharenza I can work on plotting this as an IGV-type figure. My initial attempt to apply GISTIC here was unsuccessful due to compatibility issues with my OS. However, I can give it a more thorough attempt or explore other avenues to produce this figure. I will mention you on this issue for further review once produced. |
OK - I can run GISTIC quickly on the v5 data and potentially add this to the data release, will send you the files so you can get started. I also noticed maftools updated some of their CN plotting abilities, so producing those per histology could be nice, too. |
@jashapiro I addressed the comments you made above. Does this look ready to merge now? The changes involving GISTIC can be made and filed in a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. One little simplifying tweak. Changing to separate gain and loss really highlights how strange the control-Freec gain calls are.
file.path(input_directory, "pbta-cnv-cnvkit.seg.gz"), | ||
file.path(input_directory, "pbta-cnv-controlfreec.seg.gz") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name here to save work later.
file.path(input_directory, "pbta-cnv-cnvkit.seg.gz"), | |
file.path(input_directory, "pbta-cnv-controlfreec.seg.gz") | |
cnvkit = file.path(input_directory, "pbta-cnv-cnvkit.seg.gz"), | |
controlfreec = file.path(input_directory, "pbta-cnv-controlfreec.seg.gz") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use <-
assignment though please 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not assignment, naming!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope I'm wrong!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎢
combined_cnv_filtered$cnv_caller <- gsub("1", "cnvkit", combined_cnv_filtered$cnv_caller) | ||
combined_cnv_filtered$cnv_caller <- gsub("2", "controlfreec", combined_cnv_filtered$cnv_caller) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the named list above, this should be correct automatically, so you should be able to delete these two lines.
fill = broad_histology)) + | ||
ggplot2::geom_bar() + | ||
ggplot2::theme_bw() + | ||
ggplot2::facet_grid(cnv_caller~aberration) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ggplot2::facet_grid(cnv_caller~aberration) + | |
ggplot2::facet_grid(cnv_caller ~ aberration) + |
fill = aberration)) + | ||
ggplot2::geom_bar() + | ||
ggplot2::theme_bw() + | ||
ggplot2::facet_grid(cnv_caller~aberration) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ggplot2::facet_grid(cnv_caller~aberration) + | |
ggplot2::facet_grid(cnv_caller ~ aberration) + |
- fixed spacing around `~` operator - fixed line spacing in `.Rmd` file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! (Except for ControlFreec...)
Thank you and agreed! |
Purpose/implementation
The purpose of this draft PR is to produce a copy number plot showing the recurrently amplified/deleted regions detected by CNVkit and Control-FREEC.
The initial plan was to implement the TAGCNA workflow, but I encountered some issues when attempting to format the data to fit their pipeline.
The
GenVisR
package was used as a temporary alternative to TAGCNA and the plots produced via this package include:analyses/cnv-comparison/plots/compare_cnv_output_frequency.pdf
analyses/cnv-comparison/plots/compare_cnv_output_proportion.pdf
The
ggplot2
package was used to produce the remaining plots found in theplots
directory of this module.Issue
This PR addresses issue #8
Directions for reviewers
Per [the comment made] on the relevant issue #8, I attempted to implement the TAGCNA workflow. However, it was unsuccessful due to the specificity of the input format it expects.
I am still looking into the best way to produce figures like those in the TAGCNA documentation, but felt it necessary to show what has been produced thus far.
Given the purpose of this analysis, I chose to write it in a R script. I used the
sample-distribution-analysis
module as a guide in this sense. That being said, would anyone argue for the need of separate scripts and a shell script to sequentially execute said scripts?I also began to include
optparse
functions and decided to file a draft PR of the script as is so that I can receive a second opinion on whether or not command line options are necessary for this module (again, I used thesample-distribution-analysis
module as a reference).Results
This script produces five plots as follows:
analyses/cnv-comparison/plots/compare_cnv_output_barplot_aberration.pdf
analyses/cnv-comparison/plots/compare_cnv_output_barplot_histology.pdf
analyses/cnv-comparison/plots/compare_cnv_output_boxplot.pdf
analyses/cnv-comparison/plots/compare_cnv_output_frequency.pdf
analyses/cnv-comparison/plots/compare_cnv_output_proportion.pdf
Docker and continuous integration
PR Checklist