Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

CNV consensus (6 of 6): Merge consensus files and name columns #403

Merged
merged 55 commits into from
Jan 8, 2020

Conversation

fingerfen
Copy link
Contributor

Purpose/implementation Section

To add the last steps to the consensus calling pipeline. These steps merge any overlapping consensus segments and name the columns

What GitHub issue does your pull request address?

#128

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.
  • This analysis is recorded in the table in analyses/README.md.

sample=config["samples"],
combined_caller=["manta_cnvkit", "manta_freec", "cnvkit_freec"],
dupdel=["dup", "del"]),
OUTPUT= expand("../../scratch/endpoints/{sample}.del_dup.merged.final.bed",
Copy link
Member

@jashapiro jashapiro Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are the final output files, we should probably put them in a results directory within the analysis directory. Perhaps results/cnv_consensus/{sample}.cnv_consensus.csv or similar.

Copy link
Contributor Author

@fingerfen fingerfen Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I forgot to put them in the "results" folder. I have made the changes to reflect this.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fingerfen, thanks for this update!

I have a number of comments associated with some of the individual steps, but here I want to step back a bit and think about what the final file we want out of this is going to look like.

I don't think we will want hundreds of individual .bed files, which seems to be the current output. Instead, what I think we ultimately want is a single .csv file that includes all samples. The files you are already producing include that information, so I think that it should be pretty straightforward to accomplish this.

If I am reading everything correctly, you may be able to reduce this to the following steps:

  1. combine and merge the paired call duplication and deletion files, using the rule at line 162 (which I suggest renaming). Adding the bedtools merge at this stage will reduce redundancy in the code.
  2. cat all the files together into a single csv file, adding the header along the way. Making use of snakemake's expand() here, you should be able to do this in a short bit of code. Something like:
rule merge_all:
    input:
        bedfiles = expand("../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed",
                          sample = config["samples"],
                          dupdel = ["dup", "del"])
    output:
        "results/cnv_consensus.tsv"
    shell:
       "echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tsample\tfile_names' "
       " | cat - {input.bedfiles} "
       " > {output}"

Let me know if that makes sense, or if I am missing something in your current code.

If that does work out, then it will make sense to include the final results/cnv_consensus.tsv file in this PR, so you should make sure you have downloaded the current data release and you can run your pipeline on the full data set.

Along with that, I would really like to have a README file included in this PR that summarizes the approach and output of this analysis. You will also probably want to get started with a description of what you did to add to the methods section, which you can submit as a PR to https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md.

Finally, I also want to make sure you are aware of issue #392, where there is some discussion of the final output format that people might be looking for. I think the current output is a good start, but we may want to think about if there are good summaries we could add to the final calls. I would be happy to hear if you have any thoughts you could add to that discussion!

analyses/copy_number_consensus_call/Snakefile Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/Snakefile Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/Snakefile Outdated Show resolved Hide resolved
threads: 1
shell:
## Name the headers
"python3 {input.script} --file {input.combined_dup_del} > {output.final_dup_del}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very heavy solution for just adding a single header line to the files. Is there a reason to go through python here, rather than just using echo and cat? Something like:

"echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tsample\tfile_names' "
" | cat - {input.combined_dup_del} > {output.final_dup_del}"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason. It completely escaped my mind that I could do this using echo and cat. Your suggestion is way better. I will apply this change. Thanks again!

analyses/copy_number_consensus_call/Snakefile Outdated Show resolved Hide resolved
threads: 1
shell:
## Merge the consensus files together
## Put the dup CNVs calls and del CNVs calls together into one file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in the comments the reasons for the additional bedtools arguments? In particular, the collapse and distinct statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely can. I will briefly explain it here to see if it makes sense for you.

Columns 4, 5, and 6 hold the original CNV calls from Manta, CNVkit, and Freec, respectively. When we merge manta_cnvkit, manta_freec, and cnvkit_freec files together, manta_cnvkit and manta_freec will have information in column 4 (manta column). Thus, when using bedtools merging, we want to COLLAPSE these information and keep all information that is in column 4. The same applies to columns 5 and 6.

Columns 7 and 8 are the CNVtype (DEL, DUP) and Sample_name, respectively. Among the manta_cnvkit, manta_freec, and cnvkit_freec files, at this point, we should have the same CNVtypes and Sample_name. Thus we perform DISTINCT, which is to take the unique of columns 7 and 8.

As for column 9, this column holds the files that were merged to get a specific CNV. Thus we want to keep all information here so we COLLAPSE it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For columns 4, 5 and 6, will any of this result in duplication? For example, if the same original manta call was used in manta_cnvkit and manta_freec, will it now show up more than once? If so, I think we may want to clean that up in the final file, but that may require a bit of extra logic, so we would put it off to a separate PR.

"cat {input.manta_cnvkit} {input.manta_freec} {input.cnvkit_freec} | sort -k1,1 -k2,2n > {output.dup_del_file}"


rule merge:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes above in the combine_files rule, this rule would change to just the cat of dup and del files. As such, we might want to combine this with the column names rule, though see earlier comments about combining the .bed files.

analyses/copy_number_consensus_call/Snakefile Outdated Show resolved Hide resolved
fingerfen and others added 11 commits January 6, 2020 14:05
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
@fingerfen
Copy link
Contributor Author

@jashapiro, thank you for your review and suggestions. Once again, they were very helpful. I think I have applied all of the changes you suggested. Please have a look. I am currently working on the README.md and downloading the newest data release to run the entire pipeline on it from the beginning as suggested. I will let you know when the code finish running the newly downloaded data, hopefully by tomorrow.

I will also be making changes to the methods.md file in a different PR as you suggested.

Thank you

@jashapiro
Copy link
Member

Thanks @fingerfen, this all looks good from a first look, and it passed CI. Let me know if you run into any trouble with running it on the full data set, and I look forward to seeing the README when you get it done.

@fingerfen
Copy link
Contributor Author

fingerfen commented Jan 7, 2020

@jashapiro I have updated the analyses/README.md. I have also made the suggested changes this analysis's README.md. I also added the final output file cnv_consensus.tsv. Please have a look.

For some reason CI is failing, I checked the error and I believe it doesn't have anything to do with this analysis. I think I will have to wait for the error to be resolved.

@jashapiro
Copy link
Member

jashapiro commented Jan 7, 2020

For some reason CI is failing, I checked the error and I believe it doesn't have anything to do with this analysis. I think I will have to wait for the error to be resolved.

I think CI was just getting cancelled as you added each new change. The latest commit passed CI, so everything looks good on the CI front.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I have a few clarification questions, and one where I suggest a change to a column name to correspond better with naming conventions of the project.

The other thing I was thinking about is whether to retain the filenames in the final output. My inclination is not to, as those files won't be coming along, so most users won't have them. I think the easiest way to remove them is to use cut; adding the following to the merge_all rule pipe should handle things, but check that I am right!

cut -f 1-8 

analyses/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/Snakefile Outdated Show resolved Hide resolved
analyses/copy_number_consensus_call/README.md Outdated Show resolved Hide resolved
fingerfen and others added 14 commits January 7, 2020 16:16
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
@jashapiro
Copy link
Member

I think we are really close here, and if CI passes, we can merge this commit. However, before we call this analysis "done" I would like to add one more step to your analysis to clean things up just a bit.

I looked at the output file, and it does look like there is some duplicated information in there that could cause confusion down the line, but should be pretty straightforward to fix at this stage.
In particular, when we look at the caller-specific CNVs used for each consensus call, there can be duplicated entries, as well as extraneous NULL values that we should probably delete.

For example, look at the consensus CNV

chr15	64681606	101072404	64635340:101704930:NA,64635340:101704930:NA,NULL,NULL,NULL	 ... BS_00TRPEQX

The manta_cnv calls there are two of the same location, followed by NULL. In the other callers there are also duplicated CNVs.

It would be nice to write a script that parses through and removes duplicates and NULLs from fields 4,5 and 6 of each record to apply before combining all of the samples together into the final file.

If you have time to do that soon, let me know, or I can tackle it. To be clear, this would be a separate pull request (7 of 6, so to speak), as I think this one has enough in it already.

@fingerfen
Copy link
Contributor Author

@jashapiro I agree that the NULL should be removed and I can write a script to address that no problem. My main concern here are the duplications. Thanks for catching that. It is not supposed to contain duplicated segments. I am curious as to why it has duplicated information and I would like to go back and find out the reason for it. I will take a look at it tonight and hopefully update you about the situation tomorrow.

@fingerfen
Copy link
Contributor Author

So I found out the reason why there are duplicates. It happens during the final bedtools merging step where columns 4, 5, and 6 get COLLAPSED together. In this step, there are 3 files that get merged, manta-cnvkit, manta-freec, and cnvkit-manta. I went back to these files to find any CNVs that overlapped with chr15 64681606 101072404

When inspecting these files, the manta-cnvkit file contained a CNV that overlap with the coordinate above and it has 64635340:101704930:NA in column 4. The manta-freec file also has 1 CNV with 64635340:101704930:NA. The cnvkit-manta file contained 3 CNVs that overlaps with the coordinate above and they all have NULL in column 4. Since I merged these 3 files and COLLAPSED column 4, it resulted in 64635340:101704930:NA,64635340:101704930:NA,NULL,NULL,NULL. This all makes sense.

I think a "solution" to this is to change COLLAPSE to DISTINCT in the merging step. However, I don't think this would get rid of the duplicated segments in more complicated situations. With this, I think it might be useful to make a script to remove duplicates and NULLs from columns 4, 5, and 6.

Please let me know what you think! :)

@jashapiro
Copy link
Member

@fingerfen I suspected that was the cause, so I am glad to see you came to the same conclusion. I also agree that using DISTINCT would likely cause a different set of problems, so I think the correct solution is to include deduplication of those columns in a separate script along with NULL removal, after the bedtools merge.
I am going to approve this PR, so we can put that step in separately.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a deduplication step for columns 4,5,6 that needs to happen still, but this looks good to merge for now.

@jaclyn-taroni jaclyn-taroni merged commit 7917a7f into AlexsLemonade:master Jan 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants