diff --git a/README.md b/README.md
index a0e53a4..79f0cf4 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,9 @@
-# Review of asexual genomes
+# Review of parthenogenetic genomes
 
-This repository serves for the analyses performed for a metastudy of asexual genomes.
+This is a repository of analyses performed in [Genomic features of parthenogenetic animals](https://www.biorxiv.org/content/10.1101/497495v2).
 
-The idea is to review all genomes of asexual animals and compare patterns observed. Other eukaryotes will be only discussed.
-One of difficulties is to compare different genomics projects that are based on different inference methods and focus on different aspects. Therefore we estimate the most of the genomic properties using unified methodology.
+The idea is to review all genomes of asexual animals and compare patterns observed.
+One of difficulties is to compare different genomics projects that are based on different inference methods and focus on different aspects. Therefore we estimate all the genomic features possible using unified methodology.
 
 ### Regenerating figures
 
@@ -11,20 +11,39 @@ The main figures are be plotted by following R scripts
 
 ```
 Rscript scripts/plot_figure_1_questions.R --tricolor
+# generates figures/fig1_genomic_studies.pdf
 Rscript scripts/plot_figure_2_heterozygosity.R --split_axis --homoeolog --rm_boxes
+# generates figures/fig2_heterozygosity_split_axis_homoeolog.pdf
 Rscript scripts/plot_figure_3_heterozygosity_structure.R
+# generates figures/fig3_heterozygosity_of_tetraploids.pdf
 Rscript scripts/plot_figure_4_TEs.R
+# generates figures/fig4_TEs.pdf
 ```
 
 The supplementary figures
 
 ```
-# supp figure 1 is a derivative of figure 1
 Rscript scripts/plot_figure_1_questions.R --refs --both --tricolor
-Rscript scripts/plot_figure_S4_expected_heterozygosity_structure.R
-Rscript scripts/plot_figure_S7_TEs_vs_mode_and_origin.R
+# supp figure 3 is a derivative of figure 1
+# generates figures/SM_Figure_3_genomic_studies.pdf
+Rscript scripts/plot_figure_S6_expected_heterozygosity_structure.R
+Rscript scripts/plot_figure_S8_TEs_vs_mode_and_origin.R
+# figures/SM_Figure_8_TEs
 ```
 
+### Supplementary tables
+
+[Supplementary Table 1](LaTeX/SM_table_1_reproduction_modes.pdf): Overview of analysed species. This information was collected directly from the cited literature. References include information regarding cellular mode of reproduction, origin and/or the age of parthenogenesis.
+
+
+[Supplementary Table 2](tables/genome_table_infered_from_reads.tsv): Genomic features calculated from raw data. We used unified methods to estimate basic genomic properties directly from sequencing reads. Ploidy was estimated using smudgeplot for all species but A. vaga (see section Heterozygosity structure in polyploids for details). Genome size, heterozygosity and repeats were estimated using GenomeScope. Repeats denote the fraction of the genome occurring in more than one copy. The classified repeats, TEs and other types of classified repeats, were estimated using DnaPipeTE.
+
+
+[Supplementary Table 3](tables/assembly_table.tsv): genome assemblies: size, number of scaffolds, N50, BUSCO, number of annotated genes. Statistics were calculated from the published genome assemblies and genome annotations shared by authors. BUSCO genes were searched using the metazoan database for all the non-nematode species. Nematodes are notoriously known for the high turnover of genes and we therefore used nematode specific BUSCO genes. The number of annotated genes were calculated as the number of lines in the annotation with the tag “gene”. The number of genes was extracted using the tag “mRNA” since the keyword “gene” was not in the annotation file of Diploscapter coronatus.
+
+Supplementary Table 4: Horizontal gene transfer analysis.
+HGT candidate genes identified from comparisons to [UniRef90](tables/JOH-2020-024.S4Table.HGT_sheet1_uniref.tsv) and [UniProtKB/Swissprot](tables/JOH-2020-024.S4Table.HGT_sheet2_uniprot.tsv) databases. Column TaxID is NCBI TaxID for focal species; Num.genes is the Number of protein-coding genes in annotation; HGTc is Horizontal gene transfer candidates (i.e. putative foreign gene); Columns E-H: Phylum/Class/Order/Family indicates the taxonomic level at which hits to the focal animal's lineage were discounted; Columns I-L: HGTc expressed as a percentage of total CDS encoded in focal genome (Column D). The supplementary table in the paper is an excel sheet.
+
 ### List of performed analysis
 
 - [GenomeScope](https://github.com/tbenavi1/genomescope) `v2 dev` - genome profiling from kmer spectra of sequencing reads. Estimate of genome size, heterozygosity and repetitive content.
@@ -40,15 +59,11 @@ Rscript scripts/plot_figure_S7_TEs_vs_mode_and_origin.R
 The labels of genomes are composed of **G**enus and **spe**cies name `Gspe` and an index, which serves only as a distinction of the different sequencing projects.
 The full list of genomes considered in this study is in the table [tables/download_table.tsv](tables/download_table.tsv).
 
-## Development
-
-**What should be in this repository:**
+**What is in this repository:**
 
 - scripts for downloading, processing and analyzing asexual genomes
-- a small table of analyzed asexual genomes, their code names and urls for downloading
-- one big table -> an overview of all the asexual genomes
-- other small summary tables of computationally intensive tasks
-- the paper
+- a table of analyzed asexual genomes, their code names and urls for downloading
+- summary tables of the results
 
 The analysis is automated using [snakemake](https://snakemake.readthedocs.io/en/stable/), tested with version `4.8.0`. The scripts for analysis are combination of `bash`, `R` anf `python`.
 
@@ -67,7 +82,7 @@ to run default with other flags you can run
 ./snakemake_clust.sh " " {other_flags} {other_flags} ...
 ```
 
-### Execution of different cluster
+### Execution on a different cluster
 
 `Snakefile` has no hardcoded any cluster-specific parameters. The resources should be accessed as `{resources.mem}` for memory in kilobytes, `{resources.tmp}` for needed local storage in megabytes and `{threads}` for number of used cores. The command used for cluster execution is stored in a bash wrapper `snakemake_clust.sh`. Modify this script as needed to work with syntax of your cluster. It uses environmental variable `USE_LOCAL` to access if computations should be performed on local disks of computational nodes or not (the job wrapper is `scripts/use_local.sh` and it might need to be adjusted to different cluster settings, now it's set for lsf).
 
diff --git a/scripts/make_hgt_table.R b/scripts/make_hgt_table.R
deleted file mode 100644
index f2d222b..0000000
--- a/scripts/make_hgt_table.R
+++ /dev/null
@@ -1,25 +0,0 @@
-# pull data from Reuben's google doc and turn it into supplementary a table
-
-library('gsheet')
-
-hgt_data <- read.csv(text = gsheet2text("https://docs.google.com/spreadsheets/d/1vxEQ51UdlunRDR9Nr9eeDGI8xSHGd-Hqa9fq8C3fzv0/edit?usp=sharing", format='csv'),
-                     stringsAsFactors = F, header = T, check.names = F)
-output_table <- 'tables/HGT_table.tsv'
-
-columns_to_keep <- c('id', 'num_proteins', 'uniref_hits', 'uniref_hits_perc',
-                     'kphylum', 'kphylum%', 'kphylum+linked', 'kphylum+linked%')
-
-# keep only the complete data
-hgt_data <- hgt_data[hgt_data$status == 'complete',]
-
-sexuals <- grepl("^s", hgt_data$id)
-hgt_data_sexuals <- hgt_data[sexuals, ]
-hgt_data_asexuals <- hgt_data[!sexuals, ]
-
-length(unique(hgt_data_asexuals$species)) - 2
-# - 2 is for three Panagrolaimus strains that are considered as one species in this study
-# 23 species
-
-tab_to_save <- hgt_data[,colnames(hgt_data) %in% columns_to_keep]
-
-write.table(tab_to_save, output_table, quote = F, sep = '\t', row.names = F)
diff --git a/tables/HGT_table.tsv b/tables/HGT_table.tsv
deleted file mode 100644
index b3c79ec..0000000
--- a/tables/HGT_table.tsv
+++ /dev/null
@@ -1,34 +0,0 @@
-id num_proteins uniref_hits kphylum kphylum% kphylum+linked kphylum+linked%
-Anan1 35,495 22,568 894 2.52% 255 0.72%
-Aric1 58,449 39,080 6,420 10.98% 5,864 10.03%
-Avag1 67,256 42,130 8,142 12.11% 7,174 10.67%
-Dcor1 34,063 29,310 217 0.64% 215 0.63%
-Dpac1 38,423 38,276 3,199 8.33% 208 0.54%
-Fcan1 28,734 28,634 1,828 6.36% 1,822 6.34%
-sOcin1 20,247 20,111 1,290 6.37% 972 4.80%
-Hduj1 20,252 18,582 371 1.83% 366 1.81%
-Lcla1 50,004 16,773 2,332 4.66% 324 0.65%
-Mare1 101,269 67,158 1,106 1.09% 476 0.47%
-Mare2 27,633 23,504 340 1.23% 64 0.23%
-Mbel1 29,883 16,696 214 0.72% 176 0.59%
-Ment1 29,578 23,698 292 0.99% 65 0.22%
-Mflo1 12,763 10,712 153 1.20% 41 0.32%
-Minc1 43,718 36,581 651 1.49% 352 0.81%
-Minc2 23,680 19,070 284 1.20% 86 0.36%
-Mjav1 97,208 65,270 981 1.01% 317 0.33%
-Mjav2 24,753 21,611 353 1.43% 88 0.36%
-Obir1 24,105 24,051 3,550 14.73% 3,544 14.70%
-sHsal1 26,735 26,660 3,929 14.70% 2,486 9.30%
-sCfls1 23,922 23,882 3,554 14.86% 2,813 11.76%
-Pdav1 31,630 15,498 2,481 7.84% 54 0.17%
-Pfor1 49,526 49,345 212 0.43% 181 0.37%
-Ps591 26,760 16,093 347 1.30% 116 0.43%
-Ps791 18,060 11,594 272 1.51% 34 0.19%
-Psam1 40,530 25,760 404 1.00% NA NA
-Pvir1 22,206 10,728 823 3.71% 211 0.95%
-Rmac1 25,818 18,066 3,232 12.52% 2,167 8.39%
-Rmag1 37,284 22,524 3,754 10.07% 2,674 7.17%
-Rvar1 14,248 13,981 198 1.39% 186 1.31%
-Tpre1 13,200 12,228 1,618 12.26% 1,336 10.12%
-sNvit1 24,529 24,448 3,141 12.81% 2,812 11.46%
-sCflm1 17,189 17,125 2,228 12.96% 1,806 10.51%
diff --git a/tables/JOH-2020-024.S4Table.HGT_sheet1_uniref.tsv b/tables/JOH-2020-024.S4Table.HGT_sheet1_uniref.tsv
new file mode 100644
index 0000000..51b1c91
--- /dev/null
+++ b/tables/JOH-2020-024.S4Table.HGT_sheet1_uniref.tsv
@@ -0,0 +1,33 @@
+ID Species Type TaxID Num.genes HGTc.phylum HGTc.class HGTc.order HGTc.family HGTc.phylum.% HGTc.class.% HGTc.order.% HGTc.family.%
+Anan1 Acrobeloides_nanus Asexual 290746 35495 255 255 327 235 0.72% 0.72% 0.92% 0.66%
+Aric1 Adineta_ricciae Asexual 249248 58449 5864 5780 5567 5567 10.03% 9.89% 9.52% 9.52%
+Avag1 Adineta_vaga Asexual 104782 67256 7174 7071 6647 6647 10.67% 10.51% 9.88% 9.88%
+Dcor1 Diploscapter_coronatus Asexual 288516 34063 215 186 86 61 0.63% 0.55% 0.25% 0.18%
+Dpac1 Diploscapter_pachys Asexual 2018661 38423 208 199 120 145 0.54% 0.52% 0.31% 0.38%
+Fcan1 Folsomia_candida Asexual 158441 28734 1822 775 775 491 6.34% 2.70% 2.70% 1.71%
+sOcin1 Orchesella_cincta Sexual 48709 20247 972 211 210 38 4.80% 1.04% 1.04% 0.19%
+Hduj1 Hypsibius_dujardini Asexual 232323 20252 366 366 366 225 1.81% 1.81% 1.81% 1.11%
+Lcla1 Leptopilina_clavipes Asexual 63434 50004 324 413 187 62 0.65% 0.83% 0.37% 0.12%
+Mare1 Meloidogyne_arenaria Asexual 6304 101269 476 417 416 210 0.47% 0.41% 0.41% 0.21%
+Mare2 Meloidogyne_arenaria Asexual 6304 27633 64 65 69 26 0.23% 0.24% 0.25% 0.09%
+Mbel1 Mesorhabditis_belari Asexual 2138241 29883 176 165 85 61 0.59% 0.55% 0.28% 0.20%
+Ment1 Meloidogyne_enterica Asexual 390850 29578 65 73 85 44 0.22% 0.25% 0.29% 0.15%
+Mflo1 Meloidogyne_floridensis Asexual 298350 12763 41 48 52 27 0.32% 0.38% 0.41% 0.21%
+Minc1 Meloidogyne_incognita Asexual 6306 43718 352 343 255 122 0.81% 0.78% 0.58% 0.28%
+Minc2 Meloidogyne_incognita Asexual 6306 23680 86 83 93 47 0.36% 0.35% 0.39% 0.20%
+Mjav1 Meloidogyne_javanica Asexual 6303 97208 317 324 314 165 0.33% 0.33% 0.32% 0.17%
+Mjav2 Meloidogyne_javanica Asexual 6303 24753 88 97 105 59 0.36% 0.39% 0.42% 0.24%
+Obir1 Ooceraea_biroi Asexual 2015173 24105 3544 3222 630 7 14.70% 13.37% 2.61% 0.03%
+sHsal1 Harpagoxenus_saltator Sexual 610380 26735 2486 2843 633 11 9.30% 10.63% 2.37% 0.04%
+sCfls1 Camponotus_floridanus Sexual 104421 23922 2813 2899 638 6 11.76% 12.12% 2.67% 0.03%
+Pdav1 Panagrolaimus_davidi Asexual 227884 31630 54 54 78 96 0.17% 0.17% 0.25% 0.30%
+Pfor1 Poecilia_formosa Asexual 48698 49526 181 51 0 0 0.37% 0.10% 0.00% 0.00%
+Ps591 Panagrolaimus_sp.PS1159 Asexual 55785 26760 116 119 118 81 0.43% 0.44% 0.44% 0.30%
+Ps791 Panagrolaimus_sp.PS1579 Asexual 310962 18060 34 36 33 25 0.19% 0.20% 0.18% 0.14%
+Pvir1 Procambarus_virginalis Asexual 2065263 22206 211 49 45 41 0.95% 0.22% 0.20% 0.18%
+Rmac1 Rotaria_macrura Asexual 392029 25818 2167 2122 1929 1929 8.39% 8.22% 7.47% 7.47%
+Rmag1 Rotaria_magnacalcarata Asexual 392030 37284 2674 2626 2415 2415 7.17% 7.04% 6.48% 6.48%
+Rvar1 Ramazzottius_varieornatus Asexual 947166 14248 186 186 186 69 1.31% 1.31% 1.31% 0.48%
+Tpre1 Trichogramma_pretiosum Asexual 7493 13200 1336 1148 181 8 10.12% 8.70% 1.37% 0.06%
+sNvit1 Nasonia_vitripennis Sexual 7425 24529 2812 2564 503 2 11.46% 10.45% 2.05% 0.01%
+sCflm1 Copidosoma_floridanum Sexual 29053 17189 1806 1634 229 3 10.51% 9.51% 1.33% 0.02%
diff --git a/tables/JOH-2020-024.S4Table.HGT_sheet2_uniprot.tsv b/tables/JOH-2020-024.S4Table.HGT_sheet2_uniprot.tsv
new file mode 100644
index 0000000..ef163d8
--- /dev/null
+++ b/tables/JOH-2020-024.S4Table.HGT_sheet2_uniprot.tsv
@@ -0,0 +1,33 @@
+ID Species Type TaxID Num.genes HGTc.phylum HGTc.class HGTc.order HGTc.family HGTc.phylum.% HGTc.class.% HGTc.order.% HGTc.family.%
+Anan1 Acrobeloides_nanus Asexual 290746 35495 235 235 235 205 0.66% 0.66% 0.66% 0.58%
+Aric1 Adineta_ricciae Asexual 249248 58449 2236 2532 2532 2532 3.83% 4.33% 4.33% 4.33%
+Avag1 Adineta_vaga Asexual 104782 67256 2597 2948 2948 2948 3.86% 4.38% 4.38% 4.38%
+Dcor1 Diploscapter_coronatus Asexual 288516 34063 387 475 475 466 1.14% 1.39% 1.39% 1.37%
+Dpac1 Diploscapter_pachys Asexual 2018661 38423 386 463 463 456 1.00% 1.21% 1.21% 1.19%
+Fcan1 Folsomia_candida Asexual 158441 28734 898 776 776 776 3.13% 2.70% 2.70% 2.70%
+sOcin1 Orchesella_cincta Sexual 48709 20247 330 279 279 279 1.63% 1.38% 1.38% 1.38%
+Hduj1 Hypsibius_dujardini Asexual 232323 20252 260 260 260 260 1.28% 1.28% 1.28% 1.28%
+Lcla1 Leptopilina_clavipes Asexual 63434 50004 56 86 71 71 0.11% 0.17% 0.14% 0.14%
+Mare1 Meloidogyne_arenaria Asexual 6304 101269 437 437 439 357 0.43% 0.43% 0.43% 0.35%
+Mare2 Meloidogyne_arenaria Asexual 6304 27633 77 77 79 60 0.28% 0.28% 0.29% 0.22%
+Mbel1 Mesorhabditis_belari Asexual 2138241 29883 258 258 258 252 0.86% 0.86% 0.86% 0.84%
+Ment1 Meloidogyne_enterica Asexual 390850 29578 74 74 74 57 0.25% 0.25% 0.25% 0.19%
+Mflo1 Meloidogyne_floridensis Asexual 298350 12763 39 39 39 29 0.31% 0.31% 0.31% 0.23%
+Minc1 Meloidogyne_incognita Asexual 6306 43718 378 499 500 376 0.86% 1.14% 1.14% 0.86%
+Minc2 Meloidogyne_incognita Asexual 6306 23680 99 99 100 81 0.42% 0.42% 0.42% 0.34%
+Mjav1 Meloidogyne_javanica Asexual 6303 97208 299 299 299 243 0.31% 0.31% 0.31% 0.25%
+Mjav2 Meloidogyne_javanica Asexual 6303 24753 91 91 93 80 0.37% 0.37% 0.38% 0.32%
+Obir1 Ooceraea_biroi Asexual 2015173 24105 173 171 114 114 0.72% 0.71% 0.47% 0.47%
+sHsal1 Harpagoxenus_saltator Sexual 610380 26735 193 192 139 139 0.72% 0.72% 0.52% 0.52%
+sCfls1 Camponotus_floridanus Sexual 104421 23922 183 181 126 126 0.76% 0.76% 0.53% 0.53%
+Pdav1 Panagrolaimus_davidi Asexual 227884 31630 203 203 205 193 0.64% 0.64% 0.65% 0.61%
+Pfor1 Poecilia_formosa Asexual 48698 49526 5890 103 100 100 11.89% 0.21% 0.20% 0.20%
+Ps591 Panagrolaimus_sp.PS1159 Asexual 55785 26760 148 148 148 110 0.55% 0.55% 0.55% 0.41%
+Ps791 Panagrolaimus_sp.PS1579 Asexual 310962 18060 25 25 26 22 0.14% 0.14% 0.14% 0.12%
+Pvir1 Procambarus_virginalis Asexual 2065263 22206 18 13 13 13 0.08% 0.06% 0.06% 0.06%
+Rmac1 Rotaria_macrura Asexual 392029 25818 772 772 772 772 2.99% 2.99% 2.99% 2.99%
+Rmag1 Rotaria_magnacalcarata Asexual 392030 37284 936 936 936 936 2.51% 2.51% 2.51% 2.51%
+Rvar1 Ramazzottius_varieornatus Asexual 947166 14248 174 174 174 174 1.22% 1.22% 1.22% 1.22%
+Tpre1 Trichogramma_pretiosum Asexual 7493 13200 305 304 273 273 2.31% 2.30% 2.07% 2.07%
+sNvit1 Nasonia_vitripennis Sexual 7425 24529 179 177 118 118 0.73% 0.72% 0.48% 0.48%
+sCflm1 Copidosoma_floridanum Sexual 29053 17189 121 118 72 72 0.70% 0.69% 0.42% 0.42%
\ No newline at end of file
diff --git a/tables/genome_table.tsv b/tables/genome_table.tsv
index 68a2f0e..c9daa05 100644
--- a/tables/genome_table.tsv
+++ b/tables/genome_table.tsv
@@ -9,7 +9,7 @@ Lcla1 Leptopilina_clavipes gamete_duplication no 2 255.4 36.6 13.8 93.87 4.09 0.
 Tpre1 Trichogramma_pretiosum gamete_duplication no 2 195.1 0.4 3706.2 98.36 0.31 4.4 1.33 192.7 0.03 15.06 6.2 2.3 21.5
 Obir1 Ooceraea_biroi central_fusion no 2 212.8 4.6 1350.7 98.16 1.02 1.02 0.82 184.8 0.53 9.91 12.1 0.6 19.7
 Amel1 Apis_mellifera_capensis central_fusion no 2 NA NA NA NA NA NA NA 238.8 0.82 16.13 5.5 0.4 26.1
-Amel2 Apis_mellifera_capensis central_fusion 2 NA NA NA NA NA NA NA 236.9 0.86 15.39 5.3 0.3 25.8
+Amel2 Apis_mellifera_capensis central_fusion no 2 NA NA NA NA NA NA NA 236.9 0.86 15.39 5.3 0.3 25.8
 Aruf1 Aptinothrips_rufus gamete_duplication no 2 339.9 301.7 4.1 97.55 1.33 0.82 1.12 362.7 0.25 31.27 8.4 2.9 38.2
 Fcan1 Folsomia_candida terminal_fusion no 2 221.7 0.2 6519.4 93.15 1.64 1.94 5.21 295.7 0.125 45.81 8.9 0.5 28.1
 Dpul2 Daphnia_pulex central_fusion yes 2 NA NA NA NA NA NA NA NA NA NA 6.1 4.5 22.8