Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity", Nature Communications, 2020

Grelot/global_fish_genetic_diversity

Code for the paper: "Global determinants of freshwater and marine fish genetic diversity"

Stephanie Manel, Pierre-Edouard Guerin, David Mouillot, Simon Blanchet, Laure Velez, Camille Albouy, Loic Pellissier

Montpellier, 2017-2019

Published in Nature Communications, 2020
Full-text access: https://rdcu.be/b1sXy


A web application is available to display Figure 1 with more details: https://shiny.cefe.cnrs.fr/wfgd/

The code is also available on GitLab: https://gitlab.mbb.univ-montp2.fr/reservebenefit/worldmap_fish_genetic_diversity


Table of contents

  1. Introduction
  2. Installation
    1. Prerequisites
    2. Singularity container
    3. Data Files
    4. Set up
  3. Scripts Code Source
  4. Running the pipeline
    1. Filter raw data
    2. Georeferenced sequences alignments by species
    3. Species sequence pairwise comparison
    4. Genetic diversity calculation
    5. Statistical analysis
      1. Merge genetic data with environmental data by cell
      2. Figures
      3. Analysis
      4. Supplementary figures
    6. Taxonomy and habitat attributed to each individual sequences

1. Introduction

This repository contains all the scripts needed to reproduce the results of the paper Manel et al. (2020) from the georeferenced barcode sequences of the supergroup "Actinopterygii" downloaded from BOLD on 17 September 2018.

The pipeline is composed of 6 steps :

  1. Filter raw data
  2. Georeferenced sequences alignments by species
  3. Species sequence pairwise comparison
  4. Genetic Diversity calculation
  5. Statistical analysis
  6. Taxonomy and habitat attributed to each individual sequences

Figures and statistical analysis can be reproduced directly (see Figures section) without running the whole pipeline.

Only the data files necessary to initiate the whole pipeline and to produce the figures and statistical analyses are provided.

2. Installation

2.1 Prerequisites

You must install the following software and packages to run all steps. For the figures and statistical analysis, only the R packages are needed.

  • JULIA Version 1.1.0
    • julia-module DataFrames
    • julia-module DelimitedFiles
    • julia-module DataFramesMeta
    • julia-module StatsBase
    • julia-module Statistics
    • julia-module CSV
  • R Version 3.2.3
    • R-package raster
    • R-package plotrix
    • R-package sp
    • R-package maptools
    • R-package parallel
    • R-package png
    • R-package plyr
    • R-package shape
    • R-package MASS
    • R-package hier.part
    • R-package countrycode
    • R-package sjPlot
    • R-package gridExtra
    • R-package ggplot2
    • R-package lme4
    • R-package SpatialPack
    • R-package rgeos | if install.packages("rgeos") fails, try: install.packages("https://cran.r-project.org/src/contrib/Archive/rgeos/rgeos_0.3-26.tar.gz", type="source")
    • R-package rgdal | it may require installing "libgdal-dev"
    • R-package rfishbase
    • R-package pgirmess
    • R-package car
  • Python Version 3.6.8
    • python3-module argparse
    • python3-module re
    • python3-module ete3
    • python3-module numpy
    • python3-module csv
    • python3-module difflib
  • MUSCLE Version 3.8.31
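
Before running the pipeline, it can help to check that the command-line tools listed above are reachable. The following is a minimal sketch, not part of the repository: the executable names (`julia`, `Rscript`, `python3`, `muscle`) are assumptions about a typical installation and may differ on your system. It does not check versions or installed packages.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["julia", "Rscript", "python3", "muscle"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
print("Missing from PATH:", ", ".join(missing) if missing else "none")
```

If a tool is reported missing, install it or use the Singularity container described in the next section.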

2.2 Singularity container

Alternatively, you can download and use a singularity container with all prerequisites (R, Julia, Python, Muscle).

Install Singularity

See https://www.sylabs.io/docs/ for instructions to install Singularity.

Download the container

singularity pull --name global_fish_genetic_diversity.simg shub://Grelot/global_fish_genetic_diversity:global_fish_genetic_diversity

Use the container

This command will spawn a shell environment with all prerequisites.

singularity shell global_fish_genetic_diversity.simg

2.3 Data files

The included data files are :

2.4 Set Up

Clone the project and switch to the main folder, which is your working directory:

git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/worldmap_fish_genetic_diversity.git
cd worldmap_fish_genetic_diversity

You are now ready to run the analysis. Follow the instructions in Running the pipeline.

3. Scripts code source

For more information, we provide a detailed description of each script in SCRIPTS.md

4. Running the pipeline

4.1 Filter raw data

  1. Keep only the CO1 sequences with lat/lon information
bash 00-scripts/step1/filter_raw_data.sh
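
To illustrate the idea of this filter (the actual implementation is the shell script above), here is a minimal Python sketch. The column names `processid`, `markercode`, `lat` and `lon` and the marker code `COI-5P` are assumptions about the layout of the BOLD export, and the example records are made up.

```python
import csv
import io


def keep_georeferenced_co1(rows, marker="COI-5P"):
    """Yield records for `marker` that have both a latitude and a longitude."""
    for row in rows:
        lat = (row.get("lat") or "").strip()
        lon = (row.get("lon") or "").strip()
        if row.get("markercode") == marker and lat and lon:
            yield row


# Tiny made-up table; column names are assumed, not taken from the repository.
example_tsv = (
    "processid\tspecies_name\tmarkercode\tlat\tlon\n"
    "A1\tSalmo trutta\tCOI-5P\t45.1\t5.7\n"
    "A2\tSalmo trutta\tCOI-5P\t\t\n"
    "A3\tSalmo trutta\tCYTB\t45.2\t5.8\n"
)
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
kept = [row["processid"] for row in keep_georeferenced_co1(reader)]
print(kept)
```

Only the first record passes: the second has no coordinates and the third uses another marker.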

4.2 Georeferenced sequences alignments by species

  1. Align sequences from the same species with MUSCLE and create a coordinates (.coord) file for each sequence
bash 00-scripts/step2/seq_alnt_filtered_data.sh
  2. According to a list of marine species, move the fasta and coords files into the marine or freshwater folder
mkdir 06-species_alnt_cluster/total
mkdir 06-species_alnt_cluster/freshwater
mkdir 06-species_alnt_cluster/marine
bash 00-scripts/step2/cluster_freshwater_vs_marine.sh
  3. Assign each individual sequence the ID of a cell of the equal-area world map shapefile, based on its coordinates
Rscript 00-scripts/step2/equalareacoords.R
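
The cell assignment in this step relies on the shapefile shipped with the repository. To show the principle only, the sketch below swaps in a different and much simpler technique: a regular grid on a Lambert cylindrical equal-area projection of a spherical Earth. The function name and the 200 km cell size are hypothetical and are not taken from the R script.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def equal_area_cell_id(lat, lon, cell_size_km=200.0):
    """Return a (column, row) index on a Lambert cylindrical equal-area grid.

    Squares of equal size in the projected plane cover equal areas on the
    sphere, so every grid cell represents the same surface area.
    """
    x = EARTH_RADIUS_KM * math.radians(lon)            # east-west position
    y = EARTH_RADIUS_KM * math.sin(math.radians(lat))  # north-south position
    return (math.floor(x / cell_size_km), math.floor(y / cell_size_km))


print(equal_area_cell_id(43.6, 3.9))  # a point near Montpellier
```

Two sequences sampled close to each other usually receive the same index, which is what later steps use to group sequences by cell.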

4.3 Species sequence pairwise comparison

  1. From the species sequence alignments and cell locations, generate pairwise sequence comparison matrices for each species, for both each cell and each latitudinal band.
julia 00-scripts/step3/master_matrices.jl
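
As a rough illustration of a pairwise comparison on aligned sequences, the sketch below computes the proportion of differing sites for every pair. It is written in Python rather than Julia and is not the repository's code; in particular, skipping gaps (`-`) and ambiguous bases (`N`) is an assumption, and the sequences are made up.

```python
from itertools import combinations


def pairwise_difference(seq_a, seq_b, skip="-N"):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a in skip or base_b in skip:
            continue  # ignore gaps and ambiguous bases (assumption)
        compared += 1
        if base_a != base_b:
            differing += 1
    return differing / compared if compared else float("nan")


def pairwise_table(sequences):
    """Map each pair of sequence IDs to its proportion of differing sites."""
    return {
        (id_a, id_b): pairwise_difference(sequences[id_a], sequences[id_b])
        for id_a, id_b in combinations(sorted(sequences), 2)
    }


# Made-up alignment of three sequences from one species in one cell.
alignment = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "AC-T"}
table = pairwise_table(alignment)
for pair, value in table.items():
    print(pair, round(value, 3))
```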

4.4 Genetic diversity calculation

  1. Assign a mean genetic diversity value to each cell and each latitudinal band. For latitudinal bands, we filter out species with no genetic diversity and/or fewer than 3 individuals. Standard deviations are estimated from 1000 bootstrap replicates.
julia 00-scripts/step4/equalarea_numbers.jl
julia 00-scripts/step4/latband_numbers.jl
  2. Generate a table of cell coordinates, mean genetic diversity, number of species, and mean/sd of the number of individuals per species in each cell.
bash 00-scripts/step4/gdval_by_cell.sh
julia 00-scripts/step4/metrics_by_area_and_species.jl
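
The calculation described above can be sketched as follows. This is a simplified Python illustration, not the Julia implementation: it assumes that the value for a cell or band is the mean of per-species diversity values and that species are the resampling unit of the bootstrap. All names and numbers are hypothetical.

```python
import random
import statistics


def keep_species(n_individuals, diversity, min_individuals=3):
    """Latitudinal-band filter: at least 3 individuals and non-zero diversity."""
    return n_individuals >= min_individuals and diversity > 0


def mean_diversity(diversity_by_species):
    """Mean of the per-species genetic diversity values of one cell or band."""
    return statistics.mean(diversity_by_species.values())


def bootstrap_sd(values, replicates=1000, seed=0):
    """Standard deviation of the mean over bootstrap resamples of `values`."""
    rng = random.Random(seed)
    values = list(values)
    means = [
        statistics.mean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(replicates)
    ]
    return statistics.stdev(means)


# Made-up per-species diversity values for one latitudinal band.
band = {"species_a": 0.002, "species_b": 0.004, "species_c": 0.006}
print(mean_diversity(band), bootstrap_sd(band.values()))
```

Fixing the seed makes the bootstrap estimate repeatable between runs.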

4.5 Statistical analysis

To generate the figures (or analyses), you can either type the proposed commands in your terminal or open the R script and run the commands in R.

4.5.1 Merge genetic data with environmental data by cell

  1. Generate a table with, for each cell: the (xy) coordinates of the cell center, cell ID, mean genetic diversity, number of species, mean/sd of the number of individuals per species, bathymetry, chlorophyll concentration, oxygen concentration, temperature, and drainage basin surface area
Rscript 00-scripts/step5/descripteurs.R
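
Conceptually, this step joins the genetic table and the environmental table on the cell ID. The following Python sketch shows such a join on made-up records. The column names (`cell_id`, `mean_gd`, `temperature`) are hypothetical, and keeping only cells present in both tables is an assumption about the intended behaviour.

```python
def merge_by_cell(genetic_rows, environment_rows, key="cell_id"):
    """Join two lists of dict records on a shared cell identifier.

    Only cells present in both tables are kept (inner join).
    """
    environment_by_cell = {row[key]: row for row in environment_rows}
    merged = []
    for row in genetic_rows:
        environment = environment_by_cell.get(row[key])
        if environment is not None:
            merged.append({**environment, **row})
    return merged


# Made-up records; the real tables hold many more descriptors.
genetic = [
    {"cell_id": 1, "mean_gd": 0.004},
    {"cell_id": 2, "mean_gd": 0.002},
]
environment = [
    {"cell_id": 1, "temperature": 18.5},
    {"cell_id": 3, "temperature": 4.0},
]
merged_rows = merge_by_cell(genetic, environment)
print(merged_rows)
```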

4.5.2 Figures

  1. Map of the global distribution of genetic diversity for marine species
Rscript 00-scripts/step5/figures/figure1.R
  2. Congruence between fish genetic and species diversity
Rscript 00-scripts/step5/figures/figure2.R
  3. Determinant of the patterns of fish genetic diversity
Rscript 00-scripts/step5/figures/figure3.R

4.5.3 Analysis

  1. Wilcoxon test to assess whether genetic diversity means differ between marine and freshwater species.
Rscript 00-scripts/step5/analysis/wilcoxon_tests.R
  2. Sensitivity analysis
Rscript 00-scripts/step5/analysis/sensitive_analysis_model.R
  3. Sensitivity analysis based on taxonomic coverage
Rscript 00-scripts/step5/analysis/sensitive_analysis_covtax.R
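
For readers unfamiliar with the Wilcoxon test used in the first analysis above, here is a minimal pure-Python sketch of the two-sample rank-sum version. It uses a normal approximation with tie correction and no continuity correction, so its p-values can differ slightly from those of the R script, which remains the reference. The input values are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def rank_sum_test(sample_a, sample_b):
    """Return (U statistic of sample_a, two-sided p-value)."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    n = n_a + n_b
    ranks, tie_sizes = average_ranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    variance = n_a * n_b / 12 * ((n + 1) - tie_term)
    if variance == 0:
        return u_a, 1.0  # every value is identical, no evidence of a difference
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance)
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Made-up diversity values for two groups of species.
u_stat, p_value = rank_sum_test([0.001, 0.002, 0.003], [0.004, 0.005, 0.006])
print(u_stat, p_value)
```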

4.5.4 Supplementary figures

  1. Spatial autocorrelogram based on Moran's I coefficient
Rscript 00-scripts/step5/supplementary_figures/figureS1.R
  2. Global distribution of higher and lower percentiles of genetic diversity
Rscript 00-scripts/step5/supplementary_figures/figureS2.R
  3. Latitudinal distribution of species diversity
Rscript 00-scripts/step5/supplementary_figures/figureS3.R
  4. Regional effect on the global genetic diversity pattern
Rscript 00-scripts/step5/supplementary_figures/figureS4.R
  5. Sampling effect
Rscript 00-scripts/step5/supplementary_figures/figureS5.R
  6. Taxonomic coverage of the sequences used by the model
Rscript 00-scripts/step5/supplementary_figures/figureS6.R
  7. Spatial distribution of taxonomic coverage
Rscript 00-scripts/step5/supplementary_figures/figureS7.R
  8. Mean intraspecific genetic diversity in each 10° latitudinal band (not in the paper)
Rscript 00-scripts/step5/supplementary_figures/figureS8.R

4.6 Taxonomy and habitat attributed to each individual sequences

  1. Write a table of individual sequences with their geographical cell location
python3 00-scripts/step6/sequences_table.py
  2. Check whether the water type (marine or freshwater) of each species in each cell agrees with the model (marine or freshwater)
Rscript 00-scripts/step6/check_freshwater_assignation.R
  3. Assign habitat information (demersal, pelagic...) to each individual sequence according to its attributed species name
Rscript 00-scripts/step6/sequences_demerpelag.R
  4. Curate the habitat assignment and species name of each individual sequence
Rscript 00-scripts/step6/sequences_cure_species_name.R
  5. Curate the family column by renaming each BOLD family to its equivalent in the NCBI taxonomy
bash 00-scripts/step6/rename_family_bold_to_ncbi.sh
  6. Write a table of the number of species and number of sequences by taxonomic order/family used for each model
python3 00-scripts/step6/sequences_taxonomy.py
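
The final table counts species and sequences per taxonomic group. A minimal sketch of that kind of count is given below. It is not the code of sequences_taxonomy.py: the input is assumed to be a list of (family, species) pairs, one per sequence, and the records are made up.

```python
from collections import defaultdict


def count_by_family(records):
    """Return {family: (number of species, number of sequences)}."""
    species_by_family = defaultdict(set)
    sequences_by_family = defaultdict(int)
    for family, species in records:
        species_by_family[family].add(species)
        sequences_by_family[family] += 1
    return {
        family: (len(species_by_family[family]), sequences_by_family[family])
        for family in sorted(species_by_family)
    }


# Made-up records: one (family, species) pair per individual sequence.
records = [
    ("Salmonidae", "Salmo trutta"),
    ("Salmonidae", "Salmo trutta"),
    ("Salmonidae", "Salmo salar"),
    ("Gadidae", "Gadus morhua"),
]
counts = count_by_family(records)
print(counts)
```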