Stephanie Manel, Pierre-Edouard Guerin, David Mouillot, Simon Blanchet, Laure Velez, Camille Albouy, Loic Pellissier
Montpellier, 2017-2019
Published in Nature Communications, 2020
Full-text access: https://rdcu.be/b1sXy
A web application is available to display Figure 1 with more details: https://shiny.cefe.cnrs.fr/wfgd/
The code is also available on GitLab: https://gitlab.mbb.univ-montp2.fr/reservebenefit/worldmap_fish_genetic_diversity
- Introduction
- Installation
- Scripts Code Source
- Running the pipeline
This repository contains all the scripts needed to reproduce the results of the paper Manel et al. (2019) from the georeferenced barcode sequences of the supergroup "actinopterygii" downloaded from BOLD on 17 September 2018.
The pipeline is composed of 6 steps:
- Filter raw data
- Georeferenced sequences alignments by species
- Species sequence pairwise comparison
- Genetic Diversity calculation
- Statistical analysis
- Taxonomy and habitat attributed to each individual sequence
Figures and statistical analysis can be reproduced directly (see Figures section) without running the whole pipeline.
Only the data files necessary to initiate the whole pipeline and to produce the figures and statistical analyses are provided.
To run all the steps, you must install the following software and packages. For figures and statistical analysis, only the R packages are needed.
- JULIA Version 1.1.0
  - DataFrames
  - DelimitedFiles
  - DataFramesMeta
  - StatsBase
  - Statistics
  - CSV
- R Version 3.2.3
  - raster
  - plotrix
  - sp
  - maptools
  - parallel
  - png
  - plyr
  - shape
  - MASS
  - hier.part
  - countrycode
  - sjPlot
  - gridExtra
  - ggplot2
  - lme4
  - SpatialPack
  - rgeos (if install.packages("rgeos") fails, try: install.packages("https://cran.r-project.org/src/contrib/Archive/rgeos/rgeos_0.3-26.tar.gz", type="source"))
  - rgdal (may require installing "libgdal-dev")
  - rfishbase
  - pgirmess
  - car
- Python Version 3.6.8
  - argparse
  - re
  - ete3
  - numpy
  - csv
  - difflib
- MUSCLE Version 3.8.31
Alternatively, you can download and use a singularity container with all prerequisites (R, Julia, Python, Muscle).
See https://www.sylabs.io/docs/ for instructions to install Singularity.
singularity pull --name global_fish_genetic_diversity.simg shub://Grelot/global_fish_genetic_diversity:global_fish_genetic_diversity
This command will spawn a shell environment with all prerequisites.
singularity shell global_fish_genetic_diversity.simg
The included data files are :
- seqbold_data.tsv : georeferenced barcode sequences of the supergroup "actinopterygii" downloaded from BOLD on 17 September 2018
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- equalarea_id_coords.tsv : ID and left/right/top/bottom coordinates of each equal-area cell in the shapefile grid_equalarea200km
- marine_actinopterygii_species.txt : list of "actinopterygii" saltwater species according to fishbase
- marine_bo_o2dis.asc : spatial layer of marine oxygen concentration [mol/l] from gmed
- marine_bo_sst_mean.asc : spatial layer of sea surface temperature from gmed
- marine_velocity_mean.asc : spatial layer of velocity (marine) from gmed
- freshwater_wc2.0_bio_10m_01.tif : spatial layer of global mean temperature from worldclim
- freshwater_velocity_mean.tif : spatial layer of velocity (freshwater) from gmed
- datatoFigshare : shapefile of drainage basins from a global database on freshwater fish species occurrence in drainage basins
- datacell_grid_descriteurs.csv : table of bathymetry, chlorophyll and other information attributed to each cell extracted from gebco and bio-oracle
- equalarea_id_coordsCA_FWRS_MR_RS.csv : Table of species diversity by 200km square gridcell from OBIS
- ne_50m_land : shapefile of worldcoast from naturalearthdata
- ne_50m_rivers_lake_centerlines_scale_rank : shapefile of riverlines from naturalearthdata
- GSHHS_h_L2.shp : shape polygon file of big lakes from naturalearthdata
- EnvFreshwater.csv : slope and flow information for each geographical cell with a river from gebco
- distanceCote : distance from shore for each cell calculated from gmed
- ornament fish images : free silhouette images of fishes from phylopic
Clone the project and switch to the main folder. This is your working directory.
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/worldmap_fish_genetic_diversity.git
cd worldmap_fish_genetic_diversity
You're ready to run the analysis. Now follow the instructions in Running the pipeline.
For more information, we provide a detailed description of each script in SCRIPTS.md
- Keep only the CO1 sequences with lat/lon information
- input : seqbold_data.tsv
- output : co1_ssll_seqbold_data.tsv
bash 00-scripts/step1/filter_raw_data.sh
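As an illustration of the filtering rule above (keep CO1 records that carry lat/lon information), here is a minimal Python sketch. It is not the repository's `filter_raw_data.sh`. The column names `processid`, `markercode`, `lat` and `lon` and the marker label `COI-5P` are assumptions about the layout of the BOLD export and should be checked against the header of seqbold_data.tsv.

```python
import csv
import io


def filter_co1_georeferenced(rows, marker_col="markercode",
                             lat_col="lat", lon_col="lon", marker="COI-5P"):
    """Yield rows that carry the CO1 marker and parseable lat/lon values.

    Column names and the marker label are assumptions, not taken from the
    repository scripts.
    """
    for row in rows:
        if (row.get(marker_col) or "").strip() != marker:
            continue
        try:
            float(row[lat_col])
            float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            # Missing column, empty field or non-numeric coordinate: drop the row.
            continue
        yield row


# Tiny in-memory table standing in for the real TSV file (hypothetical records).
sample = (
    "processid\tmarkercode\tlat\tlon\n"
    "A1\tCOI-5P\t43.6\t3.9\n"
    "A2\tCOI-5P\t\t\n"
    "A3\t12S\t10.0\t20.0\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
kept = list(filter_co1_georeferenced(reader))
print([row["processid"] for row in kept])
```

Only the first record survives here: the second has no coordinates and the third is not a CO1 sequence.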
- Align sequences from the same species with MUSCLE and create coordinates .coord file for each sequence
- input : co1_ssll_seqbold_data.tsv
- output : .fasta and .coords files in 05-species_alnt
bash 00-scripts/step2/seq_alnt_filtered_data.sh
- According to a list of marine species, move fasta and coords files into marine or freshwater folder
mkdir 06-species_alnt_cluster/total
mkdir 06-species_alnt_cluster/freshwater
mkdir 06-species_alnt_cluster/marine
bash 00-scripts/step2/cluster_freshwater_vs_marine.sh
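The marine/freshwater split can be pictured as a simple set-membership test against the marine species list. The sketch below is an illustration only, not the logic of `cluster_freshwater_vs_marine.sh`. It assumes that any species absent from the marine list is treated as freshwater, and the species names are just examples.

```python
def classify_species(species_names, marine_species):
    """Map each species name to 'marine' or 'freshwater'.

    Assumption: a species missing from the marine list is freshwater.
    """
    marine = set(marine_species)
    return {
        name: "marine" if name in marine else "freshwater"
        for name in species_names
    }


# Example names only; the real list lives in marine_actinopterygii_species.txt.
marine_list = ["Thunnus albacares", "Gadus morhua"]
to_sort = ["Gadus morhua", "Salmo trutta"]
folders = classify_species(to_sort, marine_list)
print(folders)
```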
- Assign to each individual sequence, from its coordinates, the ID of the cell that contains it in the equal-area world map shapefile
- input : grid_equalarea200km, /06-species_alnt_cluster/marine, /06-species_alnt_cluster/freshwater .coords files
- output : /06-species_alnt_cluster/marine, /06-species_alnt_cluster/freshwater .equalareacoords files
Rscript 00-scripts/step2/equalareacoords.R
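To make the cell assignment concrete, the following Python sketch looks up a point in a table of cell bounding boxes, in the spirit of the left/right/top/bottom columns of equalarea_id_coords.tsv. It is a simplified stand-in, not a port of `equalareacoords.R`. It assumes rectangular cells, point coordinates expressed in the same reference system as the cell bounds, and half-open intervals at the cell edges. The cell values are made up.

```python
def find_cell(x, y, cells):
    """Return the ID of the first cell whose bounding box contains (x, y).

    `cells` is an iterable of (cell_id, left, right, top, bottom) tuples.
    Returns None when no cell contains the point.
    """
    for cell_id, left, right, top, bottom in cells:
        if left <= x < right and bottom <= y < top:
            return cell_id
    return None


# Two hypothetical adjacent cells.
cells = [
    (1, 0.0, 2.0, 2.0, 0.0),
    (2, 2.0, 4.0, 2.0, 0.0),
]
inside = find_cell(3.1, 1.5, cells)
outside = find_cell(9.0, 9.0, cells)
print(inside, outside)
```

A linear scan is enough for illustration; with a full world grid, an index on the cell bounds or a spatial join would be the more usual choice.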
- From the species sequence alignments and cell locations, generate pairwise comparison matrices of individual sequences for each species, both by cell and by latitudinal band.
- input : 06-species_alnt_cluster
- output : 07-master_matrices
julia 00-scripts/step3/master_matrices.jl
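For readers unfamiliar with pairwise sequence comparison, here is a small Python sketch of the idea: every pair of aligned sequences of one species is compared site by site. This is not a translation of `master_matrices.jl`. Skipping gapped sites and reporting the proportion of differing sites are assumptions made for the example, and the sequences are invented.

```python
from itertools import combinations


def pairwise_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        mismatches += a != b
    return mismatches / compared if compared else float("nan")


def pairwise_table(aligned):
    """Compare every unordered pair of sequences in a {name: sequence} dict."""
    return {
        (i, j): pairwise_distance(aligned[i], aligned[j])
        for i, j in combinations(sorted(aligned), 2)
    }


# Invented alignment of three individuals of one species.
alignment = {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "AC-TAC"}
table = pairwise_table(alignment)
for pair, distance in table.items():
    print(pair, round(distance, 4))
```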
- Assign a mean genetic diversity value to each cell and to each latitudinal band. For latitudinal bands, species with no genetic diversity and/or fewer than 3 individuals are filtered out. Standard deviations are estimated from 1000 bootstrap replicates.
- input : 07-master_matrices
- output : equalarea_numbers.csv
julia 00-scripts/step4/equalarea_numbers.jl
julia 00-scripts/step4/latband_numbers.jl
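The filtering and bootstrap described above can be sketched in a few lines of Python. This is a hedged illustration, not the method of `equalarea_numbers.jl` or `latband_numbers.jl`: it assumes the bootstrap resamples per-species diversity values with replacement and takes the standard deviation of the replicate means. The species records are hypothetical.

```python
import random
import statistics


def keep_species(n_individuals, diversity, min_individuals=3):
    """Keep a species only if it has enough individuals and non-zero diversity."""
    return n_individuals >= min_individuals and diversity > 0


def mean_and_bootstrap_sd(values, n_replicates=1000, seed=42):
    """Return the mean of `values` and the SD of bootstrap replicate means.

    Assumption: each replicate resamples the values with replacement.
    """
    rng = random.Random(seed)
    mean = statistics.fmean(values)
    boot_means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_replicates)
    ]
    return mean, statistics.stdev(boot_means)


# Hypothetical (species, number of individuals, genetic diversity) records.
species = [
    ("sp_a", 5, 0.004),
    ("sp_b", 2, 0.010),  # dropped: fewer than 3 individuals
    ("sp_c", 8, 0.0),    # dropped: no genetic diversity
    ("sp_d", 4, 0.002),
]
kept = [gd for _, n, gd in species if keep_species(n, gd)]
mean_gd, sd_gd = mean_and_bootstrap_sd(kept)
print(kept, mean_gd, sd_gd)
```

Fixing the seed makes the replicate means reproducible between runs, which is convenient when comparing outputs.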
- Generate a table of cell coordinates, mean genetic diversity, number of species, and mean/sd number of individuals per species in each cell.
- inputs : equalarea_numbers.csv, gdval_by_area.csv
- outputs : metrics_by_area_freshwater.csv, metrics_by_area_marine.csv
bash 00-scripts/step4/gdval_by_cell.sh
julia 00-scripts/step4/metrics_by_area_and_species.jl
To generate figures (or analysis), you can simply type the proposed commands on your terminal or alternatively open the R script and run the commands in R
- Generate a table with, for each cell: the (xy) coordinates of the cell center, cell ID, mean genetic diversity, number of species, mean/sd number of individuals per species, bathymetry, chlorophyll concentration, oxygen concentration, temperature, and drainage basin surface area
Rscript 00-scripts/step5/descripteurs.R
- Map of the global distribution of genetic diversity for marine species
Rscript 00-scripts/step5/figures/figure1.R
- Congruence between fish genetic and species diversity
Rscript 00-scripts/step5/figures/figure2.R
- Determinant of the patterns of fish genetic diversity
Rscript 00-scripts/step5/figures/figure3.R
- Wilcoxon test to assess whether genetic diversity means differ between marine and freshwater species.
Rscript 00-scripts/step5/analysis/wilcoxon_tests.R
- Sensitivity analysis
Rscript 00-scripts/step5/analysis/sensitive_analysis_model.R
- Sensitivity analysis based on taxonomic coverage
Rscript 00-scripts/step5/analysis/sensitive_analysis_covtax.R
- Spatial autocorrelogram based on Moran's I coefficient
Rscript 00-scripts/step5/supplementary_figures/figureS1.R
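As a reminder of what Moran's I measures, the sketch below computes it from a list of values and a spatial weight matrix, using the standard formula I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, where m is the mean and W the sum of all weights. It does not reproduce `figureS1.R`: a correlogram would repeat this computation for successive distance classes, and the weights used here (four sites on a line, neighbours weighted 1) are purely illustrative.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a square weight matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / total_weight) * (cross_products / variance_sum)


# Four sites on a line; adjacent sites are neighbours (illustrative weights).
values = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
moran = morans_i(values, weights)
print(round(moran, 4))
```

The smooth gradient along the line gives a positive value (1/3 here), meaning neighbouring sites tend to hold similar values.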
- Global distribution of higher and lower percentiles of genetic diversity
Rscript 00-scripts/step5/supplementary_figures/figureS2.R
- Latitudinal distribution of species diversity
Rscript 00-scripts/step5/supplementary_figures/figureS3.R
- Regional effect on the global genetic diversity pattern
- input : total_data_genetic_diversity_with_all_descripteurs.tsv
- output : figureS4.pdf
Rscript 00-scripts/step5/supplementary_figures/figureS4.R
- Sampling effect
Rscript 00-scripts/step5/supplementary_figures/figureS5.R
- Taxonomic coverage of the sequences used by the model
- input : watertype_all_modeles_effectives_family.csv
- output : figureS6.pdf
Rscript 00-scripts/step5/supplementary_figures/figureS6.R
- Spatial distribution of taxonomic coverage
Rscript 00-scripts/step5/supplementary_figures/figureS7.R
- Mean intraspecific genetic diversity in each 10° latitudinal band (not in the paper)
- inputs : freshwater_latbands_bootstraps.csv, marine_latbands_bootstraps.csv
- output : figureS8.pdf
Rscript 00-scripts/step5/supplementary_figures/figureS8.R
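Grouping records into 10° latitudinal bands amounts to rounding each latitude down to a multiple of 10. The Python sketch below shows one way to do it. The band edges (lower bound inclusive, the North Pole folded into the last band) are assumptions and may differ from the convention used in the pipeline scripts.

```python
import math


def latitudinal_band(lat, width=10):
    """Return the (lower, upper) bounds of the band containing `lat`."""
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90 degrees")
    lower = math.floor(lat / width) * width
    if lower >= 90:
        # Put latitude 90 itself in the last band instead of a band of its own.
        lower = 90 - width
    return lower, lower + width


bands = [latitudinal_band(lat) for lat in (43.6, -0.5, 90)]
print(bands)
```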
- Write a table of individual sequences with geographical cell localisation
python3 00-scripts/step6/sequences_table.py
- Check whether the watertype (marine|freshwater) of each species in each cell is correct according to the model (marine|freshwater)
- inputs : map_marine_sequences.csv, metrics_by_area_marine.csv, spatial_layers/
- output : wrong_freshwater_sequences.csv
Rscript 00-scripts/step6/check_freshwater_assignation.R
- Assign habitat information (demersal, pelagic...) to each individual sequence according to its attributed species name
- input : map_marine_sequences.csv
- output : sequences_withdemerpelag.csv
Rscript 00-scripts/step6/sequences_demerpelag.R
- Curate the habitat assignment and species name of each individual sequence
- input : sequences_withdemerpelag.csv
- output : cured_sequences_withdemerpelag.csv
Rscript 00-scripts/step6/sequences_cure_species_name.R
- Curate the family column by renaming each BOLD family to its equivalent in the NCBI taxonomy
bash 00-scripts/step6/rename_family_bold_to_ncbi.sh
- Write a table of number of species/number of sequences by taxonomic order/family used for each model
python3 00-scripts/step6/sequences_taxonomy.py