feat!: latest development for new release (#133)

* chore: Update development (#128)
* docs: enhancing documentation
* docs: better quickstart
* chore: ubdate github actions to setup-micromamba
* docs: remove default channel from environment file
* docs: improvements, like QC report (#125)
* added .DS_Store to gitignore.
* Fixed the overflow of the features section by using the table.
* Fixed the broked report link.
* Sample QC report HTML file
* Added the link to the QC report in experiment.
* Added the assignment QC report.
* Add link to QC report in assignment documentation
* Update documentation in quickstart.rst. Fixed typos and gramatical mistakes.
* Update documentation in index.rst. Fix typos and grammatical mistakes.
* Fix typo in installation documentation
* Refactor documentation in config.rst
---------
Co-authored-by: Max <visze@users.noreply.github.com>
* docs: Fixed the link for the QC report in Experiment and Assignment (#126)
* added .DS_Store to gitignore.
* Fixed the overflow of the features section by using the table.
* Fixed the broked report link.
* fixed typo project
* Typo fix controlled
* Sample QC report HTML file
* Added the link to the QC report in experiment.
* Added the assignment QC report.
* Add link to QC report in assignment documentation
* Update documentation in quickstart.rst. Fixed typos and gramatical mistakes.
* Update documentation in index.rst. Fix typos and grammatical mistakes.
* Fix typo in installation documentation
* Refactor documentation in config.rst
* Update documentation links in assignment.rst and experiment.rst
* Testing the iframe html file.
* Update documentation links in assignment.rst and experiment.rst
---------
Co-authored-by: Max <visze@users.noreply.github.com>
* chore: delete not necessary files
* docs: automatic versioning
* style: automatic version printing of MPRAsnakeflow
* fix: memory resources for bbmap (#123)
* fix: add memory resources for bbmap
* set lower memm in bbmap workflow profile
* increasing memory for bmap
---------
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
Co-authored-by: Max Schubach <max.schubach@bihealth.de>
* fix: Detach from anaconda (#122)
* fix: detach from anaconda. Remove defaults conda channels
* fixing linting errors
* update hashes in dockerfile from lining errors
---------
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
* chore(master): release MPRAsnakeflow 0.1.1 (#124)
* chore(master): release MPRAsnakeflow 0.1.1
* Update .release-please-manifest.json
* Update version.txt
* Update CHANGELOG.md
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Max <visze@users.noreply.github.com>
* forgot to upgrade two envs
* docs: correct link in docs badge
---------
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
Co-authored-by: Ali <69039717+bioinformaticsguy@users.noreply.github.com>
Co-authored-by: Max Schubach <max.schubach@bihealth.de>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* feat!: igvf outputs (#129)
* refactor: removed statistics from final barcode to oligo map
* refactor outputs
* fix scripts due to renaming headers
* fix assignment statistic due to new output
* refactor!: moving files. not attched counts are not used as well as median for scaling
* adding logs
---------
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
* chore!: supporting only snakemake >=8.24.1 (#130)
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
* refactor!: No min max length for bbmap. default mapq is 30. (#131)
Changes for bbmap
* no min an max for sequence length and start. (like exact matching)
* using default of 30 mapq instead of 35
* feat!: outlier removal (#132)
* feat!: outlier detection
Might break older config files
* docs: update documentation for bbmap, apptainer and outlier removal
* use abs for zscore
* trying to fix outlier via zscore
* mad code change
* change outlier removal default to zscore
---------
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
* edit config
---------
Co-authored-by: Max Schubach <max.schubach@bih-charite.de>
Co-authored-by: Ali <69039717+bioinformaticsguy@users.noreply.github.com>
Co-authored-by: Max Schubach <max.schubach@bihealth.de>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
kircherlab · Nov 5, 2024 · bdfc557
1 parent b7f4cfd
commit bdfc557

Showing 33 changed files with 671 additions and 399 deletions.
diff --git a/.gitignore b/.gitignore
@@ -7,6 +7,9 @@ logs
 !config/*
 !resources
 !resources/**
+resources/**/.local
+resources/**/.cache
+resources/**/.ipython
 !workflow
 !workflow/**
 !.gitattributes
@@ -27,4 +30,4 @@ mix_data
 *report.html
 *.simg
 *results
-.DS_Store
+.DS_Store
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # Snakemake workflow: MPRAsnakeflow
 
 [![Documentation Status](https://readthedocs.org/projects/mprasnakeflow/badge/?version=latest)](https://mprasnakeflow.readthedocs.io/latest/?badge=latest)
-[![Snakemake](https://img.shields.io/badge/snakemake-≥7.2.1-brightgreen.svg)](https://snakemake.bitbucket.io)
+[![Snakemake](https://img.shields.io/badge/snakemake-≥8.24.1-brightgreen.svg)](https://snakemake.github.io/)
 [![Tests](https://github.com/kircherlab/MPRAsnakeflow/actions/workflows/main.yml/badge.svg)](https://github.com/kircherlab/MPRAsnakeflow/actions/workflows/main.yml)
 
 This pipeline processes sequencing data from Massively Parallel Reporter Assays (MPRA) to create count tables for candidate sequences tested in the experiment.
@@ -33,17 +33,17 @@ Create or adjust the `config/example_config.yaml` in the repository to your need
 
 ### Step 3: Install Snakemake
 
-Install Snakemake (recommended version >= 8.x) using [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) (recommended installation via [miniforge](https://github.com/conda-forge/miniforge)):
+Install Snakemake (version >= 8.24.1) using [conda >24.7.1](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) (recommended installation via [miniforge](https://github.com/conda-forge/miniforge)):
 
-    mamba create -c bioconda -n snakemake snakemake
+    conda create -c bioconda -n snakemake snakemake
 
 For installation details, see the [instructions in the Snakemake documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html).
 
 ### Step 4: Execute workflow
 
 Activate the conda environment:
 
-    mamba activate snakemake
+    conda activate snakemake
 
 Test your configuration by performing a dry-run via
 
@@ -58,9 +58,6 @@ using `$N` cores or run it in a cluster environment (here SLURM) via the [slurm
     snakemake --software-deployment-method conda --executor slurm --cores $N --configfile config.yaml --workflow-profile profiles/default
 
 Please note that `profiles/default/config.yaml` has to be adapted to your needs (like partition names).
-For snakemake 7.x this might work too using slurm sbatch (but depricated in newer snakemake versions:
-
-    snakemake --use-conda --configfile config.yaml --cluster "sbatch --nodes=1 --ntasks={cluster.threads} --mem={cluster.mem} -t {cluster.time} -p {cluster.queue} -o {cluster.output}" --jobs 100 --cluster-config config/sbatch.yaml
 
 
 Please note that the log folder of the cluster environment has to be generated first, e.g:
@@ -71,7 +68,7 @@ For other cluster environments please check the [Snakemake](https://snakemake.re
 
 If you not only want to fix the software stack but also the underlying OS, use
 
-    snakemake --sdm apptainer,conda --cores $N --configfile config.yaml --workflow-profile profiles/default
+    snakemake --sdm apptainer conda --cores $N --configfile config.yaml --workflow-profile profiles/default
 
 in combination with any of the modes above. This will use a pre-build singularity container of MPRAsnakeflow with the conda ens installed in.
 

diff --git a/config/example_assignment_bbmap.yaml b/config/example_assignment_bbmap.yaml
@@ -8,13 +8,9 @@ assignments:
     alignment_tool:
       tool: bbmap
       configs:
-        min_mapping_quality: 35 # integer >=0. 35 is default
-        sequence_length: # sequence length of design excluding adapters.
-          min: 166
-          max: 175
-        alignment_start: # start of an alignment in the reference/design_file. Here using 15 bp adapters. Can be different when using adapter free approaches
-          min: 1 # integer
-          max: 3 # integer
+        min_mapping_quality: 30 # 30 is default for bbmap
+        sequence_length: 171 # sequence length of design excluding adapters.
+        alignment_start: 1 # start of an alignment in the reference/design_file. Here using 15 bp adapters. Can be different when using adapter free approaches
     FW:
       - resources/Assignment_BasiC/R1.fastq.gz
     BC:
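
The bbmap hunk above replaces the `min`/`max` ranges for `sequence_length` and `alignment_start` with single values. A minimal Python sketch of how an older `configs` mapping could be rewritten into the new form follows. The helper name `migrate_bbmap_configs` is hypothetical and not part of MPRAsnakeflow, and the single values must be chosen by the user because the diff gives no rule for deriving them from the old range.

```python
from copy import deepcopy


def migrate_bbmap_configs(configs, sequence_length, alignment_start):
    """Return a copy of an old-style bbmap `configs` mapping in single-value form.

    Hypothetical helper for illustration only. The caller supplies the new
    single values; they are checked against the old min/max range if present.
    """
    new = deepcopy(configs)
    for key, value in (
        ("sequence_length", sequence_length),
        ("alignment_start", alignment_start),
    ):
        old = new.get(key)
        if isinstance(old, dict) and not old["min"] <= value <= old["max"]:
            raise ValueError(
                f"{key}={value} is outside the old range {old['min']}-{old['max']}"
            )
        new[key] = value
    return new


# Old-style values taken from the removed lines of the example config.
old = {
    "min_mapping_quality": 35,
    "sequence_length": {"min": 166, "max": 175},
    "alignment_start": {"min": 1, "max": 3},
}
new = migrate_bbmap_configs(old, sequence_length=171, alignment_start=1)
print(new)
```

An explicitly set `min_mapping_quality` is left untouched here; only configs that relied on the default are affected by the change from 35 to 30.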

diff --git a/config/example_config.yaml b/config/example_config.yaml
@@ -6,9 +6,9 @@ assignments:
   exampleAssignment: # name of an example assignment (can be any string)
     bc_length: 15
     alignment_tool:
-      tool: exact # bbbmap, bwa or exact
+      tool: exact # bbmap, bwa or exact
       configs:
-        sequence_length: 170 # sequence length of design excluding adapters.
+        sequence_length: 171 # sequence length of design excluding adapters.
         alignment_start: 1 # start of the alignment in the reference/design_file
     FW:
       - resources/assoc_basic/data/SRR10800986_1.fastq.gz

diff --git a/docs/assignment.rst b/docs/assignment.rst
@@ -73,7 +73,7 @@ Mandatory arguments:
   :\-\-configfile:
     Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first). (default: None)
   :\-\-sdm:             
-    **Required to run MPRAsnakeflow.** : :code:`--sdm conda` or :code:`--sdm apptainer` Uses the defined conda environment per rule. We highly recommend to use apptainer where we build a predefined docker container with all software installewd within it. :code:`--sdm conda` teh conda envs will be installed by the first excecution of the workflow. If this flag is not set, the conda/apptainer directive is ignored. (default: False)
+    **Required to run MPRAsnakeflow.** : :code:`--sdm conda` or :code:`--sdm apptainer conda` Uses the defined conda environment per rule. We highly recommend using apptainer, for which we build a predefined docker container with all software installed within it. With :code:`--sdm conda` the conda envs will be installed during the first execution of the workflow. If this flag is not set, the conda/apptainer directive is ignored. (default: False)
 Recommended arguments:
   :\-\-snakefile:             
     You should not need to specify this. By default, Snakemake will search for 'Snakefile', 'snakefile', 'workflow/Snakefile','workflow/snakefile' beneath the current working directory, in this order. Only if you definitely want a different layout, you need to use this parameter. This is very usefull when you want to have the results in a different folder than MPRAsnakeflow is in. (default: None)

diff --git a/docs/cluster.rst b/docs/cluster.rst
@@ -32,10 +32,10 @@ Using the slurm excecutor plugin running 300 jobs in parallel.
     snakemake --sdm conda --configfile config/config.yaml -j 300  --workflow-profile profiles/default --executor slurm
 
 
-Snakemake 7
------------
+Snakemake 7 (not supported anymore)
+-------------------------------------
 
-Here we used the :code:`--cluster` option which is not anymo,onger available in snakemake 8. You can also use the predefined `config/sbatch.yaml` but this might be outdated and we highly recommend to use resources with the workfloe profile. 
+Here we used the :code:`--cluster` option, which is not available in snakemake 8. You can also use the predefined `config/sbatch.yaml`, but this might be outdated and we highly recommend using resources with the workflow profile.
 
 .. code-block:: bash
 

diff --git a/docs/config.rst b/docs/config.rst
@@ -47,19 +47,19 @@ For each assignment you want to process you have to give him a name like :code:`
     Alignment tool configuration that is used to map the reads to the oligos.
 
     :tool:
-        Alignment tool that is used. Currently :code:`bwa` and :code:`exact` are supported.
+        Alignment tool that is used. Currently :code:`bbmap`, :code:`bwa` and :code:`exact` are supported. Default is :code:`bbmap`.
     :configs:
         Configurations of the alignment tool selected.
 
         :sequence_length (bwa):
             Defines the :code:`min` and :code:`max` of a :code:`sequence_length` specify. :code:`sequence_length` is basically the length of a sequence alignment to an oligo in the design file. Because there can be insertion and deletions we recommend to vary it a bit around the exact length (e.g. +-5). In theory, this option enables designs with multiple sequence lengths.
         :alignment_start (bwa):
             Defines the :code:`min` and :code:`max` of the start of the alignment in an oligo. When using adapters you have to set basically the length of the adapter. Otherwise, 1 will be the choice for most cases. We also recommend varying this value a bit because the start might not be exact after the adapter. E.g. by +-1.
-        :min_mapping_quality (bwa):
-            (Optional) Defines the minimum mapping quality (MAPQ) of the alignment to an oligo. When using oligos with only 1bp difference it is recommended to set it to 1. For regions only with larger edit distances 30 or 40 might be a good choice. Default :code:`1`. 
-        :sequence_length (exact):
+        :min_mapping_quality (bwa, bbmap):
+            (Optional) Defines the minimum mapping quality (MAPQ) of the alignment to an oligo. MAPQs differ between bbmap and bwa. For bwa: when using oligos with only 1 bp difference it is recommended to set it to 1. BBMap is better here, and for example 30 or 35 can be used. For regions with only larger edit distances, 30 or 40 might be a good choice. Default :code:`30` (use bbmap).
+        :sequence_length (exact, bbmap):
             Defines the :code:`sequence_length` which is the length of a sequence alignment to an oligo in the design file. Only one length design is supported.
-        :alignment_start (exact):
+        :alignment_start (exact, bbmap):
             Defines the start of the alignment in an oligo. When using adapters you have to set basically the length of the adapter. Otherwise, 1 will be the choice for most cases.
 
 :bc_length:
@@ -168,16 +168,22 @@ The experiment workflow is configured in the :code:`experiments` section. Each e
 
         :bc_threshold:
             Minimum number of different BCs required per oligo. A higher value normally increases the correlation betwene replicates but also reduces the number of final oligos. Default option is :code:`10`.
-        :DNA:
-            Settings for DNA
-
-            :min_counts:
-                Mimimum number of DNA counts per barcode. When set to :code:`0` a pseudo count is added. Default option is :code:`1`.
-        :RNA:
-            Settings for DNA
+        :min_dna_counts:
+            Minimum number of DNA counts per barcode. When set to :code:`0` a pseudo count is added. Default option is :code:`1`.
+        :min_rna_counts:
+            Minimum number of RNA counts per barcode. When set to :code:`0` a pseudo count is added. Default option is :code:`1`.
+        :outlier_detection:
+            (Optional) Outlier detection. Methods and strategies to remove outlier barcodes in the final counts. The following options are possible:
+
+            :method:
+                Method to remove outliers. Currently :code:`rna_counts_zscore`, :code:`ratio_mad` or :code:`none` (no outlier detection) are supported. Default option is :code:`rna_counts_zscore`.
+            :mad_bins:
+                (Optional) For method :code:`ratio_mad`:  Number of bins for the median absolute deviation (MAD) method. Default option is :code:`20`.
+            :times_mad:
+                (Optional) For method :code:`ratio_mad`:  Times the MAD to remove outliers. Default option is :code:`5`.
+            :times_zscore:
+                (Optional) For method :code:`rna_counts_zscore`: Times the zscore to remove outliers. Default option is :code:`3`.
 
-            :min_counts:
-                Mimimum number of RNA counts per barcode. When set to :code:`0` a pseudo count is added. Default option is :code:`1`.
     :sampling:
         (Optional) Options for sampling counts and barcodes. Just for debug reasons.
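
The `rna_counts_zscore` method documented above removes barcodes whose RNA count deviates too far from the mean. The following is a minimal Python sketch of that idea, not the workflow's actual code. It assumes the z-score is computed over the RNA counts of one group of barcodes, uses the population standard deviation, and compares the absolute z-score against `times_zscore`; the function name `zscore_keep_mask` is hypothetical.

```python
from statistics import mean, pstdev


def zscore_keep_mask(rna_counts, times_zscore=3.0):
    """Return one boolean per barcode: True means keep, False means outlier.

    Illustrative sketch only. A barcode is flagged when the absolute z-score
    of its RNA count exceeds `times_zscore` (documented default: 3).
    """
    mu = mean(rna_counts)
    sd = pstdev(rna_counts)
    if sd == 0:
        # All counts are identical, so nothing can be an outlier.
        return [True] * len(rna_counts)
    return [abs((count - mu) / sd) <= times_zscore for count in rna_counts]


# Eleven similar barcodes and one extreme RNA count.
counts = [10, 12, 11, 9, 13, 10, 11, 12, 9, 10, 11, 500]
mask = zscore_keep_mask(counts)
kept = [c for c, keep in zip(counts, mask) if keep]
print(kept)
```

The absolute value matters: without it only unusually high counts would be flagged, while unusually low ones would pass. The `ratio_mad` method with `mad_bins` and `times_mad` is not sketched here because the documentation above does not describe how the bins are formed.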
 

diff --git a/docs/experiment.rst b/docs/experiment.rst
@@ -88,7 +88,7 @@ Mandatory arguments:
   :\-\-configfile:
     Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first). (default: None)
   :\-\-sdm:             
-    **Required to run MPRAsnakeflow.** : :code:`--sdm conda` or :code:`--sdm apptainer` Uses the defined conda environment per rule. We highly recommend to use apptainer where we build a predefined docker container with all software installewd within it. :code:`--sdm conda` teh conda envs will be installed by the first excecution of the workflow. If this flag is not set, the conda/apptainer directive is ignored. (default: False)
+    **Required to run MPRAsnakeflow.** : :code:`--sdm conda` or :code:`--sdm apptainer conda` Uses the defined conda environment per rule. We highly recommend using apptainer, for which we build a predefined docker container with all software installed within it. With :code:`--sdm conda` the conda envs will be installed during the first execution of the workflow. If this flag is not set, the conda/apptainer directive is ignored. (default: False)
 Recommended arguments:
   :\-\-snakefile:             
     You should not need to specify this. By default, Snakemake will search for 'Snakefile', 'snakefile', 'workflow/Snakefile','workflow/snakefile' beneath the current working directory, in this order. Only if you definitely want a different layout, you need to use this parameter. This is very usefull when you want to have the results in a different folder than MPRAsnakeflow is in. (default: None)

diff --git a/docs/index.rst b/docs/index.rst
@@ -4,19 +4,19 @@
 MPRAsnakeflow's documentation
 ====================================
 
-.. image:: https://img.shields.io/badge/snakemake-≥7.7.1-brightgreen.svg
-    :target: https://snakemake.bitbucket.io
+.. image:: https://img.shields.io/badge/snakemake-≥8.24.1-brightgreen.svg
+    :target: https://snakemake.github.io/
 
-.. image:: https://img.shields.io/badge/mamba-≥4.6-brightgreen.svg
-    :target: https://docs.conda.io/en/latest/miniconda.html
+.. image:: https://img.shields.io/badge/conda->24.7.1-brightgreen.svg
+    :target: https://github.com/conda-forge/miniforge
 
 
 **Welcome!**
 
 MPRAsnakeflow pipeline processes sequencing data from Massively Parallel Reporter Assays (MPRAs)
 to create count tables for candidate sequences tested in the experiment.
 
-MPRAsnakeflow is built on top of `Snakemake <https://snakemake.readthedocs.io/>`_ (version 8 preferred) and is configured via a ``.yaml`` file.
+MPRAsnakeflow is built on top of `Snakemake <https://snakemake.readthedocs.io/>`_ (version ≥8.24.1 required) and is configured via a ``.yaml`` file.
 
 Authors
     Max Schubach (`@visze <https://github.com/visze>`_)
@@ -74,7 +74,7 @@ Features
    * - Option
      - Description
    * - ``--software-deployment-method``
-     - When ``conda`` is set, the utility uses mamba to efficiently query repositories and query package dependencies. MPRAsnakeflow also can use containers via apptainer by using ``--software-deployment-method apptainer``. Recommended option: ``--software-deployment-method conda apptainer``
+     - When ``conda`` is set, the utility uses conda to efficiently query repositories and query package dependencies. MPRAsnakeflow also can use containers via apptainer by using ``--software-deployment-method apptainer conda``. This will use a container to run all rules but inside it will activate the pre-installed conda environments. Recommended option: ``--software-deployment-method apptainer conda``
    * - ``--cores``
      - This utility sets the number of cores (``$N``) to be used by MPRAsnakeflow.
    * - ``--configfile``

diff --git a/docs/install.rst b/docs/install.rst
@@ -19,7 +19,7 @@ Package management
 
 .. code-block:: bash
 
-    conda (mamba) 4.6 or above
+    conda >24.7.1
 
 Download here: https://github.com/conda-forge/miniforge
 
@@ -36,7 +36,7 @@ Workflow language
 
 .. code-block:: bash
 
-    snakemake 8.16.0 or above (snakemake >=7.15.1 will also work but cli might be different as here documented)
+    snakemake 8.24.1 or above
 
 Download here: https://snakemake.readthedocs.io/
 
@@ -47,17 +47,17 @@ Clone repository
 Download here: https://github.com/kircherlab/MPRAsnakeflow.git
 
 
-Set up snakemake environment with conda/mamba
+Set up snakemake environment with conda
 =============================================
 
-This pipeline uses python2.7 and python3.6 with additional R scripts in a Snakemake pipeline. The ``.yml`` files provided will create the appropriate environments and is completely handled by MPRAsnakeflow. The whole pipeline is set up to run on a Linux system.
+This pipeline uses python2.7 and python ≥3.7 with additional R scripts in a Snakemake pipeline. The ``.yml`` files provided will create the appropriate environments and is completely handled by MPRAsnakeflow. The whole pipeline is set up to run on a Linux system.
 
 Install the the conda environment. The general conda environment is called ``snakemake``.
 
 .. code-block:: bash
 
     cd MPRAsnakeflow
-    mamba create -c conda-forge -c bioconda -n snakemake snakemake
+    conda create -c conda-forge -c bioconda -n snakemake snakemake
     
     # activate snakemake
     conda activate snakemake

diff --git a/resources/assoc_basic/config.yml b/resources/assoc_basic/config.yml
@@ -8,12 +8,8 @@ assignments:
     alignment_tool:
       tool: bbmap
       configs:
-        sequence_length:
-          min: 166
-          max: 175
-        alignment_start:
-          min: 1
-          max: 3
+        sequence_length: 171
+        alignment_start: 1
     FW:
       - data/SRR10800986_1.fastq.gz
     BC:

diff --git a/resources/combined_basic/config.yml b/resources/combined_basic/config.yml
@@ -8,12 +8,8 @@ assignments:
     alignment_tool:
       tool: bbmap
       configs:
-        sequence_length:
-          min: 166
-          max: 175
-        alignment_start:
-          min: 1
-          max: 3
+        sequence_length: 171
+        alignment_start: 1
     FW:
       - data/SRR10800986_1.fastq.gz
     BC:

diff --git a/resources/count_basic/config.yml b/resources/count_basic/config.yml
@@ -13,3 +13,11 @@ experiments:
     design_file: design.fa
     configs:
       default: {}
+      outlierNone:
+        filter:
+          outlier_detection:
+            method: none
+      outlierZscore:
+        filter:
+          outlier_detection:
+            method: rna_counts_zscore
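
The test configs above set only `method` and rely on defaults for everything else. A minimal Python sketch of how such an entry could be combined with the defaults listed in `docs/config.rst` follows. The function name `resolve_outlier_detection` and the merge logic are assumptions for illustration; how MPRAsnakeflow itself fills in defaults is not shown in this diff.

```python
# Defaults and allowed methods as documented in docs/config.rst in this commit.
DEFAULTS = {
    "method": "rna_counts_zscore",
    "mad_bins": 20,
    "times_mad": 5,
    "times_zscore": 3,
}
ALLOWED_METHODS = {"rna_counts_zscore", "ratio_mad", "none"}


def resolve_outlier_detection(config_entry):
    """Merge one experiment config entry with the documented defaults.

    Hypothetical helper: `config_entry` is the mapping found under a config
    name such as `default` or `outlierNone`, already parsed from YAML.
    """
    user = (config_entry or {}).get("filter", {}).get("outlier_detection", {})
    resolved = {**DEFAULTS, **user}
    if resolved["method"] not in ALLOWED_METHODS:
        raise ValueError(f"unknown outlier method: {resolved['method']!r}")
    return resolved


# Dictionaries mirroring the three configs in resources/count_basic/config.yml.
configs = {
    "default": {},
    "outlierNone": {"filter": {"outlier_detection": {"method": "none"}}},
    "outlierZscore": {"filter": {"outlier_detection": {"method": "rna_counts_zscore"}}},
}
for name, entry in configs.items():
    print(name, resolve_outlier_detection(entry))
```

In YAML, lowercase `none` is parsed as the string `"none"` rather than a null value, which is why it can be compared as a plain string here.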