add checkm2 #6542

astrovsky01 · 2024-11-08T16:48:48Z

FOR CONTRIBUTOR:

I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
License permits unrestricted use (educational + commercial)
This PR adds a new tool or tool collection
This PR updates an existing tool or tool collection
This PR does something else (explain below)

bgruening · 2024-11-09T09:44:22Z

tools/checkm2/.shed.yml

+description: Rapid assessment of genome bin quality using machine learning
+long_description: Enhanced version of checkm, using machine learning models for greater speed and accuracy
+homepage_url: https://github.com/chklovski/CheckM2
+remote_repository_url: https://github.com/galaxyproject/tools-iuc/


This could be more precise pointing to the folder.

bgruening · 2024-11-09T09:45:18Z

tools/checkm2/checkm2.xml

+    <command detect_errors="exit_code"><![CDATA[
+    mkdir input_dir &&
+    #for $i, $file in enumerate($input):
+        cp $file input_dir/${file.element_identifier}.dat &&


Please use single quotes around $file.

Can we symlink instead of copying?

The element_identifier needs cleaning (e.g. using re.sub).

bgruening · 2024-11-09T14:07:32Z

tools/checkm2/checkm2.xml

+            <when value="no"/>
+            <when value="yes">
+                <!-- It's not all numbers and there's a check internally if it's in a specific list, so it had to be spelled out -->
+                <param argument="ttable" type="select" label="Prodigal table">


It would be useful to tell the user what those numbers mean.

Maybe use the code table names for the text https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
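
One way to get descriptive option labels is to generate them from a number-to-name mapping. The Python sketch below is only an illustration: the three names are taken from the NCBI genetic code list linked above, the mapping is deliberately incomplete, and `option_line` is a hypothetical helper that is not part of the wrapper. Which table values the tool actually accepts still has to be checked against its internal list.

```python
from xml.sax.saxutils import escape

# Illustrative subset of NCBI translation table names (not the full list).
GENETIC_CODES = {
    1: "The Standard Code",
    4: "The Mold, Protozoan, and Coelenterate Mitochondrial Code "
       "and the Mycoplasma/Spiroplasma Code",
    11: "The Bacterial, Archaeal and Plant Plastid Code",
}


def option_line(table_id: int, name: str) -> str:
    """Render one <option> element for a select param (hypothetical helper)."""
    return f'<option value="{table_id}">{table_id}: {escape(name)}</option>'


# One line per table, ordered by table number.
option_lines = [option_line(i, n) for i, n in sorted(GENETIC_CODES.items())]

if __name__ == "__main__":
    print("\n".join(option_lines))
```

The generated lines could then be pasted into the select in place of the bare numbers.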

bgruening · 2024-11-09T14:08:29Z

tools/checkm2/checkm2.xml

+    #end if
+    -x .dat
+    --threads "\${GALAXY_SLOTS:-1}"
+    --database_path "\${CHECKM2_DB_PATH:-$__tool_directory__/tool-data/CheckM2_database/uniref100.KO.1.dmnd}"


Is this database stable, or will it change over time? If the databases are updated over time, we need a location file.

bernt-matthias

Excellent timing: One of my users just asked for the tool :)

I could contribute a data manager.

bernt-matthias · 2024-11-15T12:58:29Z

tools/checkm2/checkm2.xml

+        <token name="@IDX_DATA_TABLE@">checkm2_db_versioned</token>
+    </macros>
+    <xrefs>
+        <xref type="bio.tools">dada2</xref>


bernt-matthias · 2024-11-15T13:00:22Z

tools/checkm2/checkm2.xml

+    <command detect_errors="exit_code"><![CDATA[
+    mkdir input_dir &&
+    #for $i, $file in enumerate($input):
+        cp $file input_dir/${file.element_identifier}.dat &&


Can we symlink instead of copying?

The element_identifier needs cleaning (e.g. using re.sub).
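
To make the cleaning step concrete, here is a minimal sketch in plain Python. It uses the same character class as the `re.sub` call that appears in the later revision of this PR (keep whitespace, word characters, `-` and `.`, replace everything else with `_`). The function name `clean_identifier` and the sample identifiers are made up for illustration.

```python
import re

# Same character class as the later revision of the wrapper:
# anything that is not whitespace, a word character, '-' or '.' is replaced.
UNSAFE = re.compile(r"[^\s\w\-.]")


def clean_identifier(identifier) -> str:
    """Replace characters that are awkward in file names with '_'."""
    return UNSAFE.sub("_", str(identifier))


if __name__ == "__main__":
    for raw in ("bin 1/contig(a).fa", "sample-01.fasta", "a;b|c"):
        print(repr(raw), "->", repr(clean_identifier(raw)))
```

Note that whitespace is kept by this pattern, so quoting on the command line still matters for names that contain spaces.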

bernt-matthias · 2024-11-15T13:04:15Z

tools/checkm2/checkm2.xml

+            <option value="--specific">Force the use of the specific quality prediction model (neural network)</option>
+            <option value="--allmodels">Output quality prediction for both models for each genome.</option>
+        </param>
+        <conditional name="ttable_manual">


Do not use a conditional. An optional select should be used instead.

bernt-matthias · 2024-11-15T13:05:03Z

tools/checkm2/checkm2.xml

+            <when value="no"/>
+            <when value="yes">
+                <!-- It's not all numbers and there's a check internally if it's in a specific list, so it had to be spelled out -->
+                <param argument="ttable" type="select" label="Prodigal table">


Maybe use the code table names for the text https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

bernt-matthias · 2024-11-15T13:05:49Z

tools/checkm2/checkm2.xml

+    #end if
+    -x .dat
+    --threads "\${GALAXY_SLOTS:-1}"
+    --database_path $database.fields.path


This should also be single-quoted.
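
A quick way to see why the quoting matters is to tokenize the argument the way a POSIX shell would. The sketch below uses Python's `shlex` for that. The path is hypothetical and only chosen because it contains a space.

```python
import shlex

# Hypothetical database path containing a space.
db_path = "/data/db dir/uniref100.KO.1.dmnd"

# Without quotes the path is split into two separate arguments.
unquoted_args = shlex.split(f"--database_path {db_path}")

# With single quotes the path stays one argument.
quoted_args = shlex.split(f"--database_path '{db_path}'")

if __name__ == "__main__":
    print(unquoted_args)
    print(quoted_args)
```

In the unquoted case the tool would receive a truncated path plus a stray extra argument.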

bernt-matthias · 2024-11-15T13:12:28Z

tools/checkm2/checkm2.xml

+    <inputs>
+        <param name="input" type="data" format="fasta" label="Input MAG/SAG datasets" multiple="true"/>
+
+        <param name="database" type="select" label="Select reference genome" help="Checkm2 Diamond database">


Would it be of interest to let users upload their own dmnd databases (https://github.com/galaxyproject/tools-iuc/blob/main/tools/diamond/diamond_makedb.xml)?

Based on the wording on their repo, I was under the impression you needed to use their specific diamond db?

bernt-matthias · 2024-11-15T13:13:39Z

tools/checkm2/checkm2.xml

+    <outputs>
+        <data name="quality" label="${tool.name} on ${on_string}: Quality report" format="tabular" from_work_dir="output/quality_report.tsv"/>
+        <collection name="protein_files" label="${tool.name} on ${on_string}: protein files" type="list">
+            <discover_datasets pattern="__name__" format="fasta" directory="output/protein_files"/>


The extension of the files will be part of the element identifiers. Should we remove them?
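
If we do want to drop the extension: as far as I know, `discover_datasets` patterns are Python regular expressions with named groups, so a custom pattern can be tried out in plain Python first. The sketch below assumes the protein files end in `.faa`, which would need to be confirmed against the actual output. The pattern and the helper `element_identifier` are hypothetical.

```python
import re

# Hypothetical discovery pattern: everything before a trailing ".faa"
# becomes the element identifier (named group "designation").
PATTERN = re.compile(r"(?P<designation>.+)\.faa$")


def element_identifier(filename: str):
    """Return the identifier the pattern would extract, or None if no match."""
    match = PATTERN.match(filename)
    return match.group("designation") if match else None


if __name__ == "__main__":
    for name in ("bin_1.faa", "bin.1.faa", "notes.txt"):
        print(name, "->", element_identifier(name))
```

Files that do not match the pattern would simply not be collected, so the assumed extension should be checked before switching away from `__name__`.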

Is there a difference between ext and format (see below)?


bgruening · 2024-11-16T11:03:21Z

tools/checkm2/checkm2.xml

+    mkdir input_dir &&
+    #for $i, $file in enumerate($input):
+        #set $cleaned =  re.sub('[^\s\w\-\\.]', '_', str($file.element_identifier))
+        ln -s $file input_dir/${cleaned}.dat &&


Suggested change

-        ln -s $file input_dir/${cleaned}.dat &&
+        ln -s '$file' input_dir/${cleaned}.dat &&

bgruening · 2024-11-16T11:04:46Z

tools/checkm2/tool-data/checkm2.loc.sample

+#The <version> column indicates the checkm2 version that generated the database
+
+#
+#diamond_db_1.0.2	Diamond database	/mnt/galaxyIndices/Checkm2_database/uniref100.KO.1.dmnd	1.0.2


Maybe put the version before the path? I guess this is what we do in other location files.
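
For illustration, this is how the sample entry would read with the version moved in front of the path. The sketch parses a tab-separated `.loc` line in plain Python. The column names in `COLUMNS` and the helper `parse_loc` are assumptions for this example, not the names used in the data table definition.

```python
# Hypothetical column order with the version placed before the path.
COLUMNS = ("value", "name", "version", "path")


def parse_loc(text: str):
    """Parse tab-separated .loc content, skipping blank and comment lines."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        entries.append(dict(zip(COLUMNS, line.split("\t"))))
    return entries


# Sample entry from the .loc.sample file, reordered.
SAMPLE = (
    "#value\tname\tversion\tpath\n"
    "diamond_db_1.0.2\tDiamond database\t1.0.2\t"
    "/mnt/galaxyIndices/Checkm2_database/uniref100.KO.1.dmnd\n"
)

entries = parse_loc(SAMPLE)

if __name__ == "__main__":
    print(entries)
```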

bgruening · 2024-11-16T11:05:46Z

tools/checkm2/checkm2.xml

+        <collection name="protein_files" label="${tool.name} on ${on_string}: protein files" type="list">
+            <discover_datasets pattern="__name__" format="fasta" directory="output/protein_files"/>
+        </collection>
+        <collection name="diamond_files" label="${tool.name} on ${on_string}: Diamond files" type="list">


Should we add here that this is of type tabular?

bgruening · 2024-11-16T11:06:35Z

tools/checkm2/checkm2.xml

+    <outputs>
+        <data name="quality" label="${tool.name} on ${on_string}: Quality report" format="tabular" from_work_dir="output/quality_report.tsv"/>
+        <collection name="protein_files" label="${tool.name} on ${on_string}: protein files" type="list">
+            <discover_datasets pattern="__name__" format="fasta" directory="output/protein_files"/>


Is there a difference between ext and format (see below)?

Alexander OSTROVSKY added 6 commits November 8, 2024 11:46

add checkm2

cf9551f

typo

15149f8

add fail state because db can't be on github

bde6c07

fix error codes

d96651c

fix

f3ad7c5

fix space

106d6bf

bgruening reviewed Nov 9, 2024


add database

31f692d

astrovsky01 marked this pull request as draft November 13, 2024 00:12

bernt-matthias reviewed Nov 15, 2024


Alexander OSTROVSKY and others added 4 commits November 15, 2024 10:27

bernt-matthias comments

857f362

lint and test fix

7862720

data table tweaks

1942fb2

remove dbkey column rename tables

Merge branch 'checkm2' of https://github.com/astrovsky01/tools-iuc into checkm2

abf47ab

astrovsky01 marked this pull request as ready for review November 15, 2024 20:15

add re import

28e5048

bgruening reviewed Nov 16, 2024

