Update biobox_add_taxid wrapper #6344

Closed


SantaMcCloud
Contributor

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

@SantaMcCloud
Contributor Author

I changed the column select from data_column to integer, since you can have multiple files as input where all files share the same input column. With data_column this cannot work: my workflow generates a collection, which stops it there since it has no data as reference for the column, or at least it still throws an error if you try to use it this way.

@SantaMcCloud
Contributor Author

SantaMcCloud commented Sep 20, 2024

@bgruening could you merge this quickly so that it is in the bot update scope? I really need it this weekend!

@bgruening
Member

Can you please explain this more?

This change seems backwards and maybe there is a Galaxy bug to fix.
Don't worry about the updates, I can install tools during the week if urgent.

@SantaMcCloud
Contributor Author

Yes, I can explain it more.

So in this workflow https://usegalaxy.eu/u/santinof/w/gtdb-tk-subworkflow-1 it is possible that GTDB-Tk outputs 2 summary files. These 2 files then run through 2 other tools. The last one is Names2taxID, whose output is needed as input for this tool. The problem is that you have to set the column in which the names are stated in the Names2taxID output, but since there are 2 files, Galaxy cannot refer to a single file with the data_column type. So I changed it to an integer as a workaround: both files have the same format, so both share the same column that has to be stated.

Here is the error msg:

parameter 'column': Dataset 'None' for data_ref attribute 'taxonkit' of parameter 'column' is not a DatasetInstance

and here is a history as an example:
https://usegalaxy.eu/u/santinof/h/mag-benchmark-workflow-without-batcami-low-1

I hope this explains the change. If not, I can give more details!

@SantaMcCloud
Contributor Author

Okay, maybe this change doesn't need to be made. I find it strange that when I get only 1 summary file from GTDB-Tk, I end up with a list of lists instead of 2 files in a list. I only noticed this now because I let my workflow run until the error, so that I could work with the data manually to get some results. There was one history where this error did not appear, since GTDB-Tk yielded a list with 2 files and not a list of lists.

I will now try the workflow with the flatten tool to see whether that removes the error. I will then either close the PR or give more details here.

@SantaMcCloud
Contributor Author

The workflow still hit the error, this time in both runs:

https://usegalaxy.eu/u/santinof/h/mag-benchmark-workflow-without-batcami-low-3
https://usegalaxy.eu/u/santinof/h/mag-benchmark-workflow-without-batmarine-sample-0-3

Here is the history where the tool did work. The only difference was that the output from Names2taxID was not flattened before being passed to biobox add taxid.

https://usegalaxy.eu/u/santinof/h/mag-benchmark-workflow-without-batmarine-sample-0-2

@SantaMcCloud
Contributor Author

SantaMcCloud commented Sep 23, 2024

image
image

Surely the flatten tool should not run on a collection that has not been created yet?
Could this be the bug? Names2taxID will create a list of lists, and flatten should turn that into a plain list, but since it ran early it took over the list of lists... really strange.

In the linked history where the error did not appear, the flatten tool seems to have worked, but only there.

image

After Names2taxID had run, I tried flatten again, and there you can see it works, so I think there is a bug in Galaxy with flatten?

For my workflow I am trying a workaround: using a subworkflow to see whether flatten works there, since it is forced to wait for the result.

@paulzierep
Contributor

Hmm, I can only assume this, but the input in the history you provided is list(samples):list(summary files). I assume the tool therefore wants to get the column from the first level (which is a collection, not a file). Maybe we could just merge the summary files (one is for archaea and one for bacteria, right?) to overcome the difficult-to-handle collection structure?
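
To make the list(samples):list(summary files) structure concrete, here is a minimal Python sketch of what flattening such a nested collection amounts to. It is only an illustration: the identifiers and file names are made up, the dict model of a collection is hypothetical, and joining identifiers with `_` is an assumption rather than a statement about how Galaxy's flatten tool names its elements.

```python
def flatten(collection, sep="_", prefix=""):
    """Return a flat {identifier: dataset} mapping from a nested collection.

    `collection` is a plain dict standing in for a Galaxy collection:
    each value is either a dataset (here just a file name) or another dict.
    """
    flat = {}
    for identifier, element in collection.items():
        # Build the flattened identifier from the outer and inner names.
        name = f"{prefix}{sep}{identifier}" if prefix else identifier
        if isinstance(element, dict):
            # A sub-collection: descend one level and merge its elements.
            flat.update(flatten(element, sep=sep, prefix=name))
        else:
            flat[name] = element
    return flat


# Hypothetical list:list input, one inner list of summary files per sample.
nested = {
    "sample_0": {
        "bacteria": "sample_0.bacteria.summary.tsv",
        "archaea": "sample_0.archaea.summary.tsv",
    },
    "sample_1": {"bacteria": "sample_1.bacteria.summary.tsv"},
}

flat = flatten(nested)
for identifier, dataset in flat.items():
    print(identifier, dataset)
```

Note that a sample with only one summary file still sits inside an inner list before flattening, which matches the case described above where a single file is still wrapped in list:list.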

@paulzierep
Contributor

In general, I am wondering how the logic of multiple chosen inputs for taxonkit and data_column works, since data_column can only choose from one file. Maybe using an integer is a good workaround in this case.
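
A rough Python sketch of the integer idea: one 1-based column number is applied to every input file, which only works if all files share the same layout. The helper name `read_column` and the sample rows are hypothetical and only serve as illustration; this is not the wrapper's actual code.

```python
import csv
import io


def read_column(tsv_text, column):
    """Return the values of a 1-based column from tab-separated text."""
    if column < 1:
        raise ValueError("column is 1-based")
    values = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < column:
            # The file does not have the shared layout the integer assumes.
            raise ValueError(f"row has only {len(row)} columns, need {column}")
        values.append(row[column - 1])
    return values


# Two illustrative inputs with the same layout: bin, name, taxid.
files = {
    "bacteria": "bin_1\tEscherichia coli\t562\n",
    "archaea": "bin_2\tMethanobrevibacter smithii\t2173\n",
}

# The same integer is used for every file, no reference dataset is needed.
names = {key: read_column(text, column=2) for key, text in files.items()}
print(names)
```

The trade-off is that an integer cannot be validated against a dataset in the tool form, so a wrong column number only shows up when the job runs.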

@paulzierep
Contributor

@SantaMcCloud
Contributor Author

> Hmm, I can only assume this, but the input in the history you provided is list(samples):list(summary files). I assume the tool therefore wants to get the column from the first level (which is a collection, not a file). Maybe we could just merge the summary files (one is for archaea and one for bacteria, right?) to overcome the difficult-to-handle collection structure?

Correct. That is why I want to use the flatten tool to have all datasets on one level, but the explanation above shows that this tool runs without waiting for the needed outputs. Even if we merge them, the list:list situation will still yield this error. You can see this in the CAMI error workflow linked above: there Names2taxID has only 1 file as output, but it is still in the list:list datatype, which means it does not work.

@SantaMcCloud
Contributor Author

> In general, I am wondering how the logic of multiple chosen inputs for taxonkit and data_column works, since data_column can only choose from one file. Maybe using an integer is a good workaround in this case.

From what I tested, the data_column param type can still be used when you have multiple files. The only problem that can occur is that the files do not share a specific format, so the chosen column is not the same across all files.

The error is still shown when you try it manually, but Galaxy still runs the tool. You can see this in the non-error history (marine-sample-0-2) linked above: try to run biobox add taxid there to see the "error" message in the column parameter GUI.

@SantaMcCloud
Contributor Author

> Can you also explain why there can be multiple inputs here: https://github.com/galaxyproject/tools-iuc/blob/303002db06287fb25306020c4391626842f52162/tools/cami_amber/biobox_add_taxid.xml#L86C23-L86C115

Can you name the input that I should explain more? :)

@bernt-matthias
Contributor

So the main problem here is that you have nested lists. Is this expected or a potential problem of the tools running upstream in the workflow? I do not understand yet: does flattening the collection not help?

@SantaMcCloud
Contributor Author

> So the main problem here is that you have nested lists. Is this expected or a potential problem of the tools running upstream in the workflow? I do not understand yet: does flattening the collection not help?

Correct, and it is not expected, since I only want a list as input. I cannot explain how this happens, which is why I built in the flatten tool to eliminate the nested list.

Now to the real problem: it seems, as shown here,

image image

that flatten is executed right after the job is created. This does not follow the workflow logic, since it should have waited for Names2taxID to finish, as that is its input.

As a workaround, I am now splitting my subworkflow into 2 workflows, so that flattening happens in the second one and is (hopefully) forced to wait until all outputs from the first subworkflow are created.

I hope this helps in understanding the problem a bit better?

@bernt-matthias
Contributor

> Now to the real problem:

That might also be a problem, but I think your primary problem is that an upstream tool produces a nested list, and you/we need to understand why.

@SantaMcCloud
Contributor Author

SantaMcCloud commented Sep 23, 2024

Okay, now I know how the nested list is generated. It is caused by the batch mode of a different tool, which is expected, since GTDB-Tk can produce 2 files upstream.

Now I have a question, since I could not find one: is there a tool to merge 2 TSV files into one file, where the content is merged by row and not by column? That might solve the problem. The alternative is to change this tool so that the param is an integer and not data_column.

@bernt-matthias
Contributor

There are quite a few tools to concatenate files (one below the other), e.g. https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_cat%2F9.3%2Bgalaxy1&version=latest

For pasting (adding new columns) https://usegalaxy.eu/?tool_id=Paste1&version=latest
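
For reference, the row-wise merge asked about above boils down to something like the following Python sketch. It is not what any of the linked tools run internally; the function name `concat_tsv`, the header handling, and the sample rows are assumptions made for illustration. A plain concatenation would keep every header line, whereas this sketch keeps only the first one.

```python
def concat_tsv(texts, has_header=True):
    """Concatenate TSV contents one below the other.

    If `has_header` is true, the header line is kept from the first
    input only and dropped from all later inputs.
    """
    out = []
    for index, text in enumerate(texts):
        lines = text.splitlines()
        if has_header and index > 0:
            lines = lines[1:]  # skip the repeated header
        out.extend(lines)
    return "\n".join(out) + "\n"


# Illustrative stand-ins for a bacterial and an archaeal summary file.
bacteria = "user_genome\tclassification\nbin_1\td__Bacteria\n"
archaea = "user_genome\tclassification\nbin_2\td__Archaea\n"

merged = concat_tsv([bacteria, archaea])
print(merged, end="")
```

Merging by row like this yields a single dataset per sample, which avoids the inner list of summary files altogether.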

Will close this here. Feel free to reopen if you still think it's a bug. Otherwise we can continue the discussion on Gitter

or of course https://help.galaxyproject.org/

@bgruening
Member

@SantaMcCloud there was also a fix, maybe related, in Galaxy, so check out latest EU.
