Releases: iquasere/UPIMAPI
Simplified database download
When inputting a database, there are three options:
- input one of three reserved values: uniprot, swissprot or taxids
- input a FASTA database
- input a DIAMOND formatted database
UPIMAPI first checks whether a DIAMOND-formatted version of the database exists and, if it finds one, runs annotation with it. It looks for:
- in the --resources-directory folder, either uniprot.dmnd, uniprot_sprot.dmnd or taxids_database.dmnd
- the database filename with its extension replaced with .dmnd
- the database filename itself
If that doesn't exist, UPIMAPI searches for the FASTA format and, if it finds it, converts it to DIAMOND format. It looks for:
- in the --resources-directory folder, either uniprot.fasta, uniprot_sprot.fasta or taxids_database.fasta
- the database filename itself
If neither format is found, UPIMAPI exits with a file not found error.
This removes the need to tinker with the --skip-db-check parameter, but more trust is placed on the user.
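The lookup order described above can be sketched roughly as follows. This is an illustration, not UPIMAPI's actual code: the function name resolve_database and the RESERVED mapping are hypothetical, and treating the database filename itself as DIAMOND-formatted only when it ends in .dmnd is an assumption made so the FASTA fallback stays reachable.

```python
from pathlib import Path
from typing import Tuple

# Reserved values mapped to the base filenames named in the notes above.
RESERVED = {
    "uniprot": "uniprot",
    "swissprot": "uniprot_sprot",
    "taxids": "taxids_database",
}


def resolve_database(database: str, resources_directory: str) -> Tuple[str, Path]:
    """Return ("dmnd", path) or ("fasta", path); raise if nothing is found."""
    if database in RESERVED:
        base = Path(resources_directory) / RESERVED[database]
        dmnd_candidates = [Path(f"{base}.dmnd")]
        fasta_candidates = [Path(f"{base}.fasta")]
    else:
        path = Path(database)
        # Filename with its extension replaced with .dmnd
        dmnd_candidates = [path.with_suffix(".dmnd")]
        # Assumption: the file itself counts as DIAMOND only if it ends in .dmnd
        if path.suffix == ".dmnd":
            dmnd_candidates.append(path)
        fasta_candidates = [path]

    # 1. Prefer an existing DIAMOND database.
    for candidate in dmnd_candidates:
        if candidate.is_file():
            return "dmnd", candidate
    # 2. Fall back to FASTA, which the caller would convert to DIAMOND format.
    for candidate in fasta_candidates:
        if candidate.is_file():
            return "fasta", candidate
    # 3. Nothing found.
    raise FileNotFoundError(f"Database not found: {database}")
```

A caller would run annotation directly on a "dmnd" result and build the DIAMOND database first on a "fasta" result.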
Sanitization of mapping columns
Invalid columns can no longer be input.
UPIMAPI now reports an error and exits with a non-zero code.
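A minimal sketch of this kind of check, assuming the list of valid columns is already known. The function name check_columns and the example column names are illustrative, not taken from UPIMAPI's code.

```python
import sys


def check_columns(requested, available):
    """Return the requested columns, or exit with an error if any is unknown."""
    invalid = [column for column in requested if column not in available]
    if invalid:
        # sys.exit with a string prints it to stderr and exits with code 1.
        sys.exit(f"Invalid columns: {', '.join(invalid)}")
    return requested
```

Exiting with a non-zero code lets shell scripts and workflow managers detect the failure instead of continuing with incomplete output.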
New command for showing available fields
upimapi --show available-fields prints the columns available for ID mapping, properly capitalized and extracted directly from the return fields page.
Fixed parsing of custom inputted "-cols"
The fix concerns the handling of the columns Organism, Organism (ID), Taxonomic lineage and Taxonomic lineage IDs when some of the Taxonomic lineage (LEVEL) or Taxonomic lineage IDs (LEVEL) columns are specified.
UPIMAPI now properly adds and discards columns throughout its execution, obeying the respective conditions.
Also, UPIMAPI now detects if the input is in a compressed format, i.e., if an input file is specified and its name ends with .zip, .tar, .gz or .bz2, UPIMAPI stops executing and exits.
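The extension check can be sketched in a few lines. The helper name is_compressed is hypothetical, and the check here is case-sensitive, which may differ from UPIMAPI's behavior.

```python
# Extensions listed in the release note above.
COMPRESSED_SUFFIXES = (".zip", ".tar", ".gz", ".bz2")


def is_compressed(filename: str) -> bool:
    """Return True if the filename ends with one of the compressed extensions."""
    # str.endswith accepts a tuple of suffixes.
    return filename.endswith(COMPRESSED_SUFFIXES)
```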
Fixed handling taxonomic columns
Columns were not being parsed correctly. Repeated columns were being output, i.e., Taxonomic lineage (SPECIES) and Taxonomic lineage IDs (SPECIES).
Also simplified the repo structure extensively, putting all into the cicd folder.
Sorted the input of taxonomic columns
Specifying taxonomic columns (e.g., Taxonomic lineage (SPECIES), Taxonomic lineage IDs (SUPERKINGDOM)) was always outputting the columns Taxonomic lineage and Taxonomic lineage (Ids).
These columns are no longer outputted if not called for.
Also, several fixes
Fixed taxonomy being output with an extra space (e.g. Bacteria with an extra space -> Bacteria).
Fixed the case where no additional IDs are mapped, which was throwing an error.
Fixed the case where no columns are input.
Fixed getting fasta - request was badly formatted.
From/To ID mapping implemented
Implemented the ID mapping available at https://www.uniprot.org/id-mapping, triggered when "From database" and "To database" differ from the default values, "UniProtKB AC/ID" and "UniProtKB".
Two new parameters: --from-db and --to-db. Possible values for these can be found at https://rest.uniprot.org/configure/idmapping/fields
They can also be checked by passing a wrong value to the parameter, in which case the possible options are shown.
UPIMAPI will end execution after performing this new ID mapping. It can't be combined with the ID mapping that obtains columns of information from UniProt.
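The behavior of the two parameters can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the set of allowed values would come from the UniProt configuration endpoint rather than being hard-coded, and treating the mapping as triggered when either value differs from its default is one reading of the note above.

```python
# Default values named in the release note.
DEFAULT_FROM_DB = "UniProtKB AC/ID"
DEFAULT_TO_DB = "UniProtKB"


def validate_db(value, allowed, parameter):
    """Return value if allowed; otherwise raise an error listing the options."""
    if value not in allowed:
        options = ", ".join(sorted(allowed))
        raise ValueError(
            f"Invalid value '{value}' for {parameter}. Possible options: {options}"
        )
    return value


def uses_from_to_mapping(from_db=DEFAULT_FROM_DB, to_db=DEFAULT_TO_DB):
    """Assumption: From/To mapping runs when either value is not the default."""
    return (from_db, to_db) != (DEFAULT_FROM_DB, DEFAULT_TO_DB)
```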
Re-added pyyaml as a dependency, as api_info is now obtained again, and used directly.
Columns outputted in order of input
Columns were being output in arbitrary order because of set operations in the UPIMAPI code.
Columns are now output in the order in which the user specifies them.
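The underlying issue is that Python sets do not preserve insertion order, while dicts do. One common way to remove duplicates and keep the user's order is shown below. This is a generic idiom, not necessarily the fix applied in UPIMAPI, and the column names are only examples.

```python
def unique_in_order(columns):
    """Remove duplicates while keeping the first occurrence of each column."""
    # dict keys keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(columns))


print(unique_in_order(["Entry", "Organism", "Entry", "Length"]))
# prints ['Entry', 'Organism', 'Length']
```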
Fix on default memory
When memory is set with --max-memory, UPIMAPI assumes it comes as Gb.
The default in UPIMAPI (when not explicitly set) was checking for available memory, which comes in bytes. This led to memory values that were too large, which led to block-size values that were too small, and the reference database would be split into too many blocks. Then, UPIMAPI/DIAMOND would take forever.
Now, UPIMAPI converts the default memory to Gb before determining block-size and number-of-chunks.
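The unit mismatch is easy to see in a small sketch. The helper name bytes_to_gb is hypothetical, and using 1024**3 bytes per Gb (rather than 10**9) is an assumption.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def bytes_to_gb(n_bytes: int) -> float:
    """Convert a memory amount in bytes to Gb."""
    return n_bytes / BYTES_PER_GB


available_bytes = 16 * 1024 ** 3  # e.g. a machine reporting 16 Gb free
# Read as Gb without conversion, this would be 17179869184 "Gb".
print(bytes_to_gb(available_bytes))
# prints 16.0
```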
Important and nice options for homology search
Added control over DIAMOND search
--diamond-mode accepts six options (in order of decreasing search speed and increasing sensitivity): fast, mid_sensitive, sensitive, more_sensitive, very_sensitive and ultra_sensitive.
Choosing a faster mode helps to dramatically decrease search times, and also reduces memory usage and apparently disk usage as well (no idea why this last one).
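DIAMOND exposes these sensitivity modes as the command-line flags --fast, --mid-sensitive, --sensitive, --more-sensitive, --very-sensitive and --ultra-sensitive. A possible translation from the --diamond-mode value to the DIAMOND flag is sketched below. The function name diamond_flag is hypothetical, and UPIMAPI may build its command differently.

```python
# Modes accepted by --diamond-mode, from fastest to most sensitive.
DIAMOND_MODES = (
    "fast",
    "mid_sensitive",
    "sensitive",
    "more_sensitive",
    "very_sensitive",
    "ultra_sensitive",
)


def diamond_flag(mode: str) -> str:
    """Translate a --diamond-mode value into the matching DIAMOND flag."""
    if mode not in DIAMOND_MODES:
        raise ValueError(f"Unknown mode '{mode}'. Options: {', '.join(DIAMOND_MODES)}")
    # DIAMOND flags use hyphens where the mode names use underscores.
    return "--" + mode.replace("_", "-")
```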
Added parameter for max memory
Set with --max-memory, read as float in Gb.
Allows DIAMOND parameters b and c to be calculated automatically.
Also two small bug fixes
Fixed the case where the database was input with --skip-db-check and as a FASTA file - UPIMAPI would input the FASTA database directly to DIAMOND.
Fixed outputting days as float. Days don't float.
Added selection of mirror to download UniProt from
New parameter --mirror to determine where to download UniProt from. It allows the following options:
- expasy: https://ftp.expasy.org
- uniprot: https://ftp.uniprot.org/pub
- ebi: https://ftp.ebi.ac.uk/pub
SwissProt and TrEMBL are downloaded from the selected mirror. More information at https://www.uniprot.org/help/downloads
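The mirror selection amounts to a lookup from option name to base URL, which can be sketched as follows. The names MIRRORS and mirror_base_url are hypothetical; the URLs are the ones listed above, and the paths to the actual SwissProt and TrEMBL files under each base are not shown here.

```python
# Base URLs for each --mirror option, as listed in the release note.
MIRRORS = {
    "expasy": "https://ftp.expasy.org",
    "uniprot": "https://ftp.uniprot.org/pub",
    "ebi": "https://ftp.ebi.ac.uk/pub",
}


def mirror_base_url(mirror: str) -> str:
    """Return the base URL for a mirror name, or raise listing valid options."""
    try:
        return MIRRORS[mirror]
    except KeyError:
        options = ", ".join(MIRRORS)
        raise ValueError(f"Unknown mirror '{mirror}'. Options: {options}") from None
```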