The “fake” database seems to be incomplete #96

masikol · 2023-02-09T09:09:53Z

Hello!

It seems that the “fake” ref_genome_database.tgz archive is incomplete: it lacks the 23S rRNA sequence files.

And here is why I think so.
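I’m trying to build a custom database for paprica, so I followed the tutorial (https://www.polarmicrobes.org/building-the-paprica-database/): I downloaded the “fake” mini-database (http://www.polarmicrobes.org/extras/ref_genome_database.tgz), untarred it, and ran paprica-make_ref.py (I use paprica v0.7.2):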

./paprica-make_ref.py \
    -ref_dir ref_genome_database \
    -download test \
    -domain bacteria \
    -cpus 4

And here is the output:

Checking for reference database directories, will create if necessary...
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Checking files for accession /mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_000007485.1
# [... checking more files, skip it ...]
Checking files for accession /mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_001314225.1
[Parallel(n_jobs=-1)]: Done   4 out of  10 | elapsed:    1.0s remaining:    1.5s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    1.0s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   0 out of   0 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   0 out of   0 | elapsed:    0.0s finished
Traceback (most recent call last):
  File "./paprica-make_ref.py", line 736, in <module>
    for record in SeqIO.parse(ref_dir_domain + 'refseq/' + d + '/' + d + '.23S.fasta', 'fasta'):
  File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 607, in parse
    return iterator_generator(handle)
  File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 183, in __init__
    super().__init__(source, mode="t", fmt="Fasta")
  File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 47, in __init__
    self.stream = open(source, "r" + mode)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_000007485.1/GCF_000007485.1.23S.fasta'

I see in paprica-make_ref.py:check_directory that the script checks if a file *23S.fasta exists (line 387). And if some required files don’t exist, the whole directory gets removed (lines 399-400). Indeed, after I run paprica-make_ref.py, all GCF_* directories in ref_genome_database/bacteria/refseq/ become empty.

And indeed, in a freshly downloaded ref_genome_database/bacteria/refseq/, none of the GCF_* directories contains a file ending with 23S.fasta.
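
One way to check this on a freshly unpacked archive is to list the accession directories that have no file ending with 23S.fasta. The snippet below is a minimal sketch and not part of paprica: the function name accessions_missing_suffix is hypothetical, and the path assumes the ref_genome_database/bacteria/refseq/ layout used in the command above.

```python
from pathlib import Path


def accessions_missing_suffix(refseq_dir, suffix="23S.fasta"):
    """Return names of GCF_* subdirectories that hold no file ending in `suffix`.

    Hypothetical helper for illustration; it only reads the directory tree
    and never deletes anything.
    """
    missing = []
    for acc_dir in sorted(Path(refseq_dir).glob("GCF_*")):
        if not acc_dir.is_dir():
            continue
        has_file = any(
            p.is_file() and p.name.endswith(suffix) for p in acc_dir.iterdir()
        )
        if not has_file:
            missing.append(acc_dir.name)
    return missing


if __name__ == "__main__":
    # Assumed location of the unpacked mini-database; adjust as needed.
    refseq = "ref_genome_database/bacteria/refseq"
    for name in accessions_missing_suffix(refseq):
        print(name)
```

Running it before paprica-make_ref.py matters here, because the script appears to remove incomplete accession directories, after which the original contents can no longer be inspected.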

Is the archive really misconfigured, or am I getting something wrong?


bowmanjeffs · 2023-02-09T14:17:48Z

Maxim,

You are correct, I haven’t provided the “fake” database in a long time (I’m actually surprised you were able to find it!). Is it critical for what you need to do? I generally recommend against trying to build your own database and encourage you to use the one we provide.

Jeff

Jeff Bowman
Associate Professor
Scripps Institution of Oceanography
www.polarmicrobes.org

masikol · 2023-02-10T10:40:18Z

> Is it critical for what you need to do?

Not really.

> I generally recommend against trying to build your own database

Okay, I’ll just get along without it.

Thank you, Jeff!
