The “fake” database seems to be incomplete #96
Maxim,
You are correct, I haven’t provided the “fake” database in a long time (I’m actually surprised you were able to find it!). Is it critical for what you need to do? I generally recommend against trying to build your own database and encourage you to use the one we provide.
Jeff
=======================
Jeff Bowman
Associate Professor
Scripps Institution of Oceanography
www.polarmicrobes.org
From: Maxim Sikolenko ***@***.***>
Sent: Thursday, February 9, 2023 1:10 AM
To: bowmanjeffs/paprica ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [bowmanjeffs/paprica] The “fake” database seems to be incomplete (Issue #96)
Hello!
It seems that the “fake” ref_genome_database.tgz archive is incomplete: it lacks the 23S rRNA sequence file(s).
And here is why I think so.
I’m trying to build a custom database for paprica, so I followed the tutorial <https://www.polarmicrobes.org/building-the-paprica-database/>: I downloaded the “fake” mini-database <http://www.polarmicrobes.org/extras/ref_genome_database.tgz>, untarred it, and ran paprica-make_ref.py (I use paprica v0.7.2):
./paprica-make_ref.py \
-ref_dir ref_genome_database \
-download test \
-domain bacteria \
-cpus 4
And here is the output:
Checking for reference database directories, will create if necessary...
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Checking files for accession /mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_000007485.1
# [... checking more files, skip it ...]
Checking files for accession /mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_001314225.1
[Parallel(n_jobs=-1)]: Done 4 out of 10 | elapsed: 1.0s remaining: 1.5s
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 1.0s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.2s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 0 out of 0 | elapsed: 0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 0 out of 0 | elapsed: 0.0s finished
Traceback (most recent call last):
File "./paprica-make_ref.py", line 736, in <module>
for record in SeqIO.parse(ref_dir_domain + 'refseq/' + d + '/' + d + '.23S.fasta', 'fasta'):
File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 607, in parse
return iterator_generator(handle)
File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 183, in __init__
super().__init__(source, mode="t", fmt="Fasta")
File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 47, in __init__
self.stream = open(source, "r" + mode)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_000007485.1/GCF_000007485.1.23S.fasta'
I see in paprica-make_ref.py:check_directory that the script checks if a file *23S.fasta exists (line 387). And if some required files don’t exist, the whole directory gets removed (lines 399-400). Indeed, after I run paprica-make_ref.py, all GCF_* directories in ref_genome_database/bacteria/refseq/ become empty.
And indeed, in a freshly downloaded ref_genome_database/bacteria/refseq/, none of the GCF_* directories contains a file ending with 23S.fasta.
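For anyone who wants to confirm this on a fresh copy before running paprica-make_ref.py, here is a minimal read-only sketch of the same kind of check. It is not paprica's own code: the function name find_incomplete_genomes is hypothetical, and the list of required suffixes is an assumption (I only know about *23S.fasta from the traceback, and check_directory may require other files too). It only reports and never deletes anything.

```python
from pathlib import Path


def find_incomplete_genomes(refseq_dir, required_suffixes=("23S.fasta",)):
    """Report GCF_* directories that lack a file ending with each required suffix.

    Hypothetical helper, not part of paprica. The default suffix list is an
    assumption based only on the file named in the traceback.
    Returns a dict mapping accession -> list of missing suffixes.
    """
    missing = {}
    for genome_dir in sorted(Path(refseq_dir).glob("GCF_*")):
        if not genome_dir.is_dir():
            continue
        # Names of regular files directly inside this accession directory.
        names = [p.name for p in genome_dir.iterdir() if p.is_file()]
        absent = [
            suffix
            for suffix in required_suffixes
            if not any(name.endswith(suffix) for name in names)
        ]
        if absent:
            missing[genome_dir.name] = absent
    return missing


# Example call (path taken from the command above):
# report = find_incomplete_genomes("ref_genome_database/bacteria/refseq")
# for accession, absent in report.items():
#     print(accession, "is missing:", ", ".join(absent))
```

If every accession shows up in the report on a freshly untarred archive, the files were never in the archive, rather than being removed by the script.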
Is the archive really misconfigured, or am I getting something wrong?
Not really.
Okay, I’ll just get along without it. Thank you, Jeff!