The “fake” database seems to be incomplete #96

Open
masikol opened this issue Feb 9, 2023 · 2 comments

masikol commented Feb 9, 2023

Hello!

It seems that the “fake” ref_genome_database.tgs archive is incomplete: it lacks the 23S rRNA sequence files.

And here is why I think so.
I’m trying to build a custom database for paprica, so I followed the tutorial: I downloaded the “fake” mini-database, untarred it, and ran paprica-make_ref.py (I use paprica v0.7.2):

./paprica-make_ref.py \
    -ref_dir ref_genome_database \
    -download test \
    -domain bacteria \
    -cpus 4

And here is the output:

Checking for reference database directories, will create if necessary...
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Checking files for accession /mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_000007485.1
# [... checking more files, skip it ...]
Checking files for accession /mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_001314225.1
[Parallel(n_jobs=-1)]: Done   4 out of  10 | elapsed:    1.0s remaining:    1.5s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    1.0s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   0 out of   0 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   0 out of   0 | elapsed:    0.0s finished
Traceback (most recent call last):
  File "./paprica-make_ref.py", line 736, in <module>
    for record in SeqIO.parse(ref_dir_domain + 'refseq/' + d + '/' + d + '.23S.fasta', 'fasta'):
  File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 607, in parse
    return iterator_generator(handle)
  File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 183, in __init__
    super().__init__(source, mode="t", fmt="Fasta")
  File "/home/cager/Misc_soft/py-venv/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 47, in __init__
    self.stream = open(source, "r" + mode)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/1.5_drive_0/tmp/paprica_sandbox/fake_db/ref_genome_database/bacteria/refseq/GCF_000007485.1/GCF_000007485.1.23S.fasta'

I see in paprica-make_ref.py:check_directory that the script checks if a file *23S.fasta exists (line 387). And if some required files don’t exist, the whole directory gets removed (lines 399-400). Indeed, after I run paprica-make_ref.py, all GCF_* directories in ref_genome_database/bacteria/refseq/ become empty.
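
To make that failure mode concrete, here is a minimal sketch of the check-then-clean behavior described above. It is not the actual paprica code: the function names, the REQUIRED_SUFFIXES tuple, and the accession GCF_000000000.1 are made up for illustration, and the only required file confirmed in this issue is the one ending in 23S.fasta. The sketch empties the directory rather than deleting it, to mirror what I observe after the run.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: the real script checks more files than this. The issue only
# confirms that a file ending in "23S.fasta" is among the required ones.
REQUIRED_SUFFIXES = ("23S.fasta",)


def missing_suffixes(directory, required=REQUIRED_SUFFIXES):
    """Return the required suffixes that no file in `directory` ends with."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    return [s for s in required if not any(n.endswith(s) for n in names)]


def check_directory_sketch(directory, required=REQUIRED_SUFFIXES):
    """Empty `directory` if any required file is missing; return what was missing."""
    directory = Path(directory)
    missing = missing_suffixes(directory, required)
    if missing:
        # Mimic the observed outcome: the directory stays, its contents go.
        shutil.rmtree(directory)
        directory.mkdir()
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical accession directory holding only a 16S file.
        acc = Path(tmp) / "GCF_000000000.1"
        acc.mkdir()
        (acc / "GCF_000000000.1.16S.fasta").write_text(">seq\nACGT\n")
        print("missing:", check_directory_sketch(acc))
        print("remaining files:", [p.name for p in acc.iterdir()])
```

Once a directory has been emptied this way, any later attempt to open its 23S.fasta file fails with FileNotFoundError, which matches the traceback above.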

And indeed, in a freshly downloaded ref_genome_database/bacteria/refseq/, none of the GCF_* directories contains a file ending with 23S.fasta.
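
A read-only way to verify this on a freshly untarred archive, before paprica-make_ref.py touches anything, might look like the sketch below. The function name dirs_without_23s is made up for illustration; the script only lists directories and deletes nothing.

```python
from pathlib import Path


def dirs_without_23s(refseq_dir):
    """Return sorted names of GCF_* subdirectories with no file ending in 23S.fasta."""
    lacking = []
    for d in sorted(Path(refseq_dir).glob("GCF_*")):
        # any() stops at the first match, so a single 23S file is enough.
        if d.is_dir() and not any(d.glob("*23S.fasta")):
            lacking.append(d.name)
    return lacking
```

Calling dirs_without_23s("ref_genome_database/bacteria/refseq") from the directory where the archive was extracted should return every accession if the archive really ships without 23S files, and an empty list if they are all present.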

Is the archive really misconfigured? Or am I getting something wrong?

bowmanjeffs (Owner) commented Feb 9, 2023 via email

masikol (Author) commented Feb 10, 2023

> Is it critical for what you need to do?

Not really.

> I generally recommend against trying to build your own database

Okay, I’ll just get along without it.

Thank you, Jeff!
