
split database construction and release processes; provide database catalogs #1569

Closed

ctb opened this issue Jun 5, 2021 · 9 comments

ctb commented Jun 5, 2021

Over in sourmash_databases, we have pipelines that sketch genomes and produce zipfile collections at various ksizes, moltypes, etc.

Separately, we have been taking these zipfile collections and constructing .sbt.zip and lca.json.gz indexed databases from them. This is really nice and easy now! (sourmash index out.sbt.zip in.zip)
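
For reference, the two index builds currently look something like this (a sketch; filenames are placeholders, and ksize/moltype selection flags are omitted):

# SBT index from a zipfile collection
% sourmash index out.sbt.zip in.zip
# LCA index, which additionally needs a taxonomy spreadsheet
% sourmash lca index taxonomy.csv out.lca.json.gz in.zip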

I think we should split these processes formally and automate the latter with snakemake. This latter process would:

  • take as input a CSV or YAML file containing the collection name, the zipfile info with ksize and moltype, and their taxonomy spreadsheets (see the sketch after this list);
  • produce all standard indices (currently sbt.zip and lca.json.gz);
  • create content catalogs and validate content lists with sourmash sig describe;
  • upload the latest version of the databases;
  • (maybe) produce a database catalog that could be used to do things like
    • search for available databases/releases with a new sourmash subcommand
    • automatically download them
    • find databases that have a particular accession or genome in them (based on the catalog, not the signatures ;)
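
A hypothetical sketch of that input file, assuming YAML (all field names invented for illustration; nothing settled):

databases:
  - name: gtdb-rs202
    taxonomy: gtdb-rs202.taxonomy.csv
    zipfiles:
      - filename: gtdb-rs202.genomic.k31.zip
        ksize: 31
        moltype: DNA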

re #991 (distributed as bdbags?) and #1511 (what databases should we provide?) and maybe also #1352 (manifests)

ctb commented Jun 5, 2021

oh, and I think we might need database versions, too...? or at least md5sum hashes so we can tell people they have the wrong version of a database.
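
A minimal version of that could just be a published checksum alongside each release that people can compare against, e.g. (filename is a placeholder):

% md5sum gtdb-rs202.genomic.k31.sbt.zip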

ctb commented Jun 12, 2021

note to self: distribute outputs of sourmash sig describe --csv catalog.csv, which will then be useful as picklists :)
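
Concretely, something like this (a sketch; filenames are placeholders, and the --picklist argument uses the file:colname:coltype syntax):

# build the catalog for a collection
% sourmash sig describe gtdb-rs202.genomic.zip --csv catalog.csv
# later, use the catalog (or a filtered subset of its rows) as a picklist
% sourmash sig extract gtdb-rs202.genomic.zip --picklist catalog.csv:md5:md5 -o subset.zip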

ctb commented Jun 26, 2021

see Snakefile etc., #1511 (comment)

I think it would be good to provide some minimal benchmarks with each database/release in terms of memory usage and so on, too.

ctb commented Mar 16, 2022

Progress!

I think the next step will be to add identifier filtering for the genbank script.

Using the latest code in https://github.com/ctb/2022-sourmash-sketchfrom, all of the below examples produce a CSV file that's compatible with sourmash sketch fromfile. 🎉

They also do the Right Thing with respect to names, so the sequences end up being named properly. 🎉

make a fromfile CSV from genbank genome/protein files

% ./genbank-to-fromfile.py ncbi-assemblies/* -o xyz.csv -t gtdb-rs202.taxonomy.v2.db 
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz'
(new record for name 'GCF_000018865.1 s__Chloroflexus aurantiacus')
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz.csv'

make a fromfile CSV from FASTA files based on record names

note: fasta-to-fromfile.py autodetects sequence type.

% ./fasta-to-fromfile.py podar-ref/[12].fa -o podar.csv
processing file 'podar-ref/1.fa'
(new record for identifier 'CP001941' moltype=DNA)
processing file 'podar-ref/2.fa'
(new record for identifier 'CP001071' moltype=DNA)
---
wrote 2 entries to 'podar.csv'

make a fromfile CSV from FASTA files based on filename

% ./fasta-to-fromfile.py podar-ref/[12].fa -o podar.csv --ident-from-filename
processing file 'podar-ref/1.fa'
(new record for identifier '1' moltype=DNA)
processing file 'podar-ref/2.fa'
(new record for identifier '2' moltype=DNA)
---
wrote 2 entries to 'podar.csv'
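
For reference, the CSV these scripts write should match the format sourmash sketch fromfile consumes (name, genome_filename, protein_filename columns); the genbank example above would yield roughly:

name,genome_filename,protein_filename
GCF_000018865.1 s__Chloroflexus aurantiacus,ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz,ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz

and could then be sketched with something like:

% sourmash sketch fromfile xyz.csv -p dna,k=31,scaled=1000 -p protein,k=10,scaled=200 -o sketches.zip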

bluegenes commented

Really excited about this!!

Genbank-to-fromfile got me thinking about downloading the FASTA files: have you thought about generating a fromfile CSV via the genbank-genomes style information, with download URLs included?

Perhaps what I'm thinking of is that we would like to generate a CSV like this for the download/prepare FASTA files side of the workflow. Since it would contain the info we need for sketch fromfile, we could then also use it here.
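
Something like this hypothetical CSV, say (columns invented for illustration; URLs elided):

name,genome_filename,protein_filename,genome_url,protein_url
GCF_000018865.1 s__Chloroflexus aurantiacus,GCF_000018865.1_genomic.fna.gz,GCF_000018865.1_protein.faa.gz,https://ftp.ncbi.nlm.nih.gov/...,https://ftp.ncbi.nlm.nih.gov/...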

Would this be better over in sourmash_databases? It's not always as simple as download --> sketch, since sometimes .faa files don't exist or assemblies get updated. So we probably want to be able to check these cases at some point while/before building the databases.

ctb commented Mar 20, 2022

> Really excited about this!!
>
> Genbank-to-fromfile got me thinking about downloading the FASTA files: have you thought about generating a fromfile CSV via the genbank-genomes style information, with download URLs included?

We already have code in genome-grist to download the genomes based on accession, which leads me to think in two directions:

  • one is that we can further specialize the genbank-based workflow to deal with finding URLs, etc.
  • the other is that (like the fromfile building stuff and genome-grist) this doesn't belong in sourmash per se.

> Perhaps what I'm thinking of is that we would like to generate a CSV like this for the download/prepare FASTA files side of the workflow. Since it would contain the info we need for sketch fromfile, we could then also use it here.
>
> Would this be better over in sourmash_databases? It's not always as simple as download --> sketch, since sometimes .faa files don't exist or assemblies get updated. So we probably want to be able to check these cases at some point while/before building the databases.

I think this should be part of a separate workflow (but having the issue here is fine :).

The high latency involved in downloading lots of remote files makes it a whole different ballgame. But it sure would be nice to have automatic genome downloading, proteome preparation, etc.!

mr-eyes commented Mar 20, 2022

Don't know if that helps; recently, I automated the download of genomes by accession through the new NCBI API. Here's what I did:

# download the genome zip via the NCBI Datasets API
wget -nc https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCA_019454045.1/download -O GCA_019454045.1.zip
unzip GCA_019454045.1.zip -d GCA_019454045.1
# concatenate, because the extracted directory might contain multiple files, e.g. one sequence file per chromosome
cat GCA_019454045.1/ncbi_dataset/data/GCA_019454045.1/*fna > GCA_019454045.1.fna
rm -rf GCA_019454045.1/

ctb commented Mar 21, 2022

> Don't know if that helps; recently, I automated the download of genomes by accession through the new NCBI API.

very cool!! We should probably change genome-grist to use this.

This doesn't change my hot take that it is productive to separate:

(1) high latency/one-time efforts like downloading new genomes and proteomes
(2) big-compute one-time efforts like computing protein sets for genomes where they do not yet exist
(3) big-compute/big correlation but infrequent efforts like computing new sketches for very large collections
(4) annoying integrative efforts to produce new databases that correctly represent all of the above

I think we have a handle on most of these as separate processes, and I think that combining them into one big workflow would make them frustrating and hard to debug.

eventually we will probably want to automate more of this for diff or patch databases, a la #985

and of course there are other sketching targets to think about.

ctb commented May 1, 2022

closing in favor of #2015.

ctb closed this as completed May 1, 2022