split database construction and release processes; provide database catalogs #1569

Over in sourmash_databases, we have pipelines that sketch genomes and produce zipfile collections at various ksizes, moltypes, etc.

Separately, we have been taking these zipfile collections and constructing .sbt.zip and lca.json.gz indexed databases from them. This is really nice and easy now! (`sourmash index out.sbt.zip in.zip`)

I think we should split these processes formally and automate the latter process with snakemake. This latter process would: run `sourmash sig describe`; …

re #991 (distributed as bdbags?) and #1511 (what databases should we provide?) and maybe also #1352 (manifests)
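For concreteness, a minimal sketch of what that automated release step could run for a single zipfile collection. Only the sourmash commands are real; the collection name (gtdb-rs207.k31.zip), output filenames, and checksum catalog are hypothetical, not actual sourmash_databases conventions:

```bash
#!/bin/bash
# Hypothetical release-prep steps for one zipfile collection.
set -euo pipefail

COLLECTION=gtdb-rs207.k31.zip
BASE=${COLLECTION%.zip}

# record the collection contents, as a starting point for a database catalog
sourmash sig describe "$COLLECTION" > "${BASE}.describe.txt"

# build an indexed database from the zipfile collection
sourmash index "${BASE}.sbt.zip" "$COLLECTION"

# checksums so users can tell whether they have the right database version
md5sum "$COLLECTION" "${BASE}.sbt.zip" > "${BASE}.md5"
```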
Comments
oh, and I think we might need database versions, too...? or at least md5sum hashes so we can tell people they have the wrong version of a database.
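From the user side, that check could be as simple as verifying against a published checksum file; a sketch, assuming a hypothetical databases.md5 catalog shipped with each release:

```bash
# at release time: generate the checksum catalog (filenames illustrative)
md5sum *.zip *.sbt.zip *.lca.json.gz > databases.md5

# on the user side: verify downloaded files against it
md5sum -c databases.md5
```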
note to self: distribute outputs of …
I think it would be good to provide some minimal benchmarks with each database/release, in terms of memory usage and so on, too.
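One cheap way to get a memory number per database, as a sketch: GNU time's `-v` flag reports peak resident set size. The query file and database name here are made up:

```bash
# hypothetical benchmark: peak memory of a gather against one database
/usr/bin/time -v sourmash gather query.sig gtdb-rs207.k31.sbt.zip 2> bench.log
grep 'Maximum resident set size' bench.log
```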
Progress! I think the next step will be to add identifier filtering for the genbank script. Using the latest code in https://github.com/ctb/2022-sourmash-sketchfrom, all of the below examples produce a CSV file that's compatible with `sourmash sketch fromfile`. They also do the Right Thing with respect to names, so the sequences end up being named properly. 🎉

make a …
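For reference, a minimal sketch of that CSV format and how it feeds into sketching; the accession, display name, and parameter string are invented, and I'm assuming the standard fromfile columns (name, genome_filename, protein_filename):

```bash
# hypothetical fromfile CSV; protein_filename is left blank since we
# only have a genome here
cat > example.fromfile.csv <<'EOF'
name,genome_filename,protein_filename
GCA_019454045.1 Example species strain xyz,GCA_019454045.1.fna,
EOF

# sketch everything listed in the CSV into a zipfile collection
sourmash sketch fromfile example.fromfile.csv -p dna,k=31,scaled=1000 -o example.zip
```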
Really excited about this!! Genbank-to-fromfile got me thinking about downloading the FASTA files: have you thought about generating a fromfile csv via the genbank-genomes style information, with download urls included? Perhaps what I'm thinking of is that we would like to generate a csv like this for the download/prepare-FASTA side of the workflow. Since it would contain the info we need for sketch fromfile, we could then also use it here. Would this be better over in sourmash_databases? It's not always as simple as download --> sketch, since sometimes .faa files don't exist or assemblies get updated. So we probably want to be able to check these cases at some point while/before building the databases.
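Something like the following, maybe; note the download_url column is a purely hypothetical extension of the fromfile format, sketched here only to make the idea concrete:

```bash
# hypothetical CSV: standard fromfile columns plus an invented download_url
cat > example.download.csv <<'EOF'
name,genome_filename,protein_filename,download_url
GCA_019454045.1 Example species strain xyz,GCA_019454045.1.fna,,https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCA_019454045.1/download
EOF
```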
We already have code in genome-grist to download the genomes based on accession, which leads me to think in two directions: …
I think this should be part of a separate workflow (but having the issue here is fine :). The high latency involved in downloading lots of remote files makes it a whole different ballgame. But it sure would be nice to have automatic genome downloading, proteome preparation, etc.!
Don't know if that helps:

```bash
# download the assembly package via the NCBI Datasets API
wget -nc https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCA_019454045.1/download -O GCA_019454045.1.zip
unzip GCA_019454045.1.zip -d GCA_019454045.1

# concatenate into a single FASTA, because the extracted directory might
# contain multiple files, e.g. one per chromosome
cat GCA_019454045.1/ncbi_dataset/data/GCA_019454045.1/*fna > GCA_019454045.1.fna
rm -rf GCA_019454045.1/
```
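Scaling that up is mostly a loop; a minimal sketch, assuming a one-accession-per-line input file accessions.txt (hypothetical name):

```bash
# fetch and flatten each assembly listed in accessions.txt
while read -r acc; do
    wget -nc "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/${acc}/download" -O "${acc}.zip"
    unzip -o "${acc}.zip" -d "${acc}"
    # one .fna per assembly, concatenating any per-chromosome files
    cat "${acc}"/ncbi_dataset/data/"${acc}"/*fna > "${acc}.fna"
    rm -rf "${acc}/"
done < accessions.txt
```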
very cool!! We should probably change genome-grist to use this. This doesn't change my hot take that it is productive to separate: (1) high latency/one-time efforts like downloading new genomes and proteomes; … I think we have a handle on most of these as separate processes, and combining them into one big workflow would make them frustrating and hard to debug. Eventually we will probably want to automate more of this for diff or patch databases, a la #985, and of course there are other sketching targets to think about.
closing in favor of #2015.