make some kind of JSON list of all "official" sourmash databases #1005

Open
ctb opened this issue May 27, 2020 · 7 comments

Comments

@ctb
Contributor

ctb commented May 27, 2020

this came up in conversation with Jason Stajich over a year ago, and I never put it in an issue: it'd be great to have a machine-readable list of sourmash databases somewhere.

charcoal does a nice thing where it uses snakemake to download databases. we could do something similar here.

@ctb
Contributor Author

ctb commented Mar 12, 2022

related: #1847

@ctb
Contributor Author

ctb commented May 3, 2022

side note, we should definitely automate the production of parts of the databases page in the docs. Especially once we have protein databases to distribute!

@luizirber
Member

Idea: make a pooch registry -> https://www.fatiando.org/pooch/v1.6.0/multiple-files.html

@ctb
Contributor Author

ctb commented May 6, 2022

this looks awesome!

so... wait... we could even add a sourmash CLI command that lists available databases and grabs the ksize/database type you want, I think? while still making them all available for individual download as well as programmatic download? that'd be cool.
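A rough sketch of what such a listing command could look like, assuming a JSON catalog of databases. The catalog fields (`name`, `ksize`, `type`, `url`), the example.org URLs, and the function names are all hypothetical; no such sourmash command exists as far as this thread says.

```python
import argparse
import json

# Hypothetical catalog: field names and URLs are placeholders, not real sourmash data.
CATALOG_JSON = """
[
  {"name": "gtdb-rs207", "ksize": 21, "type": "sbt.zip",
   "url": "https://example.org/gtdb-rs207.genomic.k21.sbt.zip"},
  {"name": "gtdb-rs207", "ksize": 31, "type": "sbt.zip",
   "url": "https://example.org/gtdb-rs207.genomic.k31.sbt.zip"},
  {"name": "gtdb-rs207", "ksize": 31, "type": "zip",
   "url": "https://example.org/gtdb-rs207.genomic.k31.zip"}
]
"""


def select_databases(catalog, ksize=None, dbtype=None):
    """Return catalog entries matching the requested ksize and/or type."""
    return [
        entry
        for entry in catalog
        if (ksize is None or entry["ksize"] == ksize)
        and (dbtype is None or entry["type"] == dbtype)
    ]


def main(argv=None):
    """Print matching databases, one per line, tab-separated."""
    parser = argparse.ArgumentParser(description="list available databases (sketch)")
    parser.add_argument("-k", "--ksize", type=int, help="only show this ksize")
    parser.add_argument("-t", "--type", dest="dbtype", help="only show this database type")
    args = parser.parse_args(argv)

    catalog = json.loads(CATALOG_JSON)
    for entry in select_databases(catalog, args.ksize, args.dbtype):
        print(entry["name"], entry["ksize"], entry["type"], entry["url"], sep="\t")
```

Calling `main(["-k", "31"])` would print the two k=31 entries; the URLs stay visible so each file remains available for individual download.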

@ctb
Contributor Author

ctb commented May 6, 2022

we'd maybe want to create a remote registry file that we retrieve dynamically?

https://www.fatiando.org/pooch/v1.6.0/registry-files.html
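To make the remote-registry idea concrete, here is a standard-library sketch that fetches and parses a registry file. It assumes the layout described on the linked pooch page (one file per line: name, hash, and optionally a URL); the file names, hashes, and URLs below are placeholders, and in practice pooch itself would do the loading.

```python
import urllib.request


def parse_registry(text):
    """Parse registry text into {filename: (hash, url_or_None)}."""
    registry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 'name hash [url]', got: {line!r}")
        name, file_hash = fields[0], fields[1]
        url = fields[2] if len(fields) == 3 else None
        registry[name] = (file_hash, url)
    return registry


def fetch_registry(registry_url):
    """Download the registry file at run time, so new databases appear without a release."""
    with urllib.request.urlopen(registry_url) as response:
        return parse_registry(response.read().decode("utf-8"))


# Placeholder content: hashes and URLs are made up for illustration.
EXAMPLE_REGISTRY = """\
# name hash [url]
gtdb-rs207.genomic.k31.sbt.zip sha256:0000 https://example.org/db/gtdb-rs207.genomic.k31.sbt.zip
gtdb-rs207.genomic.k31.zip sha256:1111
"""
```

Entries without a URL would be resolved against a base URL, which is where a single hosted directory of databases would fit.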

@bluegenes
Contributor

bluegenes commented Jun 23, 2022

> so... wait... we could even add a sourmash CLI command that lists available databases and grabs the ksize/database type you want, I think? while still making them all available for individual download as well as programmatic download? that'd be cool.

this would be fantastic and solve some database download frustrations I have with thumper now that I want to use different database filetypes (sbt vs sql).

This might be what you're envisioning, but to be explicit -- could the user specify a database name, say gtdb-rs207 (or even gtdb), and we use command line params to generate the full db name on the fly? e.g. pick the right ksize, alphabet, scaled, filetype (sbt/zip/etc), which then allows us to download the right file(s) via the pooch registry? That way the user doesn't need to know or write out the full database name (e.g. gtdb-rs207.genomic.k31.sbt.zip).
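A minimal sketch of generating the full name on the fly. The pattern is inferred only from the single example above (gtdb-rs207.genomic.k31.sbt.zip); the function name, parameter names, and defaults are hypothetical, and scaled is left out because that example name does not encode it.

```python
def database_filename(name, ksize, moltype="genomic", filetype="sbt.zip"):
    """Build a database file name of the form <name>.<moltype>.k<ksize>.<filetype>.

    The pattern is an assumption based on one example name, not a documented convention.
    """
    if ksize <= 0:
        raise ValueError("ksize must be a positive integer")
    return f"{name}.{moltype}.k{ksize}.{filetype}"


# The user supplies only the short name plus parameters.
full_name = database_filename("gtdb-rs207", 31)
```

The resulting string would then be the key looked up in the registry to find the file to download.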

Additional thought: It would be handy to include the taxonomy file inside each database file (possible with zip, sbt.zip, and sqldb and not needed for lca, right?). That would reduce extra download code and the need to link the correct taxonomy file with each database. For taxonomy functions with official databases, users could provide the database on the command line (instead of needing to find/download the taxonomy file), and we could automatically find it. I would imagine TAXONOMY.csv, complementary to manifest file. We would still allow alternate taxonomies, of course, but at least each db would come with the official set for that db?
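As a sketch of the zip case, the lookup could be as simple as checking for a TAXONOMY.csv member and falling back when it is absent. The CSV columns, the example row, and the function names here are illustrative assumptions, not the actual sourmash taxonomy format.

```python
import csv
import io
import zipfile

TAXONOMY_NAME = "TAXONOMY.csv"  # proposed member name, alongside the manifest


def read_bundled_taxonomy(zip_source):
    """Return taxonomy rows (as dicts) from a database zip, or None if not bundled."""
    with zipfile.ZipFile(zip_source) as zf:
        if TAXONOMY_NAME not in zf.namelist():
            return None  # caller falls back to a user-supplied taxonomy file
        with zf.open(TAXONOMY_NAME) as fp:
            text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def make_example_zip():
    """Build an in-memory zip with placeholder contents, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("SOURMASH-MANIFEST.csv", "placeholder\n")  # stand-in for the manifest
        zf.writestr(
            TAXONOMY_NAME,
            "ident,lineage\nGCF_000000000.1,d__Bacteria;p__Example\n",
        )
    buf.seek(0)
    return buf


rows = read_bundled_taxonomy(make_example_zip())
```

Returning None when the member is missing keeps alternate taxonomies possible: an explicitly provided file would simply take precedence over the bundled one.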

@ctb
Contributor Author

ctb commented Sep 5, 2022

note that #985 - distributing diff/patch databases - fits in really nicely here.
