supporting file loading over HTTP - thoughts and concerns #2257
Comments
oh - why not support http URIs in/for/with screed as well, so you can use them for sketching? I bet that would make @bluegenes happy. |
taxonomy files (CSV at least) should also be loadable via http URIs. |
re (note for future: https://crates.io/crates/object_store might be a start on the Rust side; more info at https://www.influxdata.com/blog/rust-object-store-donation/) |
On the static sqlite side: the |
Clint Valentine suggests caching: https://twitter.com/clintcodesbio/status/1566808784852377605. My thoughts are that this adds a lot of complexity for a feature that no one is using yet, with the caveat that people might not use it if it doesn't work well, which might involve implementing caching 😆 Note to self:
|
yes, dug in a bit, between that or |
implemented with fsspec as a plug-in here: https://github.com/sourmash-bio/sourmash_plugin_load_urls |
over in #2256, I have an experimental PR that adds direct loading of .sig files (JSON format) via HTTP GET, using the `requests` library. It's remarkably simple and it provides a lot of interesting functionality, especially when you combine it with standalone manifests!
On the good side, this would support things like manifest-only provision of databases, where we can provide a simple CSV file for all of GTDB, and then use picklists and `--include` to grab specific sketches for search etc. Super convenient!
On the bad side, we would inevitably have users doing all-GTDB searches using the provided manifest :). This would be bad technically because it could put a lot of network load on the user and maybe the server side, and it would be bad from a UX perspective because it would be really, really slow without much indication of what's going on.
I'm not quite sure how to deal with this from a UX or design perspective... we could provide a warning message based on an internal counter tracking bytes-downloaded-from-network, or something? And if it goes over (say) 10 MB, we mention that whatever you're doing might not be a good idea?
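A minimal sketch of what such a counter could look like, assuming a hypothetical `DownloadBudget` helper that whatever does the fetching calls after each response. The class name, the method names, and the warning text are all made up for illustration; only the 10 MB figure comes from the paragraph above.

```python
import warnings


class DownloadBudget:
    """Track bytes fetched over the network; warn once past a threshold.

    Hypothetical helper, not part of sourmash.
    """

    def __init__(self, warn_after_bytes=10 * 1000 * 1000):  # ~10 MB
        self.warn_after_bytes = warn_after_bytes
        self.total_bytes = 0
        self.warned = False

    def record(self, n_bytes):
        """Add n_bytes to the running total and return the new total.

        Emits a single warning the first time the total exceeds the limit.
        """
        self.total_bytes += n_bytes
        if not self.warned and self.total_bytes > self.warn_after_bytes:
            self.warned = True
            warnings.warn(
                f"{self.total_bytes / 1e6:.1f} MB downloaded over the network "
                "so far; a local copy of the database may be a better idea",
                stacklevel=2,
            )
        return self.total_bytes
```

Warning only once keeps the output readable during a long search; a progress indicator would be a separate concern.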
That all having been said, I really really like the convenience of direct-loading stuff from HTTPS! It's a super powerful feature that is really generative and enabling!
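For concreteness, the direct-loading idea can be sketched roughly as follows. This is not the code from the experimental PR: it uses the standard library's `urllib.request` rather than `requests` so the example has no dependencies, the function names are hypothetical, and it stops at the parsed JSON instead of building sourmash signature objects. The gzip check is an assumption that some servers will hand back compressed `.sig.gz` files.

```python
import gzip
import json
import urllib.request


def fetch_bytes(url, timeout=30):
    """Issue a plain GET and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def load_sig_records(url, fetch=fetch_bytes):
    """Fetch a JSON .sig file from a URL and return the parsed JSON.

    `fetch` is injectable so the parsing can be exercised without a network.
    """
    raw = fetch(url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Keeping the fetch step separate from the parse step also leaves an obvious place to hang byte counting or caching later.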
Further thoughts -
we could also support loading manifest CSV files directly. I think this is an obvious feature to support; then (for the user) it becomes as simple as 'load this HTTP thing' and you can get either a standalone manifest or a signature file which both present as a database.
I don't like the idea of supporting LCA JSON databases tho. Big, unwieldy. Make people download 'em ;).
could we serve and load directly from within zip file collections using a static web server??
it would also be interesting to explore direct loading of signatures and manifests from static SQLite via something like this - Hosting SQLite databases on Github Pages (or any static file hoster).
we could support this as a more generic "Web Storage" class, e.g. to support things like mastiff - see "prefetch-only `Index` classes and/or remote servers?" (#2229) - where you have a prefetch-search-server and a signature-storage-server.
a specialized Web storage class (perhaps configured by a custom JSON thinger?) would be one way of handling another nuisance, which is that when you use a standalone manifest with HTTP URLs, the `internal_location` contains the entire URL, which becomes very redundant for a server that contains millions of files. It would be nice to say "look, we have a lot of accessions at this standard-format URI, here's how you construct a specific URI given an accession".
All of this also nicely complements the idea of a `sourmash download` style command where we provide a structured JSON list of available databases that you can grab; see #1005 for discussion. Also could be a nice way to support lightweight patch databases ("here's an override signature") - #985.
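The "construct a specific URI given an accession" idea could be sketched like this, assuming a small JSON config with a single `uri_template` key. The key name, the `${accession}` placeholder syntax, the `AccessionLocator` class, and the example.org URL are all hypothetical; nothing like this exists in sourmash as far as this issue describes.

```python
import json
from string import Template
from urllib.parse import quote


class AccessionLocator:
    """Build per-accession URIs from one template instead of storing full URLs."""

    def __init__(self, uri_template):
        self.template = Template(uri_template)

    @classmethod
    def from_json(cls, text):
        """Create a locator from a JSON document with a 'uri_template' key."""
        config = json.loads(text)
        return cls(config["uri_template"])

    def uri_for(self, accession):
        """Return the URI for one accession, percent-encoding unsafe characters."""
        return self.template.substitute(accession=quote(accession, safe=""))


config_text = '{"uri_template": "https://example.org/sigs/${accession}.sig.gz"}'
locator = AccessionLocator.from_json(config_text)
print(locator.uri_for("GCF_000005845.2"))
```

With something like this, a manifest would only need to carry the accession per row, and the full location would be derived at load time.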