supporting file loading over HTTP - thoughts and concerns #2257

ctb · 2022-09-05T13:51:25Z

over in #2256, I have an experimental PR that adds direct loading of .sig files (JSON format) via HTTP GET, using the requests library. It's remarkably simple and it provides a lot of interesting functionality, especially when you combine it with standalone manifests!

On the good side, this would support things like manifest-only provision of databases, where we can provide a simple CSV file for all of GTDB, and then use picklists and --include to grab specific sketches for search etc. Super convenient!
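As a rough illustration of the manifest-plus-picklist idea, here is a minimal stdlib-only sketch. It is not sourmash's own manifest or picklist code: it assumes a CSV with an `internal_location` column (mentioned later in this thread) and a `name` column, and `select_locations` and the example URLs are hypothetical.

```python
import csv

# Hypothetical manifest contents; the column names are assumptions for this sketch.
EXAMPLE_MANIFEST = """\
internal_location,name
https://example.org/sigs/a.sig,genome A
https://example.org/sigs/b.sig,genome B
https://example.org/sigs/c.sig,genome C
"""


def select_locations(manifest_text, wanted_names):
    """Return internal_location values for rows whose name is in wanted_names."""
    # Drop blank lines and '#' comment lines (e.g. a version header) before parsing.
    lines = [
        line
        for line in manifest_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    wanted = set(wanted_names)
    reader = csv.DictReader(lines)
    return [row["internal_location"] for row in reader if row["name"] in wanted]


if __name__ == "__main__":
    for location in select_locations(EXAMPLE_MANIFEST, ["genome A", "genome C"]):
        print(location)
```

Only the selected locations would then need to be fetched, which is what keeps the manifest-only approach cheap for small picklists.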

On the bad side, we would inevitably have users doing all-GTDB searches using the provided manifest :). This would be bad technically because it could put a lot of network load on the user and maybe the server side, and it would be bad from a UX perspective because it would be really, really slow without much indication of what's going on.

I'm not quite sure how to deal with this from a UX or design perspective... we could provide a warning message based on an internal counter tracking bytes-downloaded-from-network, or something? And if it goes over (say) 10 MB, we mention that whatever you're doing might not be a good idea?
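One way the bytes-downloaded counter could look, as a sketch only: `DownloadTracker` and `fetch` are hypothetical names, the 10 MB threshold is the figure floated above, and the standard library's `urllib` is used here in place of requests to keep the example self-contained.

```python
import sys
import urllib.request


def _print_to_stderr(message):
    print(message, file=sys.stderr)


class DownloadTracker:
    """Count bytes fetched over the network and warn once past a threshold."""

    def __init__(self, warn_after_bytes=10 * 1024 * 1024, emit=_print_to_stderr):
        self.warn_after_bytes = warn_after_bytes
        self.emit = emit
        self.total_bytes = 0
        self.warned = False

    def add(self, n_bytes):
        self.total_bytes += n_bytes
        # Warn only the first time the running total crosses the threshold.
        if not self.warned and self.total_bytes > self.warn_after_bytes:
            self.warned = True
            self.emit(
                f"WARNING: {self.total_bytes} bytes downloaded so far; "
                "consider downloading this collection locally instead."
            )


def fetch(url, tracker, chunk_size=64 * 1024):
    """Download url in chunks, reporting each chunk's size to the tracker."""
    chunks = []
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            tracker.add(len(chunk))
            chunks.append(chunk)
    return b"".join(chunks)
```

Sharing a single tracker across all loads in one command would make the warning reflect the total network cost of the operation rather than any one file.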

That all having been said, I really really like the convenience of direct-loading stuff from HTTPS! It's a super powerful feature that is really generative and enabling!

Further thoughts:

- we could also support loading manifest CSV files directly. I think this is an obvious feature to support; then (for the user) it becomes as simple as 'load this HTTP thing' and you can get either a standalone manifest or a signature file which both present as a database.
- I don't like the idea of supporting LCA JSON databases tho. Big, unwieldy. Make people download 'em ;).
- could we serve and load directly from within zip file collections using a static web server??
- it would also be interesting to explore direct loading of signatures and manifests from static SQLite via something like "Hosting SQLite databases on Github Pages (or any static file hoster)".
- we could support this as a more generic "Web Storage" class, e.g. to support things like mastiff (see "prefetch-only Index classes and/or remote servers?", #2229), where you have a prefetch-search-server and a signature-storage-server.
- a specialized Web storage class (perhaps configured by a custom JSON thinger?) would be one way of handling another nuisance, which is that when you use a standalone manifest with HTTP URLs, the internal_location contains the entire URL, which becomes very redundant for a server that contains millions of files. It would be nice to say "look, we have a lot of accessions at this standard-format URI, here's how you construct a specific URI given an accession".
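The "construct a specific URI given an accession" idea could be sketched roughly as below. Everything here is hypothetical: the `AccessionLocator` class, the `url_template` config key, and the example.org URL are illustrations, not an existing sourmash interface.

```python
import json
from urllib.parse import quote


class AccessionLocator:
    """Build per-accession URLs from one shared template."""

    def __init__(self, template):
        if "{accession}" not in template:
            raise ValueError("template must contain an {accession} placeholder")
        self.template = template

    @classmethod
    def from_json(cls, text):
        # Hypothetical config format: {"url_template": "https://.../{accession}.sig"}
        config = json.loads(text)
        return cls(config["url_template"])

    def url_for(self, accession):
        # Percent-encode the accession so it cannot alter the URL's structure.
        return self.template.format(accession=quote(accession, safe=""))


if __name__ == "__main__":
    config = '{"url_template": "https://example.org/sigs/{accession}.sig"}'
    locator = AccessionLocator.from_json(config)
    print(locator.url_for("GCF_000005845.2"))
```

With something like this, a manifest would only need to store the accession per row, and the one template would live in the storage configuration.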

All of this also nicely complements the idea of a sourmash download style command where we provide a structured JSON list of available databases that you can grab; see #1005 for discussion. Also could be a nice way to support lightweight patch databases ("here's an override signature") - #985.

ctb · 2022-09-05T13:55:47Z

oh - why not support http URIs in/for/with screed as well, so you can use them for sketching? I bet that would make @bluegenes happy.

ctb · 2022-09-05T14:09:48Z

taxonomy files (CSV at least) should also be loadable via http URIs.

luizirber · 2022-09-05T15:15:01Z

re WebStorage: maybe check fsspec before we start reinventing this =]

(note for future: https://crates.io/crates/object_store might be a start on the Rust side, more info -> https://www.influxdata.com/blog/rust-object-store-donation/)

luizirber · 2022-09-05T15:37:24Z

On the static sqlite side: the wort DB is currently in Postgres, but I've been thinking about moving it into sqlite and making wort into a static site. The DB is sort of midway towards a manifest anyway; the important thing is Database/Dataset here (and all the User/Task functionality can be managed in other places).

ctb · 2022-09-05T16:57:15Z

Clint Valentine suggests caching! See https://twitter.com/clintcodesbio/status/1566808784852377605

My thoughts are that this adds a lot of complexity for a feature that no one is yet using - with the caveat that people might not use it if it doesn't work well, which might involve implementing caching 😆

Note to self:

- look into Python urllib3 or requests caching frameworks that cache to disk - e.g. requests.caching (link)
- also look into urllib3 or requests download tracking - see stackoverflow answer re requests
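For a sense of how little a basic cache-to-disk layer needs, here is a minimal stdlib-only sketch. It is not one of the caching frameworks mentioned above; `cached_fetch` and its `fetch` parameter are hypothetical names, and it ignores expiry and HTTP cache headers entirely.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def _download(url):
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_fetch(url, cache_dir, fetch=_download):
    """Return the bytes at url, reusing an on-disk copy when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on a hash of the URL so any URL maps to a safe filename.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch(url)
    # Write to a temporary name first so an interrupted write is never read back.
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_bytes(data)
    tmp_path.replace(path)
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        # A stand-in fetcher so the demo runs without network access.
        data = cached_fetch("https://example.org/a.sig", tmpdir, fetch=lambda url: b"[]")
        print(len(data))
```

A real implementation would also need invalidation of some kind, which is where most of the complexity mentioned above would come from.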

ctb · 2022-09-05T17:14:26Z

> re WebStorage: maybe check fsspec before we start reinventing this =]

yes, dug in a bit, between that or smart_open perhaps we should just generically support opening URLs!
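To make "generically support opening URLs" concrete, here is a minimal stdlib stand-in for what fsspec or smart_open would provide far more completely. `open_text` and `is_remote` are hypothetical helper names, and only http/https are treated as remote.

```python
import io
import os
import tempfile
import urllib.request
from urllib.parse import urlparse

REMOTE_SCHEMES = {"http", "https"}


def is_remote(location):
    """True if location looks like an HTTP(S) URL rather than a local path."""
    return urlparse(location).scheme in REMOTE_SCHEMES


def open_text(location, encoding="utf-8"):
    """Open a local path or an HTTP(S) URL as a readable text stream."""
    if is_remote(location):
        response = urllib.request.urlopen(location)
        # Wrap the binary HTTP response so callers get str lines, like open().
        return io.TextIOWrapper(response, encoding=encoding)
    return open(location, "r", encoding=encoding)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "example.csv")
        with open(path, "w", encoding="utf-8") as fp:
            fp.write("ident,lineage\n")
        # The same call would accept an https:// URL in place of the path.
        with open_text(path) as fp:
            print(fp.read())
```

Callers that only ever iterate over lines (CSV manifests, taxonomy files) would not need to know which kind of location they were given.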

ctb · 2023-01-01T16:58:50Z

implemented with fsspec as a plug-in here: https://github.com/sourmash-bio/sourmash_plugin_load_urls

ctb mentioned this issue Sep 5, 2022

[EXP] provide signature file loading function via HTTP #2256

Open

ctb mentioned this issue Sep 5, 2022

thoughts on internal_location in manifests - should it be a URI? #2258

Open

ctb mentioned this issue Dec 31, 2022

[MRG] provide an initial plugin architecture for sourmash that supports new signature saving & loading mechanisms #2428

Merged
