supporting file loading over HTTP - thoughts and concerns #2257

Open
ctb opened this issue Sep 5, 2022 · 7 comments

@ctb
Contributor

ctb commented Sep 5, 2022

over in #2256, I have an experimental PR that adds direct loading of .sig files (JSON format) via HTTP GET, using the requests library. It's remarkably simple and it provides a lot of interesting functionality, especially when you combine it with standalone manifests!
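To illustrate the basic idea (this is not the code from #2256), fetching a JSON file with a single GET could be sketched roughly as below. The sketch uses the standard library's `urllib.request` in place of `requests` so that it has no dependencies; `load_json_from_url` is a hypothetical helper name, not a sourmash function.

```python
import json
import urllib.request


def load_json_from_url(url, timeout=30):
    """Fetch `url` with one GET request and parse the body as JSON.

    Hypothetical helper for illustration only; not part of sourmash.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (the URL is a placeholder, not a real endpoint):
# records = load_json_from_url("https://example.org/some-signature.sig")
```

A real loader would presumably also need to handle compressed files and HTTP errors, which this sketch leaves out.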

On the good side, this would support things like manifest-only provision of databases, where we can provide a simple CSV file for all of GTDB, and then use picklists and --include to grab specific sketches for search etc. Super convenient!
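A minimal sketch of the "manifest plus selection" idea, assuming the manifest is a CSV with `name` and `internal_location` columns and that lines starting with `#` can be ignored. The function name is hypothetical, and this is not how sourmash picklists are implemented; it only shows that a small CSV is enough to decide which sketches to fetch.

```python
import csv
import io


def select_locations(manifest_text, wanted_names):
    """Return internal_location values for rows whose name is wanted.

    Assumes a CSV with 'name' and 'internal_location' columns; lines
    starting with '#' are skipped. Illustrative only.
    """
    lines = [ln for ln in manifest_text.splitlines() if not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    wanted = set(wanted_names)
    return [row["internal_location"] for row in reader if row["name"] in wanted]
```

Only the locations returned here would then need to be downloaded, rather than every entry in the manifest.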

On the bad side, we would inevitably have users doing all-GTDB searches using the provided manifest :). This would be bad technically because it could put a lot of network load on the user and maybe the server side, and it would be bad from a UX perspective because it would be really, really slow without much indication of what's going on.

I'm not quite sure how to deal with this from a UX or design perspective... we could provide a warning message based on an internal counter tracking bytes-downloaded-from-network, or something? And if it goes over (say) 10 MB, we mention that whatever you're doing might not be a good idea?
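The counter idea could look something like the following sketch. `DownloadTracker` and its method names are made up for illustration, and the 10 MB default just mirrors the number floated above; whatever code does the actual downloading would need to call `record()` with the size of each response.

```python
import warnings


class DownloadTracker:
    """Count bytes fetched from the network and warn once past a limit."""

    def __init__(self, warn_after_bytes=10 * 1000 * 1000):
        self.warn_after_bytes = warn_after_bytes
        self.total_bytes = 0
        self.warned = False

    def record(self, n_bytes):
        """Add n_bytes to the total; warn the first time the limit is exceeded."""
        self.total_bytes += n_bytes
        if not self.warned and self.total_bytes > self.warn_after_bytes:
            self.warned = True
            warnings.warn(
                f"downloaded {self.total_bytes} bytes over the network so far; "
                "consider downloading the database locally instead",
                stacklevel=2,
            )
        return self.total_bytes
```

Warning only once keeps the output readable when many small files are fetched in a loop.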

That all having been said, I really really like the convenience of direct-loading stuff from HTTPS! It's a super powerful feature that is really generative and enabling!

Further thoughts -

  • we could also support loading manifest CSV files directly. I think this is an obvious feature to support; then (for the user) it becomes as simple as 'load this HTTP thing' and you can get either a standalone manifest or a signature file which both present as a database.

  • I don't like the idea of supporting LCA JSON databases tho. Big, unwieldy. Make people download 'em ;).

  • could we serve and load directly from within zip file collections using a static web server??

  • it would also be interesting to explore direct loading of signatures and manifests from static SQLite via something like this - Hosting SQLite databases on Github Pages (or any static file hoster).

  • we could support this as a more generic "Web Storage" class, e.g. to support things like mastiff - see prefetch-only Index classes and/or remote servers? #2229 - where you have a prefetch-search-server and a signature-storage-server.

  • a specialized Web storage class (perhaps configured by a custom JSON thinger?) would be one way of handling another nuisance, which is that when you use a standalone manifest with HTTP URLs, the internal_location contains the entire URL, which becomes very redundant for a server that contains millions of files. It would be nice to say "look, we have a lot of accessions at this standard-format URI, here's how you construct a specific URI given an accession".
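For the last point, a sketch of "here's how you construct a specific URI given an accession" might be as small as a template with one placeholder. The `{accession}` placeholder syntax, the function name, and the example URL are all assumptions for illustration; nothing like this exists in sourmash as far as this thread says.

```python
from urllib.parse import quote


def make_location_builder(template):
    """Return a function that maps an accession to a URI using `template`.

    `template` must contain the literal placeholder '{accession}'.
    Hypothetical helper for illustration only.
    """
    if "{accession}" not in template:
        raise ValueError("template must contain '{accession}'")

    def build(accession):
        # Percent-encode the accession so it is safe inside a URL path.
        return template.replace("{accession}", quote(accession, safe=""))

    return build


# Example with a placeholder server name:
# build = make_location_builder("https://example.org/sigs/{accession}.sig")
# build("GCF_000005845.2")
```

With something like this, a manifest would only need to store the accession per row, and the template would be stored once.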

All of this also nicely complements the idea of a sourmash download style command where we provide a structured JSON list of available databases that you can grab; see #1005 for discussion. Also could be a nice way to support lightweight patch databases ("here's an override signature") - #985.

@ctb
Contributor Author

ctb commented Sep 5, 2022

oh - why not support http URIs in/for/with screed as well, so you can use them for sketching? I bet that would make @bluegenes happy.

@ctb
Contributor Author

ctb commented Sep 5, 2022

taxonomy files (CSV at least) should also be loadable via http URIs.

@luizirber
Member

luizirber commented Sep 5, 2022

re WebStorage: maybe check fsspec before we start reinventing this =]

(note for future: https://crates.io/crates/object_store might be a start on the Rust side, more info -> https://www.influxdata.com/blog/rust-object-store-donation/)

@luizirber
Member

On the static sqlite side: the wort DB is currently in Postgres, but I've been thinking about moving it into sqlite and making wort into a static site. The DB is sort of midway towards a manifest anyway; the important thing is Database/Dataset here (and all the User/Task functionality can be managed in other places).

@ctb
Contributor Author

ctb commented Sep 5, 2022

Clint Valentine suggests caching! https://twitter.com/clintcodesbio/status/1566808784852377605

My thoughts are that this adds a lot of complexity for a feature that no one is using yet - with the caveat that people might not use it if it doesn't work well, which might involve implementing caching 😆

Note to self:

  • look into python urllib3 or requests caching frameworks that cache to disk - e.g. requests.caching (link)
  • also look into urllib3 or requests download tracking - see stackoverflow answer re requests
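For reference, the simplest version of caching to disk could be sketched as below, using only the standard library. This is a hand-rolled illustration rather than any of the caching frameworks mentioned above: `fetch_cached` and its `fetch` parameter are hypothetical, entries never expire, and HTTP cache headers are ignored.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir, fetch=None):
    """Return the bytes at `url`, reusing a copy in `cache_dir` if present.

    `fetch` is an optional callable taking a URL and returning bytes; it
    defaults to a plain urllib GET. Illustrative only: no expiry and no
    handling of HTTP cache headers.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use a hash of the URL as the cache file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cached = cache_dir / key
    if cached.exists():
        return cached.read_bytes()
    if fetch is None:
        with urllib.request.urlopen(url) as response:
            data = response.read()
    else:
        data = fetch(url)
    cached.write_bytes(data)
    return data
```

Even this small version shows where the complexity comes from: deciding when a cached copy is stale is the hard part, and it is exactly what the sketch skips.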

@ctb
Contributor Author

ctb commented Sep 5, 2022

> re WebStorage: maybe check fsspec before we start reinventing this =]

yes, dug in a bit; between that and smart_open, perhaps we should just generically support opening URLs!

@ctb
Contributor Author

ctb commented Jan 1, 2023

implemented with fsspec as a plug-in here: https://github.com/sourmash-bio/sourmash_plugin_load_urls


2 participants