
feat: read cloud creds for obstore from env #556

Open · wants to merge 4 commits into main

Conversation


@alekzvik alekzvik commented Dec 5, 2024

Closes

Description

This is a POC.

Right now we init the object_store in format.rs.
We use the object_store::parse_url_opts function to parse the URL, extract the store kind, bucket, etc. from it, and build a store.

I yanked this function to change one line in a macro from builder::new() to builder::from_env().
Pros:

  • do not need to think about the env, since we reuse the builder logic there
  • do not need to explicitly make a store object we interact with

Cons:

  • still create a new ObjectStore for each format.get_opts() call
  • reads the env a lot

After that, I created a per-bucket cache for object stores. The idea is not to expose the ObjectStore object anywhere but to keep using a combination of item URL & options to get the one you need. A similar approach to dealing with object stores is used in polars. I tried to re-use any URL parsing from the object_store crate to reduce the maintenance overhead.
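A minimal sketch of that per-bucket cache idea, using only the standard library. All names here (`Store`, `get_store`) are hypothetical stand-ins for the real object_store types, which would be built from the parsed URL and options:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-in for a configured object_store client.
#[derive(Debug)]
struct Store {
    base_url: String,
}

// Cache key: the bucket's base URL plus the options used to build the store.
type CacheKey = (String, Vec<(String, String)>);

static CACHE: OnceLock<Mutex<HashMap<CacheKey, Arc<Store>>>> = OnceLock::new();

// Return a cached store for this (url, options) pair, creating it on first use.
fn get_store(base_url: &str, options: &[(String, String)]) -> Arc<Store> {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    guard
        .entry((base_url.to_string(), options.to_vec()))
        .or_insert_with(|| {
            Arc::new(Store {
                base_url: base_url.to_string(),
            })
        })
        .clone()
}
```

With this shape, two get_opts calls against the same bucket with the same options share one client instead of rebuilding it each time.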

Alternatives I would be happy to discuss:

  • making object_store more of an explicit item inside the code base, separating it from format and maybe starting to pass it around (e.g. init in cli entrypoint, pass down to others)
  • adding some long-living pool of per-bucket/creds object stores that are instantiated and utilized explicitly.

Overall, I'd like to discuss if it is a good approach to the problem or if you have other possible directions to explore.

Checklist

Delete any checklist items that do not apply (e.g. if your change is minor, it may not require documentation updates).

  • Unit tests
  • Documentation, including doctests
  • Git history is linear
  • Commit messages are descriptive
  • (optional) Git commit messages follow conventional commits
  • Code is formatted (cargo fmt)
  • cargo test
  • Changes are added to the CHANGELOG

@alekzvik alekzvik marked this pull request as ready for review December 10, 2024 12:01
@alekzvik alekzvik requested a review from gadomski as a code owner December 10, 2024 12:01
@alekzvik

alekzvik commented Dec 10, 2024

The first commit adds just reading from the env, the second adds a cache for object_store objects for more efficient re-use.

I have not finished dealing with errors in this, but I want your initial opinion to see if I should proceed in this direction.

The test failures are unrelated and due to a 403 HTTP error in validation (should we mock the schema query there to be less dependent on the internet?)


@gadomski gadomski left a comment


Generally looks good! I think I'd prefer a struct-specific cache instead of a global one. So maybe an ObjectStore structure that holds the cache?

@gadomski

The test failures are unrelated and due to 403 HTTP error in validation (should we mock the schema query there to be less dependent on the internet?)

Yeah, I've run into this before, I forget how I solved it initially (it's something to do with cloudflare being cranky about our requests for something that's a redirect). While I do want to test network-based validation (the STAC ecosystem does it a lot) I agree that network-based tests are fragile and bad. I'd be down for even just ignoring them for now and adding a tracking issue to fix.

@alekzvik

Generally looks good! I think I'd prefer a struct-specific cache instead of a global one. So maybe a ObjectStore structure that holds the cache?

I am a bit unclear on this.

The current idea is that the ObjectStore client remains "hidden" from everyone and the only way to create it is to call get_opts or put_opts. Those functions will parse the URL and options and either make one (and put it in the cache) or get one from the static cache. The main goal here is if, for example, I use a CLI translate function to read and then write a STAC object, and both input & output paths use the same bucket, I will reuse the same client for both IO operations.

If we put the cache in a struct, who will be responsible for creating it? How do I make it outlive the get_opts and put_opts functions so that they can both reuse the same store?

Note

Just a reminder that I started learning Rust a few weeks ago and might miss very basic things.

@gadomski

If we put the cache in a struct, who will be responsible for creating it? How do I make it outlive the get_opts and put_opts functions so that they can both reuse the same store?

Yup, it's a good call-out. I'd like to expand our API to include stac::ObjectStore or some such, a structure that folks can use to get and put STAC things. get_opts would create a one-shot instantiation of ObjectStore to do its request and then destroy it.

This mirrors the pattern of crates like reqwest, which has reqwest::get for single use and a Client for multi-use.
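That split can be sketched in a few lines of std-only Rust. The names (`ObjectStoreClient`, the `calls` counter) are hypothetical placeholders; reqwest's real Client does far more:

```rust
// Hypothetical sketch of the reqwest-style split: a free function for
// one-shot use and a reusable struct that keeps state across calls.
struct ObjectStoreClient {
    // In a real implementation this would hold the per-bucket cache.
    calls: u32,
}

impl ObjectStoreClient {
    fn new() -> Self {
        ObjectStoreClient { calls: 0 }
    }

    // Multi-use: state inside `self` outlives individual requests.
    fn get(&mut self, href: &str) -> String {
        self.calls += 1;
        format!("got {href} (call #{})", self.calls)
    }
}

// One-shot: builds a throwaway client, uses it once, drops it.
fn get(href: &str) -> String {
    ObjectStoreClient::new().get(href)
}
```

The design choice is that the free function stays convenient for scripts, while anyone doing bulk IO can hold the struct and amortize setup cost.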

@alekzvik

alekzvik commented Dec 12, 2024

Intro

OK, now I understand what you want.

Let me explain why I think a global cache is a better approach.

Before implementing this, I looked at who uses the object_store crate and how. Two examples jumped out: datafusion & polars. Both have the same concept of a cache for object_store clients but approach it differently.

Datafusion

The main datafusion interaction pattern is to create a SessionContext and interact with it (example from docs). This is very similar to the session pattern that reqwest uses. The context holds a RuntimeEnv (source) that has a pool for object stores (source). Stores are identified by a base URL, and you must register them in advance. By default, only the file-based object_store is initialized. If you want to use cloud storage, you need to register the store explicitly.
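The registration pattern above can be sketched with a std-only registry. These names (`ObjectStoreRegistry`, `register`) are hypothetical simplifications; DataFusion's real RuntimeEnv API differs:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for a configured object store client.
struct Store;

// Hypothetical sketch of the DataFusion-style registry: stores are keyed by
// base URL and must be registered before any read can find them.
#[derive(Default)]
struct ObjectStoreRegistry {
    stores: HashMap<String, Arc<Store>>,
}

impl ObjectStoreRegistry {
    fn register(&mut self, base_url: &str, store: Arc<Store>) {
        self.stores.insert(base_url.to_string(), store);
    }

    // Lookup only succeeds for buckets that were registered in advance.
    fn get(&self, base_url: &str) -> Option<Arc<Store>> {
        self.stores.get(base_url).cloned()
    }
}
```

The key property is that a lookup for an unregistered bucket simply fails, which is exactly the friction described below for dynamically discovered STAC links.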

Polars

Polars' main interaction pattern is different: you deal with data frames (example from docs). Dataframes can be read from anywhere, so polars builds object stores based on a URL (source) and uses a global cache (source).

stac-rs

I would argue that our use cases and interaction patterns are much closer to polars than to datafusion. I would even go one step further and say that ours are more dynamic.

Note

An important thing to understand is that object_store operates at a per-bucket level.

Our main "object" that we interact with is STAC. There is a recommendation that STACs should be colocated, but no guarantee. Consider the following example: I am working with a public STAC of raster imagery; I apply a set of filters, produce a new collection, and save it to my cloud storage. My collection is located in one bucket, my items are in a different bucket. This is a valid STAC that we must deal with. When I receive a collection, there is no way for me to construct an object_store in advance, as I do not know where all the links will lead.

Because of the dynamic nature of STAC, it can potentially be located in more than one bucket (or even cloud provider). At the same time, STAC as a format can have huge numbers of files. So, I would argue we need to focus both on the dynamic aspect of reacting to new locations and on being able to fetch multiple files efficiently. The main point is that we never have a guarantee that one object store is enough, so we need to be prepared to work with multiple. And exposing this outside seems cumbersome.

Alternative API

Let's explore the alternative, where we expose a registry that lives in its own type and has a more limited lifetime:

let url = Url::parse("<some stac url>")?;
let mut registry = ObjectStoreRegistry::new();
registry.register_store(&url);
let value = stac::io::get(&url, &registry)?;
for link in value.links() {
    // ??? who registers stores for links into other buckets?
}

In this example, if we use the datafusion approach, we need to decide somewhere whether to create a new store. It seems that exposing the ObjectStoreRegistry does not give any benefits and instead makes interactions more cumbersome.

Conclusion

Given all of the above, I chose the global cache approach, as it is flexible enough, the implementation is straightforward, and user interaction is streamlined.

Pros

  • You can always use options to adjust the underlying store.
let value = stac::io::get_opts(stac_url, vec![("aws_secret", "****")]);
  • If you do not need to think about object stores, you do not have to
  • You get the benefits of reusing one client

Cons

  • This cache trickery is implicit. Implicit things are usually bad. But I would argue that in this case, as a user of a library, I do not want to deal with multiple object stores and would rather focus on STACs.
  • Typing and the explicit/implicit choice. While you still have a way to configure your object store, it is all based on a "vec of tuples of strings", and you do not benefit from type checks. The alternative, where you construct your own object store, seems more robust. We could consider adding another function that accepts an object store and does IO with it or adds it to the cache.
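To make the typing con concrete: with stringly options, a misspelled key is at best a runtime error, whereas a typed config would surface it at compile time. A hedged sketch, where `S3Config` and `parse_options` are hypothetical names:

```rust
// Hypothetical typed config that a "vec of tuples of strings" could be
// validated into. A typo like "aws_secert" in the string form is only
// caught at runtime, here; a typed struct field would fail to compile.
#[derive(Debug, Default, PartialEq)]
struct S3Config {
    aws_secret: Option<String>,
    aws_region: Option<String>,
}

// Validate stringly options into the typed config, rejecting unknown keys.
fn parse_options(options: &[(String, String)]) -> Result<S3Config, String> {
    let mut config = S3Config::default();
    for (key, value) in options {
        match key.as_str() {
            "aws_secret" => config.aws_secret = Some(value.clone()),
            "aws_region" => config.aws_region = Some(value.clone()),
            other => return Err(format!("unknown option: {other}")),
        }
    }
    Ok(config)
}
```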

@gadomski

I don't hate that, and I agree that we're more likely to be used in a polars-style use case. Thanks for writing it out.

Since it is so tricky, my initial instinct is to make a new crate for it, so we don't have to break the core crate's semver when we (inevitably) change everything.

@alekzvik alekzvik force-pushed the feat-read-cloud-creds-from-env branch from 128c144 to bf909cd Compare December 21, 2024 12:47
@alekzvik alekzvik force-pushed the feat-read-cloud-creds-from-env branch from bf909cd to 3e2f3e7 Compare December 21, 2024 12:47
@alekzvik

Changes since last time:

  • Move the code to a separate crate
  • Add new crate to CI
  • use the crate in core under the already-used object-store feature

Please let me know if anything can be improved.

Oh, and also: at some point I merged main into my branch, then I saw that merge commits are not allowed, so I rebased and force-pushed. At this point the history is linear, but merging is still blocked. Is there anything I can do?


@gadomski gadomski left a comment


Seems reasonable to me; I requested a couple of changes. However, it doesn't seem to work:

cargo run --all-features -- translate https://raw.githubusercontent.com/radiantearth/stac-spec/refs/heads/master/examples/simple-item.json

This command works on main.

use url::Url;

// To avoid memory leaks, we clear the cache when it grows too big.
// The value does not have any meaning, other than polars use the same.

Can you link to the polars code that uses it? Credit where credit is due.

/// Parameter set to identify and cache an object store
#[derive(PartialEq, Eq, Hash, Debug)]
struct ObjectStoreIdentifier {
/// A base url to the bucket.

It's not always a bucket, right? I.e. Azure doesn't call them buckets.

{
let cache = OBJECT_STORE_CACHE.read().await;
if let Some(store) = (*cache).get(&object_store_id) {
return Ok((store.clone(), path));

Can you add some tracing::debug statements here and Below when we write to (and sometimes clear) the cache?

Comment on lines +179 to +183
{
let cache = OBJECT_STORE_CACHE.read().await;
println!("{cache:#?}")
}


Suggested change
{
let cache = OBJECT_STORE_CACHE.read().await;
println!("{cache:#?}")
}

let (store, _path) = parse_url_opts(&url, options.clone()).await.unwrap();

let url2 = Url::parse("s3://other-bucket/item").unwrap();
// let options2: Vec<(String, String)> = vec![(String::from("some"), String::from("option"))];

Suggested change
// let options2: Vec<(String, String)> = vec![(String::from("some"), String::from("option"))];

//!
//! Features:
//! - cache used objects_stores based on url and options
//! - read cloud creadentials from env

Suggested change
//! - read cloud creadentials from env
//! - Read cloud credentials from your environment

//! Work with [ObjectStore](object_store::ObjectStore) in STAC.
//!
//! Features:
//! - cache used objects_stores based on url and options

Suggested change
//! - cache used objects_stores based on url and options
//! - Cache used object stores based on url and options

@gadomski

gadomski commented Dec 23, 2024

Also, as a note, nothing about this is specific to STAC, so we should keep an eye on either object-store or another package to provide this for us ... e.g. if someone could rip out the polars stuff to its own crate.
