Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't write a STAC Parquet file from a list of Items #462

Closed
alexgleith opened this issue Oct 17, 2024 · 13 comments
Closed

Can't write a STAC Parquet file from a list of Items #462

alexgleith opened this issue Oct 17, 2024 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@alexgleith
Copy link

Code is here:

import stacrs
from ldn.utils import get_s3_keys
from concurrent.futures import ThreadPoolExecutor

from pystac import Item

keys = get_s3_keys()

def fetch_item(key):
    return Item.from_file(f"https://{key}")

with ThreadPoolExecutor(max_workers=10) as executor:
    items = list(executor.map(fetch_item, keys))

print(f"Found {len(items)} items")

stacrs.write("items.parquet", items)

It fails with the error:

TypeError: unsupported type Item.

But type(items[0]) results in:

pystac.item.Item

@alexgleith
Copy link
Author

Also failing if converted to ItemCollection:

ic = ItemCollection(items)
stacrs.write("items.parquet", ic)

TypeError: unsupported type ItemCollection

@alexgleith
Copy link
Author

Does work with a dict!

ic = ItemCollection(items)
stacrs.write("items.parquet", ic.to_dict())

@gadomski gadomski added the bug Something isn't working label Oct 18, 2024
@gadomski gadomski self-assigned this Oct 18, 2024
@gadomski
Copy link
Member

I think you need to:

stacrs.pystac.write("items.parquet", items)

However, I'm thinking that this "do all pystac things through stacrs.pystac" model is pretty confusing for users. Let me see if I can't simplify the API to be auto-magic w/ dicts, lists, and pystac stuff.

@alexgleith
Copy link
Author

Yeah, there should be no need to know the distinction between the rust implementation and the python implementation, I don't think.

@alexgleith
Copy link
Author

Also, I don't know why it's stacrs.pystac. pystac is already taken, and holds all the times and handy structures for stac objects.

@gadomski
Copy link
Member

Yeah, there should be no need to know the distinction between the rust implementation and the python implementation, I don't think.

Well, there's a difference between the object you're passing in. A pystac.Item is a different thing than a dict[str, Any]. I was wary about adding pystac support at all for this reason — part of me just want to only work in dict and list. But I want to make sure we're making an API that makes sense for folks.

Also, I don't know why it's stacrs.pystac. pystac is already taken, and holds all the times and handy structures for stac objects.

It's a module that's meant for interoperability with pystac objects, so I figured it was the most obvious name. Open to other suggestions, of course!

NB I want to maintain "no dependencies by default" for this package, so I would have a strong preference to not depend on pystac (optional dependency is fine).

@alexgleith
Copy link
Author

NB I want to maintain "no dependencies by default" for this package

Ok, but is it really that important? This is pretty closely coupled to pystac. Is it because it'll be used in Rust more?

Maybe there should be a separate Python wrapper, and a core Rust implementation that is dependency free?

All of my interactions with STAC items in Python use pystac, and it makes things just easy. I just think it's a weird developer experience to be going back to handling dictionaries, rather than nice Python objects!

@gadomski
Copy link
Member

Ok, but is it really that important? This is pretty closely coupled to pystac. Is it because it'll be used in Rust more?

It's not coupled to pystac at all ... it only knows how to work with pystac objects if pystac is present on the system. I do think it's important to stay dependency-free — there's a lot of folks out there that don't use pystac, and I don't see any reason to pull it in if we don't need to.

Maybe there should be a separate Python wrapper, and a core Rust implementation that is dependency free?

You mean like a separate package? Yeah, that's a good idea — or we could even add it as an optional dependency to pystac 🤔 — maybe I like that, interesting.

I just think it's a weird developer experience to be going back to handling dictionaries, rather than nice Python objects!

I generally agree, and try to get people to use pystac, but many prefer dicts (including some core STAC folks cough @matthewhanson cough) or don't even know about pystac (this is common on NASA projects, I've found).

@alexgleith
Copy link
Author

Ok, I see where you're coming from.

But I really find pystac super convenient and don't want to go dead reckoning STAC docs as Python dictionaries. I love that i can just do item.validate()!

So, it's not super impactful passing in and out dictionaries, but it feels clumsy is all. Being able to transact in Items and ItemCollections would be better, in my humble opinion 😄.

@gadomski
Copy link
Member

I think part of my hesitation is that I don't want to write a pystac replacement here ... there's so much effort that's gone in to pystac that I want to make sure we keep putting sweat in over there. I just want to supplement, which is why making stacrs an optional dependency on pystac is sort of appealing (aka pip install pystac[rs] or some-such).

@gadomski
Copy link
Member

@alexgleith after some internal discussions with @kylebarron and @vincentsarago, I think I'm leaning towards the KISS route for stacrs and will remove stacrs.pystac. If folks want to use library objects, e.g. pystac or stac-pydantic, with stacrs functionality, we can build that interop in those libs. This feels especially reasonable because stacrs will then be zero-dependency always, making it an easy upstream for other libs.

@vincentsarago
Copy link
Member

To be honest my main concern is that if you try to allow pystac Object, we'll basically just use to_dict() to convert the object to dictionary. Then some users might come and say they want to control how the to_dict() method behaves. As a software maintainer is always better to defer the action to the users.

I don't think just allowing Dict is so much a burden.

@alexgleith
Copy link
Author

Ok, no worries and I understand the reasoning here.

I was confused initially as I saw the pystac code in here and assumed that was what was intended to be run as a wrapper around the rust functions.

I do like the idea of getting this stacrs functionality around reading Parquet files into another existing pystac library, perhaps pystac.client is a better target.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants