"readers" or "parsers" or what should we call them? #239

Open
TomNicholas opened this issue Sep 13, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation

Comments

@TomNicholas
Member

TomNicholas commented Sep 13, 2024

I'm currently calling the functions that extract the metadata from a specific filetype (or many filetypes in kerchunk's case) "readers".

I feel like "reader" implies that they read actual array data into memory, but they don't do that. It's also confusing alongside Zarr "readers", which actually do read array data into memory.

But is there a less overloaded term? e.g:

  • "parsers" - because they usually involve scanning a file and "parsing" out the metadata to be separate from the array data
  • "scanner"?
  • "inspector"?
  • "extractor"?

As of #231 these currently all live in a module called virtualizarr.readers.

We also have virtualizarr.writers, but that's less of a problem given that (a) there will probably only ever be two of these, (b) there aren't existing xarray backends you could get this confused with, and (c) since you are writing metadata, "writer" is not a bad term for it.

(Related to pydata/xarray#9491, thought of this whilst proposing the API in #238)

@TomNicholas TomNicholas added the documentation Improvements or additions to documentation label Sep 13, 2024
@douglatornell

What about "get_metadata"? That explicitly says what the function does, and it can live comfortably beside functions that read array data.

@TomNicholas
Member Author

@douglatornell that certainly describes clearly what they do! I just don't know what the corresponding noun would be: vz.open_virtual_dataset(... metadata_getter=...) hardly rolls off the tongue 😅

@douglatornell

Perhaps I'm not understanding the nuances of the API.

In vz.open_virtual_dataset(... metadata_getter=...), the metadata_getter=... bit would be an item in reader_options, correct? So, why not make it

vz.open_virtual_dataset(... metadata=vz.readers.get_kerchunk_metadata, ...)

for example?

But do you need to pass a function name at that level? Why not

vz.open_virtual_dataset(... metadata="kerchunk", ...)

and do the dispatching to the appropriate vz.readers.get_*_metadata() function in vz.open_virtual_dataset() where you are handling so many of the other details of different storage formats. I suppose doing that would preclude a user-defined get_*_metadata() function, if that is a feature you want/need to support.
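To make the suggestion above concrete, here is a minimal sketch of dispatching on a name while still accepting a user-defined function. Everything in it is an assumption for illustration: the `metadata=` argument is only the proposal from this comment, and `get_kerchunk_metadata`, `get_dmrpp_metadata`, `resolve_metadata_getter`, and `open_virtual_dataset_sketch` are hypothetical stand-ins that return plain dicts, not actual virtualizarr functions.

```python
from typing import Callable, Union

# Hypothetical stand-ins for per-format metadata getters.
# They return plain dicts so the sketch runs without any dependencies.
def get_kerchunk_metadata(filepath: str) -> dict:
    return {"source": "kerchunk", "filepath": filepath}


def get_dmrpp_metadata(filepath: str) -> dict:
    return {"source": "dmrpp", "filepath": filepath}


# Assumed registry mapping short names to the bundled getters.
_METADATA_GETTERS: dict[str, Callable[[str], dict]] = {
    "kerchunk": get_kerchunk_metadata,
    "dmrpp": get_dmrpp_metadata,
}


def resolve_metadata_getter(
    metadata: Union[str, Callable[[str], dict]],
) -> Callable[[str], dict]:
    """Return a getter for either a registered name or a user-supplied callable."""
    if callable(metadata):
        # A user-defined get_*_metadata() function is used as-is.
        return metadata
    try:
        return _METADATA_GETTERS[metadata]
    except KeyError:
        raise ValueError(
            f"unknown metadata getter {metadata!r}; "
            f"expected one of {sorted(_METADATA_GETTERS)}"
        ) from None


def open_virtual_dataset_sketch(filepath: str, metadata="kerchunk") -> dict:
    """Hypothetical top-level function that dispatches to the chosen getter."""
    getter = resolve_metadata_getter(metadata)
    return getter(filepath)
```

Accepting either a string or a callable would mean that name-based dispatch does not have to preclude user-defined functions; whether that is the right trade-off for the real API is left open here.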

@TomNicholas
Member Author

> the metadata_getter=... bit would be an item in reader_options, correct?

Not quite - reader_options is passed on to kerchunk readers, but there are other readers that are not based on kerchunk (e.g. the dmr++ reader).

> a user-defined get_*_metadata() function

I do think we want to support that. There are a few common file formats that are parseable to the zarr model, plus a long tail of niche formats that are also parseable to the zarr model (see #218). We should ship readers for the common ones bundled here, but allow users to plug their own readers in.

> But do you need to pass a function name at that level?

In general I feel like we should be trying to move towards an entrypoint system, analogous to xarray.open_dataset's BackendEntrypoint class, which is what is accessed when you use the engine=... kwarg to xr.open_dataset.
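As a rough illustration of that direction, an entrypoint-style design might look something like the sketch below. It is only loosely modelled on how xarray's BackendEntrypoint pairs an opening method with a guess_can_open check; `VirtualBackendSketch`, `DMRPPBackendSketch`, `BACKENDS`, and `open_with_backend` are hypothetical names invented for this example, and the returned dicts stand in for real virtual datasets.

```python
from abc import ABC, abstractmethod


class VirtualBackendSketch(ABC):
    """Hypothetical base class that each reader/parser would subclass."""

    @abstractmethod
    def open_virtual_dataset(self, filepath: str) -> dict:
        """Extract metadata from the file without loading array data."""

    def guess_can_open(self, filepath: str) -> bool:
        """Cheap check used when no backend is requested explicitly."""
        return False


class DMRPPBackendSketch(VirtualBackendSketch):
    """Toy backend; a real one would parse the file's metadata."""

    def open_virtual_dataset(self, filepath: str) -> dict:
        return {"backend": "dmrpp", "filepath": filepath}

    def guess_can_open(self, filepath: str) -> bool:
        return filepath.endswith(".dmrpp")


# Assumed registry of bundled backends; third parties could add entries.
BACKENDS = {"dmrpp": DMRPPBackendSketch}


def open_with_backend(filepath: str, backend=None) -> dict:
    """Pick a backend by name, by class, or by guessing from the file path."""
    if backend is None:
        candidates = [
            cls for cls in BACKENDS.values() if cls().guess_can_open(filepath)
        ]
        if not candidates:
            raise ValueError(f"no registered backend can open {filepath!r}")
        backend_cls = candidates[0]
    elif isinstance(backend, str):
        if backend not in BACKENDS:
            raise ValueError(f"unknown backend {backend!r}")
        backend_cls = BACKENDS[backend]
    else:
        # A user-defined subclass is accepted directly.
        backend_cls = backend
    return backend_cls().open_virtual_dataset(filepath)
```

With this shape, the naming question becomes what to call the subclasses and the keyword argument, since users would pass a name or a class rather than a bare function.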
