Xpublish as the core of a next-gen data server #139

abkfenris · 2022-12-09T16:35:00Z

abkfenris
Dec 9, 2022
Maintainer

Xpublish has the capability to become the core of a next generation of data server. Right now it's quickly extendable and hackable, but I think we need to start building some consensus on what's next for the project to keep progressing.

Where do we go from here?

I think we need to start looking at Xpublish from several different directions.

As an extendable core

This is the most similar to what Xpublish is at this moment.

Right now Xpublish supports manually instantiating with a single dataset or a collection of datasets, but both the dataset loading and the routers are extendable, which provides a natural separation of concerns between dataset loading and serving.

I think the current project can evolve further to define an extendible core to make it easier to add and swap out routers/data and configuration loaders and other elements.

Alternatively, the core and extension points could be broken out into a separate package, and the Xpublish package stays as an xarray.Dataset.rest accessor.
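
To make the "extendable core" idea a bit more concrete, here is a minimal, stdlib-only sketch. It is not Xpublish's actual API: DatasetProvider, InMemoryProvider, Router and Core are hypothetical names, and plain dicts stand in for xarray datasets. It only illustrates dataset loading and serving meeting at two narrow interfaces, so that either side can be swapped out.

```python
from typing import Callable, Dict, List, Optional, Protocol


class DatasetProvider(Protocol):
    """Hypothetical extension point: anything that can list and load datasets."""

    def dataset_ids(self) -> List[str]: ...

    def load(self, dataset_id: str) -> Optional[dict]: ...


class InMemoryProvider:
    """Serves datasets that were handed over at instantiation time."""

    def __init__(self, datasets: Dict[str, dict]) -> None:
        self._datasets = datasets

    def dataset_ids(self) -> List[str]:
        return sorted(self._datasets)

    def load(self, dataset_id: str) -> Optional[dict]:
        return self._datasets.get(dataset_id)


# In this sketch a "router" is just a function from a dataset to a response.
Router = Callable[[dict], dict]


class Core:
    """Hypothetical core: it knows about providers and routers, nothing else."""

    def __init__(self) -> None:
        self.providers: List[DatasetProvider] = []
        self.routers: Dict[str, Router] = {}

    def add_provider(self, provider: DatasetProvider) -> None:
        self.providers.append(provider)

    def add_router(self, name: str, router: Router) -> None:
        self.routers[name] = router

    def handle(self, dataset_id: str, route: str) -> dict:
        # Ask each provider in turn; the first one that knows the id wins.
        for provider in self.providers:
            dataset = provider.load(dataset_id)
            if dataset is not None:
                return self.routers[route](dataset)
        raise KeyError(dataset_id)


core = Core()
core.add_provider(InMemoryProvider({"air": {"variables": ["temp", "wind"]}}))
core.add_router("keys", lambda ds: {"keys": ds["variables"]})
result = core.handle("air", "keys")
```

Swapping InMemoryProvider for something that reads a catalog would not touch the routers, which is the separation of concerns described above.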

As a distro (or many)

Many data managers don't want to always be mucking around in the code of their data server. They would rather feed their server a config file, and have an array of services stood up for their datasets.
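
As a rough illustration of "feed the server a config file", the sketch below turns a config into a set of services per dataset. Everything in it is an assumption made for illustration: the config layout, the PLUGIN_FACTORIES registry, the dataset name, and the use of JSON (chosen only because it is in the standard library; a real distro might well prefer YAML).

```python
import json

# Hypothetical registry of the plugin factories a distro ships with.
# Each factory takes an options dict and returns a configured service.
PLUGIN_FACTORIES = {
    "zarr": lambda options: {"plugin": "zarr", **options},
    "wms": lambda options: {"plugin": "wms", **options},
}


def build_services(config_text: str) -> dict:
    """Return {dataset_id: [configured services]} from a JSON config string."""
    config = json.loads(config_text)
    services = {}
    for dataset_id, dataset_cfg in config["datasets"].items():
        configured = []
        for name, options in dataset_cfg.get("services", {}).items():
            if name not in PLUGIN_FACTORIES:
                raise ValueError(
                    f"unknown service {name!r} for dataset {dataset_id!r}"
                )
            # Every service gets the dataset path plus its own options.
            factory = PLUGIN_FACTORIES[name]
            configured.append(factory({"path": dataset_cfg["path"], **options}))
        services[dataset_id] = configured
    return services


EXAMPLE_CONFIG = """
{
  "datasets": {
    "gom_sst": {
      "path": "/datasets/gom_sst.zarr",
      "services": {"zarr": {}, "wms": {"default_style": "raster"}}
    }
  }
}
"""

services = build_services(EXAMPLE_CONFIG)
```

The data manager only ever edits the config; which factories exist is the distro's decision.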

Various organizations also like to say 'we've standardized on such-and-such'. For example, IOOS (the U.S. Integrated Ocean Observing System) has said that the various regions serving them data should implement ERDDAP and THREDDS. (I do data management and development for one of the IOOS regions)

So I think we should start at least one Xpublish-based server that's designed to be something that various data managers would look at running. By building a few Xpublish-based distros we can figure out what's best as part of the core package, where the core should be extended, and what might be better off being a plugin.

Different distros may be focused on different audiences. We see some of that with current data servers and systems. ERDDAP is largely focused on distilling data for you, whereas THREDDS is more focused on providing services (there is some crossover). We also may see that different audiences are producing and consuming data with different levels of metadata and structure.

As a collection of routers/plugins

In between we have individual packages or plugins for various types of routers and other Xpublish extension points. By pulling the routers and other functionality out into standalone plugins, it makes it easier to iterate and reason over them, in addition to providing the flexibility to assemble different collections of plugins together into different distros or one-off instances.

For example, some routers may require more processing than others (dynamically reprojecting and generating web map tiles), so an admin might not want those plugins as they don't want to run a dask cluster, or they may want to swap out or customize the data loader.

While we are here, a few plugin ideas

WMS router
THREDDS or ERDDAP compatible dataset loader
STAC Catalog dataset loader & router
OGC EDR router
Frontend allowing selection of datasets similar to ERDDAP
IOOS QC validation
Dynamic ZarrTile generation - experimented with along with other routers during the IOOS cloud sprint
Kubernetes compatible health check endpoint

Similar projects that we may be able to collaborate with

ZarrDAP
- Also built with FastAPI, but with a bit more of a limited dataset loader (zarr and NetCDF in S3)
- Currently focused on implementing OpenDAP support
- I think that ZarrDAP could become a distro of Xpublish.
TiTiler
- Also a FastAPI app
- Focused on dynamically tiling cloud optimized GeoTIFFs
- Can be deployed to AWS Lambda functions
- Looks like some of the core processing has already been refactored into rio-tiler
XREDS
- Starting with the Xpublish distro idea, and using some external packages to implement additional routers

abkfenris · 2022-12-11T16:53:30Z

abkfenris
Dec 11, 2022
Maintainer Author

I had some momentum, so I took a swing at implementing a plugin system: #140

0 replies

benbovy · 2022-12-13T10:17:56Z

benbovy
Dec 13, 2022
Maintainer

Thanks for your thoughtful comments @abkfenris!

I think the current project can evolve further to define an extendible core to make it easier to add and swap out routers/data and configuration loaders and other elements. Alternatively, the core and extension points could be broken out into a separate package, and the Xpublish package stays as an xarray.Dataset.rest accessor.

Not sure exactly what you mean here. It is already easy to add/remove routers (or plugins in #140), but maybe your suggestion is to provide "standardized" xpublish extension mechanisms for other components of the application (e.g., data loaders, middlewares, settings, etc.)?

In general I fully agree with your thoughts.

IMO Xpublish core (this repository) should provide only the bare minimum set of routers (plugins). I'd even move the zarr router into a separate repository. I think it is better to have a lot of individual repositories / packages, each providing some xpublish plugin that defines a set of API routes, assuming that those plugins are easily discoverable.
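
One way separately packaged plugins could be made discoverable at runtime is through Python entry points. The sketch below uses the standard library's importlib.metadata; the group name "xpublish.plugin" and the idea of a disabled list are assumptions for illustration, not something this thread has settled on.

```python
from importlib.metadata import entry_points


def discover_plugins(group="xpublish.plugin", disabled=()):
    """Load every entry point in `group`, skipping names listed in `disabled`.

    Returns a dict mapping the entry point name to the loaded object.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a selectable interface.
        candidates = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        candidates = eps.get(group, [])
    return {ep.name: ep.load() for ep in candidates if ep.name not in disabled}
```

A plugin package would declare an entry point in that group in its packaging metadata, and an admin who does not want a heavy plugin could call something like discover_plugins(disabled={"wms"}).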

We might also want to make those plugins flexible and parameterizable. For example, allow restricting certain plugins or individual routes to authenticated users (#100), e.g., by adding the possibility to inject additional fastapi dependencies like those provided by fastapi-users. Such an extension mechanism is implemented in titiler (in general I think we could take inspiration from titiler for many things).

It would be nice if Xpublish core could also provide convenient extension mechanisms for application settings (possibly reusing pydantic's BaseSettings, #51) and middlewares. Ideally it would be possible for plugins to expose their own settings.
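
To sketch what "plugins exposing their own settings" could look like, here is a stdlib-only stand-in for the idea. It deliberately does not use pydantic's BaseSettings, which handles validation and type coercion far more thoroughly; PluginSettings, WMSSettings, the field names and the XPUBLISH_ prefix are all hypothetical.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class PluginSettings:
    """Hypothetical base class: fields are filled from prefixed environment variables."""

    @classmethod
    def env_prefix(cls) -> str:
        return "XPUBLISH_"

    @classmethod
    def from_env(cls, environ=None):
        environ = os.environ if environ is None else environ
        values = {}
        for field in fields(cls):
            key = (cls.env_prefix() + field.name).upper()
            if key in environ:
                # Naive coercion: call the annotated type on the raw string.
                # Fine for int and str, wrong for bool ("false" is truthy).
                values[field.name] = field.type(environ[key])
        return cls(**values)


@dataclass
class WMSSettings(PluginSettings):
    """Settings a hypothetical WMS plugin would expose."""

    tile_size: int = 256
    default_style: str = "raster"

    @classmethod
    def env_prefix(cls) -> str:
        return "XPUBLISH_WMS_"


# Pass a dict instead of os.environ to keep the example deterministic.
settings = WMSSettings.from_env({"XPUBLISH_WMS_TILE_SIZE": "512"})
```

Each plugin owns its prefix and defaults, while the core only needs to know that every plugin can build its settings from the environment.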

Contrary to API routers, Xpublish core could come batteries included regarding all the common boilerplate things that would help make deployment easier, e.g., a set of basic middlewares for logging, diagnostics, etc., extra dependencies like #54... It could even provide a basic command-line interface and configuration file system, like pygeoapi but simpler, although this is probably more a "nice to have" at this point.

All those helpers and extension mechanisms shouldn't prevent integrating Xpublish into other fastapi applications in order to fully leverage fastapi and avoid xpublish "lock-in". It would be nice if xpublish stays hackable to some reasonable point. I guess the needs will vary a lot from one application to another.

Regarding distros, one way to achieve it could be via conda(-forge) meta-packages like this one: pangeo-notebook. Not sure if/how it is possible to define meta packages on pypi, though.

4 replies

abkfenris Dec 13, 2022
Maintainer Author

Yes, I think we're largely on the same page.

Thanks for your thoughtful comments @abkfenris!

I think the current project can evolve further to define an extendible core to make it easier to add and swap out routers/data and configuration loaders and other elements. Alternatively, the core and extension points could be broken out into a separate package, and the Xpublish package stays as an xarray.Dataset.rest accessor.

Not sure exactly what you mean here. It is already easy to add/remove routers (or plugins in #140), but maybe your suggestion is to provide "standardized" xpublish extension mechanisms for other components of the application (e.g., data loaders, middlewares, settings, etc.)?

I may have mixed a few different thoughts together in there. I think we should have a standardized set of extension mechanisms. Where those extension mechanisms live is the real question. Do they live in xpublish itself, or do they live in a new package?

Part of the decision making is what do we do with the current ds.rest accessor. Should it live in the same repo (I think it should probably be in a separate file at least) or somewhere else?

In general I fully agree with your thoughts.

IMO Xpublish core (this repository) should provide only the bare minimum set of routers (plugins). I'd even move the zarr router into a separate repository. I think it is better to have a lot of individual repositories / packages, each providing some xpublish plugin that defines a set of API routes, assuming that those plugins are easily discoverable.

Yes, I think this repo should stay with a minimum number of routers/plugins. I kind of like having the base and zarr routers/plugins here as demonstrations, but I would also be up to moving them. I do think we might want to add some default prefixes and tags for them going forwards.

We might also want to make those plugins flexible and parameterizable. For example, allow restricting certain plugins or individual routes to authenticated users (#100), e.g., by adding the possibility to inject additional fastapi dependencies like those provided by fastapi-users. Such an extension mechanism is implemented in titiler (in general I think we could take inspiration from titiler for many things).

Yes, but we probably don't need to start with that.

It would be nice if Xpublish core could also provide convenient extension mechanisms for application settings (possibly reusing pydantic's BaseSettings, #51) and middlewares. Ideally it would be possible for plugins to expose their own settings.

Contrary to API routers, Xpublish core could come batteries included regarding all the common boilerplate things that would help make deployment easier, e.g., a set of basic middlewares for logging, diagnostics, etc., extra dependencies like #54... It could even provide a basic command-line interface and configuration file system, like pygeoapi but simpler, although this is probably more a "nice to have" at this point.

I'd prefer to make sure that the core has the right extension points, so that all of those can be provided by plugins rather than baking them all in. Or at least make them easy to override. Maybe if we do provide them with core, make them all implemented as extensions, so we make sure all of the right hooks are there?
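
A small sketch of what I mean by "make them all implemented as extensions": the default logging behaviour is registered through the same hook that anyone else would use, so it can be replaced rather than fought. HookRegistry, on_exception and call are hypothetical names, not an existing Xpublish or FastAPI API.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("xpublish.sketch")

ExceptionHook = Callable[[Exception], None]


class HookRegistry:
    """Hypothetical extension point for reacting to unhandled exceptions."""

    def __init__(self) -> None:
        self._exception_hooks: List[ExceptionHook] = []

    def on_exception(self, hook: ExceptionHook) -> ExceptionHook:
        """Register a hook; returns it so this also works as a decorator."""
        self._exception_hooks.append(hook)
        return hook

    def clear(self) -> None:
        """Drop every hook, including defaults, so they can be replaced."""
        self._exception_hooks.clear()

    def call(self, func, *args, **kwargs):
        """Run func, notify every hook on failure, then re-raise."""
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            for hook in self._exception_hooks:
                hook(exc)
            raise


registry = HookRegistry()


# The "batteries included" default is just another registered hook.
@registry.on_exception
def log_exception(exc: Exception) -> None:
    logger.error("unhandled error: %r", exc)
```

Someone who wants exceptions sent to an error tracker instead would call registry.clear() and register their own hook, with no need to patch the core.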

I'm currently fighting a tool that has its own strong opinions on how logging and exception handling should happen, and not quite enough hooks to be able to override and send exceptions to Sentry cleanly, so I've got some opinions on this matter at the moment.

All those helpers and extension mechanisms shouldn't prevent integrating Xpublish into other fastapi applications in order to fully leverage fastapi and avoid xpublish "lock-in". It would be nice if xpublish stays hackable to some reasonable point. I guess the needs will vary a lot from one application to another.

Regarding distros, one way to achieve it could be via conda(-forge) meta-packages like this one: pangeo-notebook. Not sure if/how it is possible to define meta packages on pypi, though.

I was largely thinking of non- or low-code data managers as the primary user group for distros.

That means building something even more opinionated for them to use. So their interaction with xpublish may be choosing a distro (which they may not even know is xpublish under the hood!) and then following the setup manual that tells them to run docker run rps/xreds --server_config=config.yaml --dataset_dir=/datasets or similar. Under the hood, a distro would be composed of xpublish + a whole pile of plugins and a small chunk of code to connect and pre-configure them.
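
The "small chunk of code" behind such a command could start as little more than an argument parser. This is only a sketch: the two flags are taken from the example command above, while the program name, defaults and help text are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for a hypothetical Xpublish-based distro."""
    parser = argparse.ArgumentParser(
        prog="distro-server",  # hypothetical entry point name
        description="Serve every dataset in a directory using one config file.",
    )
    parser.add_argument(
        "--server_config",
        type=Path,
        required=True,
        help="path to the server configuration file",
    )
    parser.add_argument(
        "--dataset_dir",
        type=Path,
        default=Path("/datasets"),
        help="directory that holds the datasets to serve",
    )
    return parser


# Parse the same arguments the example command passes to the container.
args = build_parser().parse_args(
    ["--server_config=config.yaml", "--dataset_dir=/datasets"]
)
```

From there the distro would load the config, pick its plugins, and start the server, none of which the data manager has to see.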

They could additionally be on PyPI and Conda as normal since they would be regular packages with dependencies.

I'm kind of thinking of xpublish core and plugins as building blocks. Developers can assemble the blocks in many different ways and swap in their preferred blocks. Distros are the assembled models that we hand to others to play with. (why yes, I was fidgeting with LEGO a lot as I've been pondering xpublish)

jhamman Dec 13, 2022
Maintainer

Part of the decision making is what do we do with the current ds.rest accessor. Should it live in the same repo (I think it should probably be in a separate file at least) or somewhere else?

Happy to see this get broken out in some way. You may also think of this as an opinionated deployment (or distro in the language used above) of Xpublish. In practice, most applications of Xpublish are going to want a more sophisticated deployment strategy here than the accessor, but it is quite convenient to get a user started.

abkfenris Dec 14, 2022
Maintainer Author

I hadn't really thought of the accessor as an opinionated deployment or distro of its own, but that's exactly right.

The more I think about removing it, the more I think we should keep it. It's really effective at showing the power of Xpublish to a new user. We may want to move it to a new file, and maybe throw some warnings that it's not thoroughly tested with all plugins.

It also makes me think that we could simplify, or at least clarify some of the dataset loading/path configuration/dependency setup that's currently in xpublish.Rest.

If we are building things out to support servers, multiple datasets should be the default mode of operation of xpublish.Rest. Then maybe we have a subclass (xpublish.SingleDataset?) that overrides paths and dependencies. Then the accessor can be built upon the single dataset subclass.
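
Roughly, the class layout I have in mind looks like the sketch below. The Rest class here is a toy stand-in and not the real xpublish.Rest; SingleDataset, dataset_prefix, route_path and the "default" key are hypothetical, and plain dicts stand in for xarray datasets.

```python
from typing import Dict, Optional


class Rest:
    """Toy multi-dataset server: every route is scoped by a dataset id."""

    dataset_prefix = "/datasets/{dataset_id}"

    def __init__(self, datasets: Dict[str, dict]) -> None:
        self.datasets = datasets

    def get_dataset(self, dataset_id: Optional[str] = None) -> dict:
        """Dependency that routers would use to look up their dataset."""
        if dataset_id not in self.datasets:
            raise KeyError(f"dataset {dataset_id!r} not found")
        return self.datasets[dataset_id]

    def route_path(self, suffix: str) -> str:
        """Full path template for a router mounted under this server."""
        return self.dataset_prefix + suffix


class SingleDataset(Rest):
    """Overrides paths and the dataset dependency for the one-dataset case."""

    dataset_prefix = ""

    def __init__(self, dataset: dict) -> None:
        super().__init__({"default": dataset})

    def get_dataset(self, dataset_id: Optional[str] = None) -> dict:
        # There is only one dataset, so any id (or none) resolves to it.
        return self.datasets["default"]


air = {"variables": ["temp"]}
multi = Rest({"air": air})
single = SingleDataset(air)
```

The accessor would then only need to construct a SingleDataset, and routers would not have to care which of the two they are mounted on.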

benbovy Dec 14, 2022
Maintainer

It's really effective at showing the power of Xpublish to a new user

Yeah, the accessor looks great in small "getting started" examples and showcases. It is basically some syntactic sugar on top of xpublish.Rest; most advanced cases will use the latter.

It also makes me think that we could simplify, or at least clarify some of the dataset loading/path configuration/dependency setup that's currently in xpublish.Rest.

Agreed, something like an xpublish.RestOneDataset would help clarify things.

I'd prefer to make sure that the core has the right extension points, so that all of those can be provided by plugins rather than baking them all in. Or at least make them easy to override. Maybe if we do provide them with core, make them all implemented as extensions, so we make sure all of the right hooks are there?

I'm currently fighting a tool that has its own strong opinions on how logging and exception handling should happen, and not quite enough hooks to be able to override and send exceptions to Sentry cleanly, so I've got some opinions on this matter at the moment.

Yes, my suggestion was more to provide some convenient helper elements rather than opinionated framework components, i.e., a bunch of middlewares or dependencies that could actually work with any fastapi app and that we provide in Xpublish core so that users don't need to reinvent the wheel. These could live in separate repos / packages as well, but Xpublish core is a nice place for those basic or common things I think. I see it distinct from API route extension plugins (#140).

I was largely thinking of non- or low-code data managers as the primary user group for distros.

Yes I guess there are multiple possible levels for distros, e.g.,

with conda meta-packages: "give me all the plugins I need to setup the dev environment, build and customize my own server for doing xyz"
with docker images: "give me everything I need to setup, deploy and run my own xyz server, using one command line"

jhamman · 2022-12-13T23:22:17Z

jhamman
Dec 13, 2022
Maintainer

I'll comment on some of the specific bits in the ongoing thread but I wanted to say at the top that I'm very supportive of the high level concepts @abkfenris has laid out here. I've long thought that Xpublish needs a set of opinionated deployment concepts with interchangeable routers.

I do think that success here will turn on the documentation and discoverability of the external routers and deployment distros. To this end, I suggest we consider ways to group the various sub-projects in a way that makes them easy to find/use together (new github org, common docs, etc.).

5 replies

abkfenris Dec 14, 2022
Maintainer Author

Thanks Joe!

Documentation and discoverability are definitely going to be key. It probably would help to figure out a handful of user personas so that we can direct them towards the right info. Data managers, please find distros in aisle 4, and plugins in aisle 5, developers, you'll find how to build a plugin in aisle 8, and the API reference you need if you're building a distro in aisle 12...

I think a new Github org would be a great start, to at least encourage various projects to live in the same place. It also would be good to have a dedicated place for discussions.

I think we can probably expect that we'll hear enough about deployments/distros that we can manually add those to the docs, but plugins may be a bit more chaotic.

One way we may be able to wrangle plugins is to make a cookiecutter and/or template repo for folks to start new plugins with. Then in the setup we can include PyPI classifiers to identify the plugins (Framework::Xpublish::Plugin?), along with the Github Actions workflows for building, testing, and deploying plugins to PyPI.

Then for the Xpublish docs we can have a periodic Github Actions workflow that looks for new packages with the classifier and it could raise an issue or even make a PR to update the docs.
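
As a sketch of the classifier idea, the snippet below lists packages that carry a given classifier. Note what it swaps in: it scans the distributions installed in the current environment with importlib.metadata rather than querying PyPI, which the workflow would actually need to do. The classifier string is hypothetical and is not a registered PyPI classifier.

```python
from importlib.metadata import distributions

# Hypothetical classifier; it would have to be registered before PyPI accepts it.
PLUGIN_CLASSIFIER = "Framework :: Xpublish :: Plugin"


def find_classified(classifier: str = PLUGIN_CLASSIFIER) -> list:
    """Return the sorted names of installed packages declaring `classifier`."""
    names = set()
    for dist in distributions():
        # get_all returns None when a package declares no classifiers at all.
        declared = dist.metadata.get_all("Classifier") or []
        if classifier in declared:
            names.add(dist.metadata["Name"])
    return sorted(names)
```

A docs job could diff that list against the plugins already documented and open an issue for anything new.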

benbovy Dec 14, 2022
Maintainer

+1 for moving xpublish into its own github organization together with other related repos (plugins) once it is ready.

Also +1 for a cookiecutter (although it might be challenging to keep it up-to-date if things are moving at a fast pace).

Maybe worth looking at how intake drivers or ansible roles are managed as possible references?

abkfenris Dec 15, 2022
Maintainer Author

Also +1 for a cookiecutter (although it might be challenging to keep it up-to-date if things are moving at a fast pace).

It probably would be easier to start with a template repo for quicker iteration, then once things are more settled, we can make a cookiecutter from it. @ocefpaf maintains one for IOOS packages. I've already threatened to draft him to help here.

Maybe worth looking at how intake drivers or ansible roles are managed as possible references?

Quickly looking, Intake looks like folks are left to their own devices, and Ansible is the same.

ocefpaf Dec 15, 2022

The IOOS template is OK for most scientific packages, but my guess is that we need a plugin template here. A cookiecutter would be nice for this task.

PS: I need to update the IOOS template ASAP! It is dangerously outdated.

abkfenris Dec 15, 2022
Maintainer Author

Oh, I was definitely thinking of a new plugin template rather than reusing the IOOS one as is.
