This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Constrain allowable years and states for filtering #9

Closed
zaneselvans opened this issue Apr 7, 2022 · 2 comments
Labels
epacems The EPA's Continuous Emissions Monitoring System hourly dataset intake Intake data catalogs parquet Apache Parquet is an open columnar data file format.

Comments

@zaneselvans
Member

The EPA CEMS dataset is composed of ~1300 row groups, each containing a unique combination of year and state to allow efficient pushdown filtering by time and location. Only a certain range of years (1995-2020) and set of state abbreviations (continental US plus DC) are valid for filtering. It would be nice if we could at least suggest, and preferably require, that users only filter with valid values, so that a request outside the allowable values raises an error right away instead of making them wait a long time for a query that won't give them anything useful.

Is this easy to set up with the intake catalog? Can we designate an allowable set of values for years and states to be used as filters? How are user parameters meant to be used? I've seen that you can enumerate allowable values there, but they seem only to be for use in Jinja templating of the filenames, and not for things like the filters.
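
To illustrate the fail-fast behavior being asked for, here is a minimal sketch in plain Python that runs before any query is issued. It is not part of the intake API. The names `ALLOWED_YEARS`, `ALLOWED_STATES`, and `validate_selection` are hypothetical, and the state list assumes that "continental US plus DC" means the 48 contiguous states plus DC.

```python
# Hypothetical constants: the year range comes from the issue text (1995-2020);
# the state list is an assumption (48 contiguous states plus DC).
ALLOWED_YEARS = frozenset(range(1995, 2021))
ALLOWED_STATES = frozenset(
    "AL AR AZ CA CO CT DC DE FL GA IA ID IL IN KS KY LA MA MD ME MI MN MO MS "
    "MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY".split()
)


def validate_selection(years, states):
    """Raise ValueError if any requested year or state is not allowed.

    Returns the de-duplicated, sorted years and states when everything is valid.
    """
    bad_years = sorted(set(years) - ALLOWED_YEARS)
    bad_states = sorted(set(states) - ALLOWED_STATES)
    if bad_years or bad_states:
        raise ValueError(
            f"Invalid filter values: years={bad_years}, states={bad_states}"
        )
    return sorted(set(years)), sorted(set(states))
```

Calling `validate_selection([2019, 2020], ["CO", "ID"])` returns the cleaned values, while a year such as 2021 or a state such as "AK" raises a `ValueError` before any data is read.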

@zaneselvans zaneselvans added intake Intake data catalogs epacems The EPA's Continuous Emissions Monitoring System hourly dataset parquet Apache Parquet is an open columnar data file format. labels Apr 7, 2022
@zaneselvans
Member Author

This doesn't appear to be something the parameters can do -- they seem only able to select a single file path at a time. If we pass the DNF filters through to Dask/Pandas, we won't be able to constrain the allowable values. See this comment and this example.
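
One conceivable workaround, offered only as a sketch and not as something the catalog supports, is to check the DNF filters on the client side before handing them to Dask/Pandas. The function `check_dnf_filters` and the column names `year` and `state` are hypothetical. The filter layout assumed here is the usual list of conjunctions, each a list of `(column, op, value)` tuples. Only equality and `in` predicates are checked; range operators such as `>=` pass through unchecked.

```python
def check_dnf_filters(filters, allowed):
    """Validate equality/membership predicates in DNF filters.

    filters: a list of conjunctions, each a list of (column, op, value)
             tuples, or a flat list of tuples (treated as one conjunction).
    allowed: dict mapping a column name to the set of values permitted for it.
    Returns the filters normalized to a list of conjunctions.
    """
    if not filters:
        return []
    # A flat list of tuples is shorthand for a single AND-ed conjunction.
    if isinstance(filters[0], tuple):
        filters = [filters]
    for conjunction in filters:
        for column, op, value in conjunction:
            if column not in allowed or op not in ("=", "==", "in"):
                continue  # unconstrained column or unchecked operator
            values = value if op == "in" else [value]
            bad = [v for v in values if v not in allowed[column]]
            if bad:
                raise ValueError(f"Invalid values for {column!r}: {bad}")
    return filters
```

The validated filters could then be passed along unchanged, for example as the `filters` argument when reading the Parquet data, so an invalid year or state fails immediately rather than after a long, empty query.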

@zaneselvans
Member Author

Closing this as it doesn't seem to be workable.
