This repository has been archived by the owner on Jan 12, 2024. It is now read-only.
zaneselvans opened this issue
Apr 7, 2022
· 2 comments
Labels
epacems: The EPA's Continuous Emissions Monitoring System hourly dataset
intake: Intake data catalogs
parquet: Apache Parquet is an open columnar data file format.
The EPA CEMS dataset is composed of ~1300 row groups, each containing a unique combination of `year` and `state`, to allow efficient pushdown filtering by time and location. Only a certain range of years (1995-2020) and a certain set of state abbreviations (continental US plus DC) are valid for filtering. It would be nice if we could at least suggest, and preferably require, that users only attempt to filter with valid values, so that if they ask for something outside the allowable values they get an error right away, rather than waiting a long time for a query that won't give them anything useful.
Is this easy to set up with the intake catalog? Can we designate an allowable set of values for years and states to be used as filters? How are user `parameters` meant to be used? I've seen that you can enumerate allowable values there, but they seem to be usable only in Jinja templating of the filenames, and not for things like the filters.
zaneselvans added the intake, epacems, and parquet labels on Apr 7, 2022
It doesn't appear that we can use the parameters this way: they seem only able to select a single file path at a time. If we pass the DNF filters through to Dask/Pandas, we won't be able to constrain the allowable values through the catalog. See this comment and this example.
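Since the catalog itself can't constrain the filter values, one fallback would be to check the DNF filters on the client side before handing them to Dask/Pandas. The following is only a rough sketch of that idea, not something that exists in intake or in this repository: `validate_filters`, `VALID_YEARS`, and `VALID_STATES` are hypothetical names, the state list assumes "continental US" means the 48 contiguous states plus DC, and only equality and `in` predicates on `year` and `state` are checked.

```python
# Hypothetical helper: none of these names come from intake or this catalog.
# Years 1995-2020 inclusive, as stated in the issue.
VALID_YEARS = frozenset(range(1995, 2021))
# Assumption: "continental US plus DC" means the 48 contiguous states plus DC.
VALID_STATES = frozenset(
    "AL AZ AR CA CO CT DE FL GA ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT "
    "NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY "
    "DC".split()
)
ALLOWED = {"year": VALID_YEARS, "state": VALID_STATES}


def validate_filters(filters):
    """Raise ValueError if a DNF filter asks for a year or state outside the data.

    Accepts either a flat list of (column, op, value) tuples (one conjunction)
    or a list of such lists (a disjunction of conjunctions). Only "=", "==" and
    "in" predicates are checked; range operators are passed through untouched.
    """
    if not filters:
        return filters
    # Normalize a single conjunction into a one-element disjunction.
    conjunctions = [filters] if isinstance(filters[0], tuple) else filters
    for conjunction in conjunctions:
        for column, op, value in conjunction:
            allowed = ALLOWED.get(column)
            if allowed is None or op not in ("=", "==", "in"):
                continue
            values = value if op == "in" else [value]
            bad = sorted(set(values) - allowed)
            if bad:
                raise ValueError(f"Invalid {column} value(s) in filter: {bad}")
    return filters
```

The validated list could then be passed along unchanged as the `filters` argument when reading the Parquet data, so a bad request fails immediately instead of after a long, empty query.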