Adds Dataset.query() method, analogous to pandas DataFrame.query() #4984
Conversation
Hi folks, thought I'd put up a proof of concept PR here for further discussion. Any advice/suggestions about if/how to take this forward would be very welcome.
Thanks @alimanfoo, this looks like a great start. And forgive me for taking a few days to respond. Does the pd.eval approach work with more than two dimensions? Great if so! This would be very high impact per line of code :) Please could we add some tests for that?
Hi @max-sixty, no problem. Re this, I'm not quite sure what you mean; could you elaborate?
Just to mention I've added tests to verify this works with variables backed by dask arrays. I've also added explicit tests of the different eval engine and query parser options, and a docstring.
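To illustrate the dask-backed case, here is a minimal sketch of what such a query looks like (the variable name, sizes, and chunking are invented for the example, not taken from the PR's tests):

```python
import dask.array as da
import xarray as xr

# a dataset whose only variable is a chunked dask array along dimension "x"
ds = xr.Dataset({"a": ("x", da.arange(100, chunks=10))})

# query() is used the same way as with numpy-backed variables
subset = ds.query(x="a > 50")
print(subset.dims)
```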
For sure — forgive me if I wasn't clear. Currently the test runs over an array of two dimensions — could we also cover more than two?
No worries, yes, any number of dimensions can be queried, and I've added tests showing queries over three dimensions. As an aside, while writing these tests I came across a probable upstream bug in pandas, reported as pandas-dev/pandas#40436. I don't think it affects this PR, and it's low impact since only the "python" query parser is affected and most people will use the default "pandas" query parser.
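As a concrete sketch of the multi-dimensional case being discussed (the dataset contents and variable names here are illustrative, not the PR's actual test data):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": ("x", np.arange(10)),
        "b": ("y", np.linspace(0.0, 1.0, 5)),
        "c": ("z", np.array([0, 1, 1, 2])),
    }
)

# one query expression per dimension; parser/engine are passed through to pandas eval
subset = ds.query(x="a > 3", y="b < 0.5", z="c == 1", parser="pandas", engine="python")
```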
Great re the dimensions! I reviewed the tests more fully, they look great. It looks like we need a whats-new.rst entry. Could we add a simple method to DataArray as well? And we should add the methods to api.rst. Does anyone have any other thoughts? I think the API is very reasonable. I could imagine a more sophisticated API that could take a single query, rather than a dict of them by dimension — currently it's ...
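For reference, a small sketch of the dict-of-queries-by-dimension form being discussed, reusing ds from the sketch above (the keyword form is equivalent):

```python
# queries keyed by dimension name, evaluated against variables along that dimension
subset = ds.query({"x": "a > 3", "y": "b < 0.5"})

# equivalent keyword form
subset = ds.query(x="a > 3", y="b < 0.5")
```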
Hi @max-sixty,
Whats-new entry: sure, done.
DataArray method: done.
api.rst listing: done. Let me know if there's anything else. Looking forward to using this 😄
Excellent! Could we add a very small test for the DataArray? Given the coverage on Dataset, it should mostly just test that the method works. Any thoughts from others before we merge?
No problem, some DataArray tests are there.
Good to go from my side.
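For completeness, a minimal sketch of DataArray.query() usage, assuming the array is named so the query expression can refer to it (the data here is invented for illustration):

```python
import numpy as np
import xarray as xr

# the array needs a name for the query expression to reference it
da = xr.DataArray(np.arange(5), dims="x", name="a")

subset = da.query(x="a > 2")
print(subset.values)  # [3 4]
```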
The code LGTM. I didn't check the tests closely but they seem v. thorough. Thanks @alimanfoo!
In a future PR, it would be good to add some docs comparing this to using .where: https://xarray.pydata.org/en/latest/user-guide/indexing.html#masking-with-where
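As a rough sketch of the distinction those docs would draw (my own illustration, not from the PR): query() subsets along a dimension, while where() keeps the original shape and masks by default:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(5))})

subset = ds.query(x="a > 2")             # drops points along x where the condition is False
masked = ds.where(ds.a > 2)              # same shape; failing values become NaN
dropped = ds.where(ds.a > 2, drop=True)  # drops failing points, similar in effect to query()
```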
Great, merging! Seconded re the docs!
Yay, first xarray PR 🥳
This PR adds a Dataset.query() method which enables making a selection from a dataset based on values in one or more data variables, where the selection is given as a query expression to be evaluated against the data variables in the dataset. See also discussion.
- Passes pre-commit run --all-files
- User visible changes documented in whats-new.rst
- New functions/methods listed in api.rst
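A minimal usage sketch of the new method (the dataset contents are invented for illustration; see the PR's tests for the authoritative behaviour):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(10)), "b": ("x", np.linspace(0.0, 1.0, 10))})

# analogous to pandas DataFrame.query(): evaluate an expression against the
# data variables and keep the points along "x" where it is True
subset = ds.query(x="(a > 3) & (b < 0.8)")
```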