-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add additional str accessor methods for DataArray #4622
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @toddrjen — this is very impressive and quite a lot! Let me / others take some time this week to go through it more.
I took a look though split
& rsplit
as a sample, which look excellent. Thanks for the docstrings — those make them very clear.
959b325
to
c560a1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a quick look and added some suggestions/ questions. I'll need to give it a more thorough review but looks good so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went over the code and added a lot of comments. But mostly because it's so long. As already mentioned looks good overall.
My main points:
- I'd set a default for the name of new dimensions e.g.
group_dim: Hashable = "group"
. I think that's a good choice in most cases. - I'd recommend to refactor the complicated
XX
andrXX
methods, e.g.split
andrsplit
(see inline comments).
Could you add the methods to the docs:
- https://github.com/pydata/xarray/blob/master/doc/api.rst#string-manipulation
- https://github.com/pydata/xarray/blob/master/doc/api-hidden.rst
I have not checked the tests yet.
I thought about doing this at first. However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations. So I checked what default name xarray uses elsewhere, and it doesn't seem to use default names for the most part (the main exception being DataArray creation). So I think that, in order to avoid unexpected behavior, and to keep consistency, not automatically choosing a name is a better option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations.
That should raise a KeyError, no?
So I checked what default name xarray uses elsewhere, and it doesn't seem to use default names for the most part (the main exception being DataArray creation)
Yes this is true. I'd still prefer defaults but I leave it up to you.
I had a short look at the tests (could not go through them all right now). Again, what I saw looks good. Some things that I noticed:
- The test coverage is very good.
- Some of the tests could probably be simplified, to make them easier to read. E.g. when you try to raise an error.
- we usually add a
match
to thepytest.raises
. This also helps to understand what you are testing. assert_equal
should raise an error if the dtype does not match, so you should not need to add all theassert result.dtype == expected.dtype
. Consider
import numpy as np
import xarray as xr
a = xr.DataArray(np.array("a", dtype=np.str_))
b = xr.DataArray(np.array("a", dtype=np.bytes_))
xr.testing.assert_equal(a, b)
Just wanted to let you know that we are definitively interested in this contribution! I'd start by removing the dtype assertions again, that makes the diff of the tests much smaller and more digestible. Unless there is a reason for them? Does this not work correctly in |
Sorry for the delay, I have been swamped at work. I probably won't have any time to work on this before Christmas. I have finished implementing the I am currently working on improving the vectorization of some of the functions. The idea is that some arguments, like for example the regular expression pattern or the number of repetitions in rep, will be able to be given an array-like, with the dimensions being broadcast against the original DataArray. This can be useful, for example, if a DataArray combines data of different formats along a dimension (ideally this wouldn't be the case but people don't always have that much control over the data they get). Or it could be used to create an ASCII bar chart where the number of symbols is equal to the value in an array element.
Yes, but I think it would be strange if using the default parameters once works fine, but using them twice or more in a row somehow returns an exception. I think the defaults should either work generally or not be defaults at all. That is just my opinion. More fundamentally, it is just inconsistent with how xarray works elsewhere and so I think it would be unexpected.
Please point out the specific cases if you haven't already done so.
I will add this.
It doesn't work with an object dtype: >>> import numpy as np
>>> import xarray as xr
>>>
>>> a = xr.DataArray(np.array("a", dtype=np.str_))
>>> b = a.astype(np.object_)
>>> a.dtype == b.dtype
False
>>> a.equals(b)
True
>>> xr.testing.assert_equal(a, b) This does not raise an exception on my machine at least. I ran into several cases where I was incorrectly getting object dtypes and the tests weren't catching it, hence the dtype checks. |
No probem at all, just wanted to check in. Yes you are correct concerning the dtype. This comes back to numpy, where the following returns true np.array("a") == np.array("a", dtype=object) I wonder if that's the right choice... Thus, you are right to change the tests. |
@mathause One possibility might be to make |
I opened an issue regarding the dtypes check. Let's see what the others think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a few more suggestions, most of them regarding the docstrings.
Could you also add the new methods to the DataArray.str
section in api.rst
?
@keewis Thanks for the suggestions. I will add everything to the relevant documentation when I have everything completed and the changes are agreed upon. |
|
The latest version I just pushed should have the requested changes. It also has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had another look and have some more comments. However, I think none of them are blocking - so I'd say you implement the suggestions you deem worthy & tell me when you are happy. Then we give the others another day or two to comment & then I'll merge.
- I think code blocks in the docstring need a double backticks (`). Would be nice if you can fix those.
- I think is more usual to only have one space after a full stop in our docs.
- broadcast
- Usually xarray uses
join="outer"
when broadcasting. Herejoin="exact"
is used. This makes total sense as missing values are not handled. But this might be surprising for users. - I think the formulation
"If array-like, it is broadcast."
could be improved. - How about
Array-like input is broadcast using join="exact"
?
- Usually xarray uses
- In your examples you directly showcase the full functionality. E.g. in
cat
you have two arrays with different dimensions, 3 scalar inputs and the separator is along a third dimension all in one example. This can be quite difficult to wrap my head around. I would probably have made 3 examples out of this (1D array + 1 or 2 scalars + scalar sep; 2 x 1D arrays; 1 D array + 1D sep), making it easier to digest. - The same holds for the tests.
- These last two points do not mean you have to change them (all), and I appreciate how thorough you were, thinking of all the corner cases. However, from a usability and maintainability perspective less can also sometimes be more.
The version here should be complete, in that all planned features are implemented, although of course there may be additional changes. So I removed the |
All tests now pass as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution, @toddrjen!
Thank you very much @toddrjen — it's a huge contribution! |
I fixed a conflict and am merging. Thanks @toddrjen ! This is a very significant contribution. |
Thanks - awesome! Even going through the code took ages so kudos for sticking with it! |
…indow * upstream/master: Fix regression in decoding large standard calendar times (pydata#5050) Fix sticky sidebar responsiveness on small screens (pydata#5039) Flexible indexes refactoring notes (pydata#4979) add a install xarray step to the upstream-dev CI (pydata#5044) Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984) run tests on python 3.9 (pydata#5040) Add date attribute to datetime accessor (pydata#4994) 📚 New theme & rearrangement of the docs (pydata#4835) upgrade ci-trigger to the most recent version (pydata#5037) GH5005 fix documentation on open_rasterio (pydata#5021) GHA for automatically canceling previous CI runs (pydata#5025) Implement GroupBy.__getitem__ (pydata#3691) conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966) Added support for numpy.bool_ (pydata#4986) Add additional str accessor methods for DataArray (pydata#4622)
…-tasks * upstream/master: Fix regression in decoding large standard calendar times (pydata#5050) Fix sticky sidebar responsiveness on small screens (pydata#5039) Flexible indexes refactoring notes (pydata#4979) add a install xarray step to the upstream-dev CI (pydata#5044) Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984) run tests on python 3.9 (pydata#5040) Add date attribute to datetime accessor (pydata#4994) 📚 New theme & rearrangement of the docs (pydata#4835) upgrade ci-trigger to the most recent version (pydata#5037) GH5005 fix documentation on open_rasterio (pydata#5021) GHA for automatically canceling previous CI runs (pydata#5025) Implement GroupBy.__getitem__ (pydata#3691) conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966) Added support for numpy.bool_ (pydata#4986) Add additional str accessor methods for DataArray (pydata#4622) add polyval to polyfit see also (pydata#5020) mention map_blocks in the docstring of apply_ufunc (pydata#5011) Switch backend API to v2 (pydata#4989) WIP: add new backend api documentation (pydata#4810) pin netCDF4=1.5.3 in min-all-deps (pydata#4982)
This implements the following additional string accessor methods, based loosely on the versions in pandas:
One-to-one
One-to-many
Many-to-one
Operators
+
*
%
Other
Allow vectorized arguments.
Closes ENH: Support more of the pandas str accessors #3940
Tests added
Passes
isort . && black . && mypy . && flake8
User visible changes (including notable bug fixes) are documented in
whats-new.rst
New functions/methods are listed in
api.rst