Add additional str accessor methods for DataArray #4622

toddrjen · 2020-11-30T02:48:36Z

This implements the following additional string accessor methods, based loosely on the versions in pandas:

One-to-one

casefold(self)
normalize(self, form)

One-to-many

extract(self, pat[, flags, expand])
extractall(self, pat[, flags])
findall(self, pat[, flags])
get_dummies(self[, sep])
partition(self[, sep, expand])
rpartition(self[, sep, expand])
rsplit(self[, pat, n, expand])
split(self[, pat, n, expand])

Many-to-one

cat(self[, others, sep, na_rep, join])
join(self, sep)

Operators

+
*
%

Other

Allow vectorized arguments.
Closes ENH: Support more of the pandas str accessors #3940
Tests added
Passes isort . && black . && mypy . && flake8
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst
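
For orientation, the sketch below shows the standard-library operations that share their names with several of the methods listed above. It runs on plain Python strings, not on a DataArray, and it assumes the accessor versions apply the same semantics elementwise; the accessor signatures themselves are not shown here.

```python
import re
import unicodedata

# casefold is a more aggressive lower(): the German sharp s becomes "ss".
folded = "Straße".casefold()

# normalize: compose "e" + combining acute accent into one code point (NFC).
composed = unicodedata.normalize("NFC", "e\u0301")

# split counts its splits from the left, rsplit from the right.
left_split = "a_b_c".split("_", 1)
right_split = "a_b_c".rsplit("_", 1)

# partition cuts at the first separator, rpartition at the last.
first_part = "a_b_c".partition("_")
last_part = "a_b_c".rpartition("_")

# findall returns every non-overlapping match of a pattern.
numbers = re.findall(r"\d+", "x1 y22 z333")
```

The one-to-many methods differ from these scalar versions mainly in that a variable number of results per element has to be placed along a new dimension.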

max-sixty

Hi @toddrjen — this is very impressive and quite a lot! Let me / others take some time this week to go through it more.

I took a look through split & rsplit as a sample, which look excellent. Thanks for the docstrings; those make them very clear.

xarray/core/accessor_str.py

mathause

I had a quick look and added some suggestions/ questions. I'll need to give it a more thorough review but looks good so far.

xarray/core/accessor_str.py

mathause

I went over the code and added a lot of comments. But mostly because it's so long. As already mentioned looks good overall.

My main points:

I'd set a default for the name of new dimensions e.g. group_dim: Hashable = "group". I think that's a good choice in most cases.
I'd recommend refactoring the complicated XX and rXX methods, e.g. split and rsplit (see inline comments).

Could you add the methods to the docs:

I have not checked the tests yet.

xarray/core/accessor_str.py

toddrjen · 2020-12-03T02:42:24Z

> I'd set a default for the name of new dimensions e.g. group_dim: Hashable = "group". I think that's a good choice in most cases.

I thought about doing this at first. However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations. So I checked what default name xarray uses elsewhere, and it doesn't seem to use default names for the most part (the main exception being DataArray creation). So I think that, in order to avoid unexpected behavior, and to keep consistency, not automatically choosing a name is a better option.

mathause

> However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations.

That should raise a KeyError, no?

> So I checked what default name xarray uses elsewhere, and it doesn't seem to use default names for the most part (the main exception being DataArray creation)

Yes this is true. I'd still prefer defaults but I leave it up to you.

I had a short look at the tests (could not go through them all right now). Again, what I saw looks good. Some things that I noticed:

The test coverage is very good.
Some of the tests could probably be simplified, to make them easier to read. E.g. when you try to raise an error.
We usually add a match argument to pytest.raises. This also helps to understand what you are testing.
assert_equal should raise an error if the dtype does not match, so you should not need to add all the assert result.dtype == expected.dtype. Consider:

import numpy as np
import xarray as xr

a = xr.DataArray(np.array("a", dtype=np.str_))
b = xr.DataArray(np.array("a", dtype=np.bytes_))

xr.testing.assert_equal(a, b)

xarray/tests/test_accessor_str.py

xarray/core/accessor_str.py

xarray/tests/test_accessor_str.py

mathause · 2020-12-21T08:52:04Z

Just wanted to let you know that we are definitely interested in this contribution! I'd start by removing the dtype assertions again; that makes the diff of the tests much smaller and more digestible. Unless there is a reason for them? Does this not work correctly in assert_equal?

toddrjen · 2020-12-22T03:04:48Z

@mathause

Sorry for the delay, I have been swamped at work. I probably won't have any time to work on this before Christmas.

I have finished implementing the cat and join methods, and I implemented +, *, and % operator support.

I am currently working on improving the vectorization of some of the functions. The idea is that some arguments, like for example the regular expression pattern or the number of repetitions in rep, will be able to be given an array-like, with the dimensions being broadcast against the original DataArray.

This can be useful, for example, if a DataArray combines data of different formats along a dimension (ideally this wouldn't be the case but people don't always have that much control over the data they get). Or it could be used to create an ASCII bar chart where the number of symbols is equal to the value in an array element.
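
As a rough illustration of the bar-chart idea, the sketch below uses plain NumPy rather than the accessor itself. `repeat_elementwise` is a hypothetical helper, and NumPy broadcasts by axis position whereas xarray broadcasts by dimension name, so this only approximates the behavior described above.

```python
import numpy as np

# Hypothetical helper: repeat each string by the matching count.
# otypes=[object] keeps NumPy from truncating the longer results.
repeat_elementwise = np.vectorize(lambda s, n: s * int(n), otypes=[object])

symbols = np.array(["#", "*"])      # shape (2,): one symbol per column
counts = np.array([[1], [3], [2]])  # shape (3, 1): one count per row

# The two inputs are broadcast against each other to shape (3, 2).
bars = repeat_elementwise(symbols, counts)

for row in bars.tolist():
    print(" ".join(row))
```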

> > However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations.

> That should raise a KeyError, no?

Yes, but I think it would be strange if using the default parameters once works fine, but using them twice or more in a row somehow returns an exception. I think the defaults should either work generally or not be defaults at all. That is just my opinion. More fundamentally, it is just inconsistent with how xarray works elsewhere and so I think it would be unexpected.

> Some of the tests could probably be simplified, to make them easier to read. E.g. when you try to raise an error.

Please point out the specific cases if you haven't already done so.

> we usually add a match to the pytest.raises. This also helps to understand what you are testing.

I will add this.

> assert_equal should raise an error if the dtype does not match, so you should not need to add all the assert result.dtype == expected.dtype.

It doesn't work with an object dtype:

>>> import numpy as np
>>> import xarray as xr
>>>
>>> a = xr.DataArray(np.array("a", dtype=np.str_))
>>> b = a.astype(np.object_)
>>> a.dtype == b.dtype
False
>>> a.equals(b)
True
>>> xr.testing.assert_equal(a, b)

This does not raise an exception on my machine at least. I ran into several cases where I was incorrectly getting object dtypes and the tests weren't catching it, hence the dtype checks.
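
A minimal sketch of the kind of combined check this calls for, written against plain NumPy arrays. `assert_equal_with_dtype` is a hypothetical helper for illustration only; it is not part of xarray or of this PR.

```python
import numpy as np


def assert_equal_with_dtype(actual, expected):
    """Hypothetical helper: require matching dtypes as well as equal values."""
    actual = np.asarray(actual)
    expected = np.asarray(expected)
    if actual.dtype != expected.dtype:
        raise AssertionError(
            f"dtype mismatch: {actual.dtype} != {expected.dtype}"
        )
    np.testing.assert_array_equal(actual, expected)


a = np.array("a", dtype=np.str_)
b = a.astype(np.object_)

# The values compare equal even though the dtypes differ.
values_equal = bool(a == b)
dtypes_equal = a.dtype == b.dtype
```

With this helper, `assert_equal_with_dtype(a, b)` fails on the dtype check, while a value-only comparison passes.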

mathause · 2020-12-22T15:09:15Z

> Sorry for the delay, I have been swamped at work. I probably won't have any time to work on this before Christmas.

No problem at all, just wanted to check in.

Yes, you are correct concerning the dtype. This comes back to numpy, where the following returns True:

np.array("a") == np.array("a", dtype=object)

I wonder if that's the right choice... Thus, you are right to change the tests.

toddrjen · 2020-12-23T02:24:39Z

@mathause One possibility might be to make xr.testing.assert_identical match dtypes. I can see different dtypes being "equal", but not "identical".

mathause · 2020-12-23T13:22:05Z

I opened an issue regarding the dtypes check. Let's see what the others think.

keewis

I've got a few more suggestions, most of them regarding the docstrings.

Could you also add the new methods to the DataArray.str section in api.rst?

xarray/core/accessor_str.py

toddrjen · 2020-12-29T05:34:47Z

@keewis Thanks for the suggestions. I will add everything to the relevant documentation when I have everything completed and the changes are agreed upon.

keewis · 2020-12-30T15:27:53Z

api.rst contains a structured enumeration of methods autosummary should generate documentation pages for, so there's not much to do, and it would allow us to check the rendered version of the docstrings.

toddrjen · 2020-12-31T06:43:42Z

The latest version I just pushed should have the requested changes. It also has cat, join, +, *, %. I have also implemented broadcasting for many (but not all) of the functions I plan to implement it for so you can see some examples of how it works.

mathause

I had another look and have some more comments. However, I think none of them are blocking - so I'd say you implement the suggestions you deem worthy & tell me when you are happy. Then we give the others another day or two to comment & then I'll merge.

I think code blocks in the docstring need double backticks (``). Would be nice if you can fix those.
I think it is more usual to only have one space after a full stop in our docs.
broadcast
- Usually xarray uses join="outer" when broadcasting. Here join="exact" is used. This makes total sense as missing values are not handled. But this might be surprising for users.
- I think the formulation "If array-like, it is broadcast." could be improved.
- How about Array-like input is broadcast using join="exact"?
In your examples you directly showcase the full functionality. E.g. in cat you have two arrays with different dimensions, 3 scalar inputs and the separator is along a third dimension, all in one example. This can be quite difficult to wrap my head around. I would probably have made 3 examples out of this (1D array + 1 or 2 scalars + scalar sep; 2 x 1D arrays; 1D array + 1D sep), making it easier to digest.
The same holds for the tests.
These last two points do not mean you have to change them (all), and I appreciate how thorough you were, thinking of all the corner cases. However, from a usability and maintainability perspective less can also sometimes be more.
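
The difference between the two join modes can be seen with `xr.align` directly. This is only a sketch of the general alignment behavior, using made-up arrays `a` and `b`; it does not call the string accessor.

```python
import xarray as xr

# Two string arrays whose "x" labels only partly overlap.
a = xr.DataArray(["a", "b"], dims="x", coords={"x": [0, 1]})
b = xr.DataArray(["c", "d"], dims="x", coords={"x": [1, 2]})

# join="outer" takes the union of the labels, which introduces
# missing values where one of the arrays has no entry.
a_outer, b_outer = xr.align(a, b, join="outer")

# join="exact" refuses to align objects whose labels differ.
try:
    xr.align(a, b, join="exact")
    exact_failed = False
except ValueError:
    exact_failed = True
```

Because string operations have no obvious result for a missing value, raising early with join="exact" avoids having to define one.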

xarray/core/accessor_str.py

xarray/tests/test_accessor_str.py

toddrjen · 2021-03-06T05:41:39Z

The version here should be complete, in that all planned features are implemented, although of course there may be additional changes. So I removed the [WIP] part and updated whats-new.rst and others. I squashed my commits down and force-pushed to get a clean look at things. Please take a look and tell me what you think.

toddrjen · 2021-03-07T06:20:08Z

All tests now pass as well.

andersy005

Thank you for your contribution, @toddrjen!

max-sixty · 2021-03-08T04:24:15Z

Thank you very much @toddrjen — it's a huge contribution!

max-sixty · 2021-03-11T17:49:22Z

I fixed a conflict and am merging.

Thanks @toddrjen ! This is a very significant contribution.

mathause · 2021-03-12T08:44:00Z

Thanks - awesome! Even going through the code took ages so kudos for sticking with it!

max-sixty reviewed Nov 30, 2020

toddrjen force-pushed the str branch 2 times, most recently from 959b325 to c560a1a Compare November 30, 2020 05:27

mathause reviewed Nov 30, 2020

mathause reviewed Dec 1, 2020

mathause reviewed Dec 7, 2020

mathause mentioned this pull request Dec 23, 2020

xr.testing.assert_equal does not test for dtype #4727

Open

keewis reviewed Dec 28, 2020

toddrjen force-pushed the str branch from 7af8c4a to 60622d3 Compare December 31, 2020 15:35

mathause approved these changes Jan 5, 2021

toddrjen force-pushed the str branch from 60622d3 to 84fc17d Compare March 6, 2021 05:03

toddrjen changed the title ~~[WIP] Add additional str accessor methods for DataArray~~ Add additional str accessor methods for DataArray Mar 6, 2021

toddrjen force-pushed the str branch from 84fc17d to f937d2d Compare March 6, 2021 05:39

toddrjen added 7 commits March 6, 2021 00:57

add type hints for the str accessor class

9ac45fc

allow str accessors to use regular expression objects for regular expressions

7b37c6d

implement casefold and normalize str accessor functions

636c166

implement one-to-many str accessor functions

9ea2020

implement cat, join, format, +, *, and %

787216a

support elementwise operations in many str accessor functions

0e80115

update whats-new.rst, api.rst, and api-hidden.rst

1ffc79e

toddrjen force-pushed the str branch from f937d2d to 1ffc79e Compare March 6, 2021 05:58

toddrjen added 6 commits March 6, 2021 09:01

test fixes

19c8a91

implement requested fixes

48d67c7

more fixes

763f979

typing fixes

408a58b

fix docstring

adfd09d

fix more docstring

736c994

andersy005 approved these changes Mar 7, 2021

remove encoding header

3473ac3

Merge branch 'master' into str

cd8f7e4

max-sixty merged commit 6ff27ca into pydata:master Mar 11, 2021
