Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional str accessor methods for DataArray #4622

Merged
merged 15 commits into from
Mar 11, 2021
Merged

Conversation

toddrjen
Copy link
Contributor

@toddrjen toddrjen commented Nov 30, 2020

This implements the following additional string accessor methods, based loosely on the versions in pandas:

One-to-one

  • casefold(self)
  • normalize(self, form)

One-to-many

  • extract(self, pat[, flags, expand])
  • extractall(self, pat[, flags])
  • findall(self, pat[, flags])
  • get_dummies(self[, sep])
  • partition(self[, sep, expand])
  • rpartition(self[, sep, expand])
  • rsplit(self[, pat, n, expand])
  • split(self[, pat, n, expand])

Many-to-one

  • cat(self[, others, sep, na_rep, join])
  • join(self, sep)

Operators

  • +
  • *
  • %

Other

  • Allow vectorized arguments.

  • Closes ENH: Support more of the pandas str accessors #3940

  • Tests added

  • Passes isort . && black . && mypy . && flake8

  • User visible changes (including notable bug fixes) are documented in whats-new.rst

  • New functions/methods are listed in api.rst

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @toddrjen — this is very impressive and quite a lot! Let me / others take some time this week to go through it more.

I took a look though split & rsplit as a sample, which look excellent. Thanks for the docstrings — those make them very clear.

xarray/core/accessor_str.py Outdated Show resolved Hide resolved
@toddrjen toddrjen force-pushed the str branch 2 times, most recently from 959b325 to c560a1a Compare November 30, 2020 05:27
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick look and added some suggestions/ questions. I'll need to give it a more thorough review but looks good so far.

xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over the code and added a lot of comments. But mostly because it's so long. As already mentioned looks good overall.

My main points:

  • I'd set a default for the name of new dimensions e.g. group_dim: Hashable = "group". I think that's a good choice in most cases.
  • I'd recommend to refactor the complicated XX and rXX methods, e.g. split and rsplit (see inline comments).

Could you add the methods to the docs:

I have not checked the tests yet.

xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
@toddrjen
Copy link
Contributor Author

toddrjen commented Dec 3, 2020

  • I'd set a default for the name of new dimensions e.g. group_dim: Hashable = "group". I think that's a good choice in most cases.

I thought about doing this at first. However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations. So I checked what default name xarray uses elsewhere, and it doesn't seem to use default names for the most part (the main exception being DataArray creation). So I think that, in order to avoid unexpected behavior, and to keep consistency, not automatically choosing a name is a better option.

Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations.

That should raise a KeyError, no?

So I checked what default name xarray uses elsewhere, and it doesn't seem to use default names for the most part (the main exception being DataArray creation)

Yes this is true. I'd still prefer defaults but I leave it up to you.


I had a short look at the tests (could not go through them all right now). Again, what I saw looks good. Some things that I noticed:

  • The test coverage is very good.
  • Some of the tests could probably be simplified, to make them easier to read. E.g. when you try to raise an error.
  • we usually add a match to the pytest.raises. This also helps to understand what you are testing.
  • assert_equal should raise an error if the dtype does not match, so you should not need to add all the assert result.dtype == expected.dtype. Consider
import numpy as np
import xarray as xr

a = xr.DataArray(np.array("a", dtype=np.str_))
b = xr.DataArray(np.array("a", dtype=np.bytes_))

xr.testing.assert_equal(a, b)

xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
@mathause
Copy link
Collaborator

Just wanted to let you know that we are definitively interested in this contribution! I'd start by removing the dtype assertions again, that makes the diff of the tests much smaller and more digestible. Unless there is a reason for them? Does this not work correctly in assert_equal?

@toddrjen
Copy link
Contributor Author

@mathause

Sorry for the delay, I have been swamped at work. I probably won't have any time to work on this before Christmas.

I have finished implementing the cat and join methods, and I implemented +, *, and % operator support.

I am currently working on improving the vectorization of some of the functions. The idea is that some arguments, like for example the regular expression pattern or the number of repetitions in rep, will be able to be given an array-like, with the dimensions being broadcast against the original DataArray.

This can be useful, for example, if a DataArray combines data of different formats along a dimension (ideally this wouldn't be the case but people don't always have that much control over the data they get). Or it could be used to create an ASCII bar chart where the number of symbols is equal to the value in an array element.

However, this could lead to conflicts if the DataArray already has a dimension with that name, which would be a particular problem if people chained together multiple such operations.

That should raise a KeyError, no?

Yes, but I think it would be strange if using the default parameters once works fine, but using them twice or more in a row somehow returns an exception. I think the defaults should either work generally or not be defaults at all. That is just my opinion. More fundamentally, it is just inconsistent with how xarray works elsewhere and so I think it would be unexpected.

  • Some of the tests could probably be simplified, to make them easier to read. E.g. when you try to raise an error.

Please point out the specific cases if you haven't already done so.

  • we usually add a match to the pytest.raises. This also helps to understand what you are testing.

I will add this.

  • assert_equal should raise an error if the dtype does not match, so you should not need to add all the assert result.dtype == expected.dtype.

It doesn't work with an object dtype:

>>> import numpy as np 
>>> import xarray as xr 
>>>   
>>> a = xr.DataArray(np.array("a", dtype=np.str_)) 
>>> b = a.astype(np.object_)                     
>>> a.dtype == b.dtype                                                                                                                                                                                                                        
False
>>> a.equals(b)
True
>>> xr.testing.assert_equal(a, b)

This does not raise an exception on my machine at least. I ran into several cases where I was incorrectly getting object dtypes and the tests weren't catching it, hence the dtype checks.

@mathause
Copy link
Collaborator

Sorry for the delay, I have been swamped at work. I probably won't have any time to work on this before Christmas.

No probem at all, just wanted to check in.

Yes you are correct concerning the dtype. This comes back to numpy, where the following returns true

np.array("a") == np.array("a", dtype=object)

I wonder if that's the right choice... Thus, you are right to change the tests.

@toddrjen
Copy link
Contributor Author

@mathause One possibility might be to make xr.testing.assert_identical match dtypes. I can see different dtypes being "equal", but not "identical".

@mathause
Copy link
Collaborator

I opened an issue regarding the dtypes check. Let's see what the others think.

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a few more suggestions, most of them regarding the docstrings.

Could you also add the new methods to the DataArray.str section in api.rst?

xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
@toddrjen
Copy link
Contributor Author

@keewis Thanks for the suggestions. I will add everything to the relevant documentation when I have everything completed and the changes are agreed upon.

@keewis
Copy link
Collaborator

keewis commented Dec 30, 2020

api.rst contains a structured enumeration of methods autosummary should generate documentation pages for, so there's not much to do, and it would allow us to check the rendered version of the docstrings.

@toddrjen
Copy link
Contributor Author

The latest version I just pushed should have the requested changes. It also has cat, join, +, *, %. I have also implemented broadcasting for many (but not all) of the functions I plan to implement it for so you can see some examples of how it works.

Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another look and have some more comments. However, I think none of them are blocking - so I'd say you implement the suggestions you deem worthy & tell me when you are happy. Then we give the others another day or two to comment & then I'll merge.


  • I think code blocks in the docstring need a double backticks (`). Would be nice if you can fix those.
  • I think is more usual to only have one space after a full stop in our docs.
  • broadcast
    • Usually xarray uses join="outer" when broadcasting. Here join="exact" is used. This makes total sense as missing values are not handled. But this might be surprising for users.
    • I think the formulation "If array-like, it is broadcast." could be improved.
    • How about Array-like input is broadcast using join="exact"?
  • In your examples you directly showcase the full functionality. E.g. in cat you have two arrays with different dimensions, 3 scalar inputs and the separator is along a third dimension all in one example. This can be quite difficult to wrap my head around. I would probably have made 3 examples out of this (1D array + 1 or 2 scalars + scalar sep; 2 x 1D arrays; 1 D array + 1D sep), making it easier to digest.
  • The same holds for the tests.
  • These last two points do not mean you have to change them (all), and I appreciate how thorough you were, thinking of all the corner cases. However, from a usability and maintainability perspective less can also sometimes be more.

xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/core/accessor_str.py Show resolved Hide resolved
xarray/core/accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
xarray/tests/test_accessor_str.py Outdated Show resolved Hide resolved
@toddrjen toddrjen changed the title [WIP] Add additional str accessor methods for DataArray Add additional str accessor methods for DataArray Mar 6, 2021
@toddrjen
Copy link
Contributor Author

toddrjen commented Mar 6, 2021

The version here should be complete, in that all planned features are implemented, although of course there may be additional changes. So I removed the [WIP] part and updated whats-new.rst and others. I squashed my commits down and force-pushed to get a clean look at things. Please take a look and tell me what you think.

@toddrjen
Copy link
Contributor Author

toddrjen commented Mar 7, 2021

All tests now pass as well.

Copy link
Member

@andersy005 andersy005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution, @toddrjen!

@max-sixty
Copy link
Collaborator

Thank you very much @toddrjen — it's a huge contribution!

@max-sixty
Copy link
Collaborator

I fixed a conflict and am merging.

Thanks @toddrjen ! This is a very significant contribution.

@max-sixty max-sixty merged commit 6ff27ca into pydata:master Mar 11, 2021
@mathause
Copy link
Collaborator

Thanks - awesome! Even going through the code took ages so kudos for sticking with it!

dcherian added a commit to dcherian/xarray that referenced this pull request Mar 18, 2021
…indow

* upstream/master:
  Fix regression in decoding large standard calendar times (pydata#5050)
  Fix sticky sidebar responsiveness on small screens (pydata#5039)
  Flexible indexes refactoring notes (pydata#4979)
  add a install xarray step to the upstream-dev CI (pydata#5044)
  Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984)
  run tests on python 3.9 (pydata#5040)
  Add date attribute to datetime accessor (pydata#4994)
  📚 New theme & rearrangement of the docs (pydata#4835)
  upgrade ci-trigger to the most recent version (pydata#5037)
  GH5005 fix documentation on open_rasterio (pydata#5021)
  GHA for automatically canceling previous CI runs (pydata#5025)
  Implement GroupBy.__getitem__ (pydata#3691)
  conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966)
  Added support for numpy.bool_ (pydata#4986)
  Add additional str accessor methods for DataArray (pydata#4622)
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 23, 2021
…-tasks

* upstream/master:
  Fix regression in decoding large standard calendar times (pydata#5050)
  Fix sticky sidebar responsiveness on small screens (pydata#5039)
  Flexible indexes refactoring notes (pydata#4979)
  add a install xarray step to the upstream-dev CI (pydata#5044)
  Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984)
  run tests on python 3.9 (pydata#5040)
  Add date attribute to datetime accessor (pydata#4994)
  📚 New theme & rearrangement of the docs (pydata#4835)
  upgrade ci-trigger to the most recent version (pydata#5037)
  GH5005 fix documentation on open_rasterio (pydata#5021)
  GHA for automatically canceling previous CI runs (pydata#5025)
  Implement GroupBy.__getitem__ (pydata#3691)
  conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966)
  Added support for numpy.bool_ (pydata#4986)
  Add additional str accessor methods for DataArray (pydata#4622)
  add polyval to polyfit see also (pydata#5020)
  mention map_blocks in the docstring of apply_ufunc (pydata#5011)
  Switch backend API to v2 (pydata#4989)
  WIP: add new backend api documentation (pydata#4810)
  pin netCDF4=1.5.3 in min-all-deps (pydata#4982)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support more of the pandas str accessors
5 participants