-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: StringArray construction #36325
PERF: StringArray construction #36325
Conversation
pandas/core/arrays/string_.py
Outdated
@@ -122,6 +122,9 @@ class StringArray(PandasArray): | |||
|
|||
copy : bool, default False | |||
Whether to copy the array of data. | |||
convert : bool, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want this parameter generally in the actual constructors (you can instead do this in `_from_sequence), but even there we don't do coercion. I think this actually needs to be done in the caller. cc @jorisvandenbossche @TomAugspurger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want this parameter generally in the actual constructors (you can instead do this in `_from_sequence)
The decision to validate or convert needs to be in the __init__
, because the validation already is there and it has no way to know from the input array, whether the input array only contains strings and NA's. So it needs a flag to tell it if it should validate or convert because validation after conversion is wasteful.
An alternative is to always run the conversion step in StringArray.__init__
, I.e. non-string scalars. In that case validation is not needed at all. It would not cost in performance, after #36162 has been merged:
>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object)
>>> %timeit pd._libs.lib.ensure_string_array(x, copy=False)
11.5 ms ± 145 µs per loop
>>> %timeit pd._libs.lib.is_string_array(x)
13.3 ms ± 80.5 µs per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seem to recall we had discussion about trying to avoid this double validation, but don't directly find anything on the original PR #27949
So it needs a flag to tell it if it should validate or convert because validation after conversion is wasteful.
Alternative would be a flag to disable validation (instead of a flag to enable conversion),then _from_sequence
can do the conversion and signal to the constructor it doesn't need to validate.
In general, I prefer the Array class constructors to be minimal and performant (they are not meant for day-to-day users, those should use pd.array(..)
), eg the IntegerArray constructor only checks that the passed values are actually an ndarray and have the correct dtype (and not any other coercion or conversion). But of course, there it is a relatively cheap check, while for object dtype array this check is expensive.
So given that limiation for StringArray using object dtype, I am fine with adding such a "fastpath" keyword in this case (but I would use something like validate
or verify_integrity
)
rebase/lint checks |
I'd prefer not to have a public argument that allows people to create an invalid object.
Can you say a bit more / give an example where we're doing the conversion / validation twice? Keep in mind that even if we know that we have all strings, we want to normalize all the NA values to |
|
Right, Thanks. Thoughts on refactoring |
I'll look into this tonight or tomorrow. |
4ee0f72
to
b15c7d6
Compare
b15c7d6
to
39ea860
Compare
I’ve made a new version. |
thanks @topper-123 |
Currently, when constructing through
Series(data, dtype="string")
, pandas first converts to strings/NA, then does a check that all scalars are actually strings or NA. The check is not needed in cases where we explicitly already have converted.Performance example:
xref #35519, #36304 & #36317.