PERF: StringArray construction #36325

topper-123 · 2020-09-13T06:52:50Z

Currently, when constructing through Series(data, dtype="string"), pandas first converts to strings/NA, then does a check that all scalars are actually strings or NA. The check is not needed in cases where we explicitly already have converted.

Performance example:

>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object)
>>> %timeit pd.Series(x, dtype="string")
357 ms ± 40.2 ms per loop  # v1.1.0
148 ms ± 713 µs per loop  # after #35519
26.3 ms ± 291 µs per loop  # after #36304
12.6 ms ± 115 µs per loop  # this PR

xref #35519, #36304 & #36317.

jreback · 2020-09-13T12:42:27Z

pandas/core/arrays/string_.py

@@ -122,6 +122,9 @@ class StringArray(PandasArray):

    copy : bool, default False
        Whether to copy the array of data.
+    convert : bool, default False


we don't want this parameter generally in the actual constructors (you can instead do this in `_from_sequence), but even there we don't do coercion. I think this actually needs to be done in the caller. cc @jorisvandenbossche @TomAugspurger

we don't want this parameter generally in the actual constructors (you can instead do this in `_from_sequence)

The decision to validate or convert needs to be in the __init__, because the validation already is there and it has no way to know from the input array, whether the input array only contains strings and NA's. So it needs a flag to tell it if it should validate or convert because validation after conversion is wasteful.

An alternative is to always run the conversion step in StringArray.__init__, I.e. non-string scalars. In that case validation is not needed at all. It would not cost in performance, after #36162 has been merged:

>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object) >>> %timeit pd._libs.lib.ensure_string_array(x, copy=False) 11.5 ms ± 145 µs per loop >>> %timeit pd._libs.lib.is_string_array(x) 13.3 ms ± 80.5 µs per loop

I seem to recall we had discussion about trying to avoid this double validation, but don't directly find anything on the original PR #27949

So it needs a flag to tell it if it should validate or convert because validation after conversion is wasteful.

Alternative would be a flag to disable validation (instead of a flag to enable conversion),then _from_sequence can do the conversion and signal to the constructor it doesn't need to validate.

In general, I prefer the Array class constructors to be minimal and performant (they are not meant for day-to-day users, those should use pd.array(..)), eg the IntegerArray constructor only checks that the passed values are actually an ndarray and have the correct dtype (and not any other coercion or conversion). But of course, there it is a relatively cheap check, while for object dtype array this check is expensive.
So given that limiation for StringArray using object dtype, I am fine with adding such a "fastpath" keyword in this case (but I would use something like validate or verify_integrity)

jbrockmendel · 2020-09-14T16:51:24Z

rebase/lint checks

TomAugspurger · 2020-09-14T20:20:15Z

I'd prefer not to have a public argument that allows people to create an invalid object.

The check is not needed in cases where we explicitly already have converted.

Can you say a bit more / give an example where we're doing the conversion / validation twice? Keep in mind that even if we know that we have all strings, we want to normalize all the NA values to pd.NA.

jorisvandenbossche · 2020-09-14T20:28:10Z

Can you say a bit more / give an example where we're doing the conversion / validation twice?

_from_sequence converts the input to an object array of all strings (or nans), and then __init__ validates that the input are all strings (which we know is the case, as _from_sequence already ensured the conversion)

TomAugspurger · 2020-09-14T20:51:13Z

Right, Thanks.

Thoughts on refactoring _from_sequence to use object.__new__(StringArray) and then manually initializing by setting _ndarray and _dtype (and skipping the conversion / checks we don't need?) My preference would be to keep this out of the public API given the relative immaturity of StringArray.

topper-123 · 2020-09-15T08:19:17Z

I'll look into this tonight or tomorrow.

topper-123 · 2020-09-16T14:07:07Z

I’ve made a new version.

jreback · 2020-09-17T16:51:05Z

thanks @topper-123

jreback added Performance Memory or execution speed performance Strings String extension data type and string data labels Sep 13, 2020

jreback requested changes Sep 13, 2020

View reviewed changes

topper-123 added 3 commits September 16, 2020 12:59

PERF: StringArray construction

7abf9ff

add issue number

f1721ac

clean doc string

ee01e02

topper-123 force-pushed the StringArray_constrution_perf branch from 4ee0f72 to b15c7d6 Compare September 16, 2020 12:22

Refactor to avoid call to StringArray__init__ & validation

39ea860

topper-123 force-pushed the StringArray_constrution_perf branch from b15c7d6 to 39ea860 Compare September 16, 2020 13:16

jreback added this to the 1.2 milestone Sep 17, 2020

jreback approved these changes Sep 17, 2020

View reviewed changes

jreback merged commit b4d0ae5 into pandas-dev:master Sep 17, 2020

topper-123 deleted the StringArray_constrution_perf branch September 17, 2020 17:10

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request Sep 17, 2020

PERF: StringArray construction (pandas-dev#36325)

d686536

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

PERF: StringArray construction (pandas-dev#36325)

b3b061b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: StringArray construction #36325

PERF: StringArray construction #36325

topper-123 commented Sep 13, 2020 •

edited

Loading

jreback Sep 13, 2020

topper-123 Sep 13, 2020 •

edited

Loading

jorisvandenbossche Sep 13, 2020 •

edited

Loading

jbrockmendel commented Sep 14, 2020

TomAugspurger commented Sep 14, 2020

jorisvandenbossche commented Sep 14, 2020

TomAugspurger commented Sep 14, 2020

topper-123 commented Sep 15, 2020

topper-123 commented Sep 16, 2020

jreback commented Sep 17, 2020

PERF: StringArray construction #36325

PERF: StringArray construction #36325

Conversation

topper-123 commented Sep 13, 2020 • edited Loading

jreback Sep 13, 2020

Choose a reason for hiding this comment

topper-123 Sep 13, 2020 • edited Loading

Choose a reason for hiding this comment

jorisvandenbossche Sep 13, 2020 • edited Loading

Choose a reason for hiding this comment

jbrockmendel commented Sep 14, 2020

TomAugspurger commented Sep 14, 2020

jorisvandenbossche commented Sep 14, 2020

TomAugspurger commented Sep 14, 2020

topper-123 commented Sep 15, 2020

topper-123 commented Sep 16, 2020

jreback commented Sep 17, 2020

topper-123 commented Sep 13, 2020 •

edited

Loading

topper-123 Sep 13, 2020 •

edited

Loading

jorisvandenbossche Sep 13, 2020 •

edited

Loading