
REF: StringArray._from_sequence, use less memory #35519

Merged

Conversation

topper-123
Contributor

@ldacey, can you try if this fixes your problem?

CC @simonjayhawkins

@@ -1698,6 +1698,20 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
return validator.validate(values)


cpdef ndarray ensure_string_array(ndarray values, object na_value):
Contributor

this should return a new array
and pls
add a doc string

also i think we have a routine like this already

@simonjayhawkins simonjayhawkins added Performance Memory or execution speed performance Strings String extension data type and string data labels Aug 3, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Aug 3, 2020
@topper-123
Contributor Author

topper-123 commented Aug 3, 2020

I looked at how the array in pd.Series(data, dtype=str) is created. There is a similar function that ensures a string array, but it's slower.

pd.Series(data, dtype=str) ends up calling dtypes.cast.construct_1d_ndarray_preserving_na, where currently all scalars are first converted to "<U" dtype, then converted back to object dtype (and str type), and finally the nan-likes are added back where appropriate. That's a lot of steps, and in general the same problem as I'm looking into for StringArray.

By also using the new method in construct_1d_ndarray_preserving_na, we avoid creating a lot of new objects: a single iteration reuses existing str objects and adds nan-likes in the same pass rather than in a separate step. This change means faster performance and lower memory usage:

>>> data = ["aaaaaa" for _ in range(201368)]
>>> %timeit pd.Series(data, dtype=str)
66.5 ms ± 656 µs per loop  # master
36.4 ms ± 330 µs per loop  # this PR
>>> %timeit pd.Series(data, dtype="string")
63.9 ms ± 832 µs per loop  # master
37.2 ms ± 459 µs per loop  # this PR

So a bit less than a doubling in performance.

In its new form, this PR might not only be a bug fix for #35499, but also a more fundamental change. Maybe the change to construct_1d_ndarray_preserving_na should go in a separate PR that goes into v1.2?
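The single-pass conversion described above can be sketched in pure Python. This is a hypothetical illustration of the idea (reuse existing str objects, fill in nan-likes in the same loop), not the actual Cython implementation; the function name and signature here are made up for the example:

```python
import numpy as np

def ensure_string_array_sketch(values, na_value=np.nan, copy=True):
    """Hypothetical pure-Python sketch of the single-pass conversion."""
    result = np.asarray(values, dtype="object")
    if copy and result is values:
        result = result.copy()
    for i in range(len(result)):
        val = result[i]
        if isinstance(val, str):
            continue  # reuse the existing str object; no new allocation
        if val is None or val != val:  # None or NaN: treat as missing
            result[i] = na_value
        else:
            result[i] = str(val)
    return result
```

The key point is the `continue` branch: values that are already strings are left untouched, so no intermediate "<U" array and no re-boxed str objects are created.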

values : array-like
The values to be converted to str, if needed
na_value : Any
The value to use for na. For example, np.nan or pd.NAN
Member

pd.NA?

convert_na_value : bool, default True
If False, existing na values will be used unchanged in the new array
copy : bool, default True
Whether to wnsure that a new array is returned
Member

typo + full stop

if convert_na_value:
for i in range(n):
val = result[i]
if not checknull(val):
Member

do we also need to account for pandas.options.mode.use_inf_as_na

Contributor Author

Not super familiar with pandas.options.mode.use_inf_as_na, but it looks like it only treats inf as nan, but it does not convert, right?

Member

ok, support looks patchy, but does work as I expected for Int64, maybe I misunderstood the functionality, but I do expect it to convert in operations such as astype or DataFrame construction.

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.2.0.dev0+27.g53f6b4711'
>>>
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
...
>>> df
     a
0  NaN
1  inf
2  1.0
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("float32")
...
>>> df
     a
0  NaN
1  inf
2  1.0
>>>
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("category")
...
Traceback (most recent call last):
...
ValueError: Categorical categories cannot be null 
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("Int64")
...
>>> df
      a
0  <NA>
1  <NA>
2     1
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", False):
...     df = df.astype("Int64")
...
Traceback (most recent call last):
...
TypeError: cannot safely cast non-equivalent float64 to int64
>>> df
     a
0  NaN
1  inf
2  1.0
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, "foo"])})
>>> with pd.option_context("mode.use_inf_as_na", False):
...     df = df.astype("string")
...
>>> df
     a
0  nan
1  inf
2  foo
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, "foo"])})
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("string")
...
>>> df
     a
0  nan
1  inf
2  foo
>>>
>>>

Contributor Author

@topper-123 topper-123 Aug 5, 2020

pd.options.mode.use_inf_as_na = True doesn't change inf to nan, so I'm not sure an astype should return a string or a nan. For example:

>>> pd.options.mode.use_inf_as_na = True
>>> x = pd.Series([np.inf, np.nan])
>>> x
0   NaN
1   NaN
dtype: float64
>>> x[0]
inf  # it's an inf; only the repr is NaN, and it's treated as nan in calculations...

Pandas 1.0 had not settled on a convention for converting infs to strings when pd.options.mode.use_inf_as_na = True:

>>> pd.options.mode.use_inf_as_na = True
>>> x = pd.Series([np.inf, np.nan])
>>> x.astype(str)
0    inf  # <- string "inf", not string "nan"
1    nan
dtype: object
>>> x.astype("string")
0    <NA>  # <- pd.NA
1    <NA>
dtype: string

Probably a choice of convention on how inf should be converted when use_inf_as_na = True needs to be decided on? In that case it'd be a different PR IMO.

Member

Probably a choice of convention on how inf should be converted when use_inf_as_na = True needs to be decided on? In that case it'd be a different PR IMO.

sure

@simonjayhawkins
Member

simonjayhawkins commented Aug 3, 2020

Maybe the change to construct_1d_ndarray_preserving_na should go in a separate PR that goes into v1.2?

I think now that we are using semver, I'm happy including all the bugfixes in patch versions. I think that'll keep minor releases more focused on enhancements, but it would also mean more work maintaining the backport branch. Let's see what others think.

@@ -1698,6 +1698,48 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
return validator.validate(values)


cpdef ndarray ensure_string_array(
Contributor

this looks like a generalization of: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L621

pls de-dupe these as appropriate

Contributor Author

@topper-123 topper-123 Aug 4, 2020

Updated.

astype_str was only used in one place, so I've deleted it and used ensure_string_array instead.

@topper-123 topper-123 force-pushed the refactor_StringArray._from_sequence branch from 0562d08 to 1337e7e Compare August 4, 2020 07:10
@jreback
Contributor

jreback commented Aug 4, 2020

can u add a memory asv for this case

@topper-123
Contributor Author

I'm not familiar with memory ASVs, only timing ones. Can you point to a file where I can find one? (I can then use it as a template for mine.)

@jreback
Contributor

jreback commented Aug 4, 2020

I'm not familiar with memory ASVs, only timing ones. Can you point to a file where I can find one? (I can then use it as a template for mine.)

https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/rolling.py#L24
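For reference, asv reports peak memory usage for benchmark methods whose names start with the `peakmem_` prefix. A minimal sketch of what such a benchmark could look like for this case follows; the class and method names here are made up for illustration, not the ones actually added in this PR:

```python
import pandas as pd

class StringArrayConstruction:
    # Hypothetical asv benchmark sketch: asv measures peak memory for
    # any method whose name starts with "peakmem_".
    def setup(self):
        self.data = ["aaaaaa" for _ in range(201368)]

    def peakmem_series_str(self):
        pd.Series(self.data, dtype=str)

    def peakmem_series_string(self):
        pd.Series(self.data, dtype="string")
```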

subarr2 = subarr.astype(object)
subarr2[na_values] = np.asarray(values, dtype=object)[na_values]
subarr = subarr2
values = np.asarray(values, dtype="object")
Contributor

dont' actually need the np.asarray call here

Contributor Author

Ok.

@topper-123 topper-123 force-pushed the refactor_StringArray._from_sequence branch from f69e74c to bfe387e Compare August 4, 2020 22:24
@topper-123
Contributor Author

I've added ASVs. However, I'm not able to run them on my machine (Windows 10), so I'm not sure they show what I think they should show... Maybe someone could check?

@jreback
Contributor

jreback commented Aug 10, 2020

I've added ASVs. However, I'm not able to run them on my machine (Windows 10), so I'm not sure they show what I think they should show... Maybe someone could check?

you can just run %memit to assert that they are working, no?

@jreback
Contributor

jreback commented Aug 13, 2020

@topper-123 can you rebase and show the %memit on before / after

@topper-123
Contributor Author

Ok, I've used memit, didn't know about that, it's nice.

Results:

>>> data = ["aaaaaa" for _ in range(201368)]
>>>
>>> %timeit pd.Series(data, dtype=str)
75.2 ms ± 725 µs per loop  # master
40.6 ms ± 389 µs per loop  # this PR
>>> %memit pd.Series(data, dtype=str)
peak memory: 100.96 MiB, increment: 18.87 MiB  # master
peak memory: 84.96 MiB, increment: 1.63 MiB  # this PR
>>>
>>> %timeit pd.Series(data, dtype="string")
70 ms ± 928 µs per loop  # master
41.9 ms ± 910 µs per loop  # this PR
>>> %memit pd.Series(data, dtype="string")
peak memory: 96.68 MiB, increment: 13.47 MiB  # master
peak memory: 85.25 MiB, increment: 1.55 MiB  # this PR

so a significant decrease in both time and memory usage.

The time and memory improvements are of course related, both coming from avoiding the (simplified) pattern np.array(data, dtype=str).astype(object) that is used in master.
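The allocation difference behind that pattern can be illustrated directly. This is an illustrative check, not code from the PR: the "<U"-dtype round trip materializes new str objects, while a direct object-dtype conversion keeps references to the existing ones.

```python
import numpy as np

data = ["aaaaaa"] * 5

# master-style pattern: round-trip through a fixed-width "<U" array,
# then box each element back into a fresh Python str
via_unicode = np.array(data, dtype=str).astype(object)
assert via_unicode[0] == data[0]
assert via_unicode[0] is not data[0]  # a new str was allocated

# direct object-dtype conversion: same str objects, no extra allocation
direct = np.array(data, dtype=object)
assert direct[0] is data[0]
```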

@topper-123 topper-123 force-pushed the refactor_StringArray._from_sequence branch from 52e53b1 to 47b5d69 Compare August 17, 2020 13:20
@jreback
Contributor

jreback commented Aug 17, 2020

thanks @topper-123

@jreback jreback merged commit 3b0f23a into pandas-dev:master Aug 17, 2020
@simonjayhawkins
Member

@meeseeksdev backport 1.1.x

tm.assert_numpy_array_equal(a, original)

expected = nan_arr if copy else na_arr
tm.assert_numpy_array_equal(nan_arr, expected)
Member

I am not sure this test is correctly changed. I think we should never mutate the input, whether copy is True or False (which is what it was testing before).
IMO the copy keyword indicates to simply always copy, or if False, to copy only when needed. And when you need to mutate, I would say a copy is needed.

It's also not taking a copy of the original array, so it isn't even checking that the input wasn't changed (even if it was mutated, it would still compare equal to itself).
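The copy-only-when-needed semantics the reviewer describes could be sketched like this (a hypothetical helper for illustration, not the PR's implementation): with copy=False the input is still copied whenever any element must be changed, so the caller's array is never mutated.

```python
import numpy as np

def convert_without_mutating(arr, copy=False):
    # Hypothetical illustration: copy=False means "copy only when needed".
    # If any value must be converted, take a copy first so the caller's
    # input array survives unchanged.
    needs_change = any(not isinstance(x, str) for x in arr)
    out = arr.copy() if (copy or needs_change) else arr
    for i, x in enumerate(out):
        if not isinstance(x, str):
            out[i] = str(x)
    return out
```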

Labels
Performance Memory or execution speed performance Strings String extension data type and string data
Development

Successfully merging this pull request may close these issues.

BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5
4 participants