REF: StringArray._from_sequence, use less memory #35519
Conversation
pandas/_libs/lib.pyx
Outdated
@@ -1698,6 +1698,20 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
    return validator.validate(values)


cpdef ndarray ensure_string_array(ndarray values, object na_value):
This should return a new array, and please add a docstring. Also, I think we have a routine like this already.
I looked at how the array in StringArray._from_sequence is constructed. By using the new method in this PR:

>>> data = ["aaaaaa" for _ in range(201368)]
>>> %timeit pd.Series(data, dtype=str)
66.5 ms ± 656 µs per loop  # master
36.4 ms ± 330 µs per loop  # this PR
>>> %timeit pd.Series(data, dtype="string")
63.9 ms ± 832 µs per loop  # master
37.2 ms ± 459 µs per loop  # this PR

So a bit less than a doubling in performance. In its new form this PR might not only be a bug fix for #35499 but also a more fundamental change. Maybe the change to ...
pandas/_libs/lib.pyx
Outdated
    values : array-like
        The values to be converted to str, if needed
    na_value : Any
        The value to use for na. For example, np.nan or pd.NAN
pd.NA?
pandas/_libs/lib.pyx
Outdated
    convert_na_value : bool, default True
        If False, existing na values will be used unchanged in the new array
    copy : bool, default True
        Whether to wnsure that a new array is returned
typo + full stop
pandas/_libs/lib.pyx
Outdated
    if convert_na_value:
        for i in range(n):
            val = result[i]
            if not checknull(val):
Do we also need to account for pandas.options.mode.use_inf_as_na?
Not super familiar with pandas.options.mode.use_inf_as_na, but it looks like it only treats inf as nan; it does not convert, right?
Ok, support looks patchy, but it does work as I expected for Int64. Maybe I misunderstood the functionality, but I do expect it to convert in operations such as astype or DataFrame construction.
>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.2.0.dev0+27.g53f6b4711'
>>>
>>> with pd.option_context("mode.use_inf_as_na", True):
... df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
...
>>> df
a
0 NaN
1 inf
2 1.0
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", True):
... df = df.astype("float32")
...
>>> df
a
0 NaN
1 inf
2 1.0
>>>
>>> with pd.option_context("mode.use_inf_as_na", True):
... df = df.astype("category")
...
Traceback (most recent call last):
...
ValueError: Categorical categories cannot be null
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", True):
... df = df.astype("Int64")
...
>>> df
a
0 <NA>
1 <NA>
2 1
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", False):
... df = df.astype("Int64")
...
Traceback (most recent call last):
...
TypeError: cannot safely cast non-equivalent float64 to int64
>>> df
a
0 NaN
1 inf
2 1.0
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, "foo"])})
>>> with pd.option_context("mode.use_inf_as_na", False):
... df = df.astype("string")
...
>>> df
a
0 nan
1 inf
2 foo
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, "foo"])})
>>> with pd.option_context("mode.use_inf_as_na", True):
... df = df.astype("string")
...
>>> df
a
0 nan
1 inf
2 foo
>>>
>>>
pd.options.mode.use_inf_as_na = True doesn't change inf to nan, so I'm not sure an astype should return a string or a nan. For example:
>>> pd.options.mode.use_inf_as_na = True
>>> x = pd.Series([np.inf, np.nan])
>>> x
0 NaN
1 NaN
dtype: float64
>>> x[0]
inf  # it's an inf; only the repr is a NaN, and it's treated as nan in calculations...
Pandas 1.0 had not decided on how to convert infs to strings when pd.options.mode.use_inf_as_na = True:
>>> pd.options.mode.use_inf_as_na = True
>>> x = pd.Series([np.inf, np.nan])
>>> x.astype(str)
0 inf # <- string "inf", not string "nan"
1 nan
dtype: object
>>> x.astype("string")
0 <NA> # <- pd.NA
1 <NA>
dtype: string
Probably a convention for how inf should be converted when use_inf_as_na = True needs to be decided on? In that case it'd be a different PR IMO.
Probably a convention for how inf should be converted when use_inf_as_na = True needs to be decided on? In that case it'd be a different PR IMO.
sure
I think now that we are using semver, I'm happy including all the bugfixes in patch versions. I think that'll keep minor releases more focused on enhancements, but it would also mean more work maintaining the backport branch. Let's see what others think.
pandas/_libs/lib.pyx
Outdated
@@ -1698,6 +1698,48 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
    return validator.validate(values)


cpdef ndarray ensure_string_array(
This looks like a generalization of https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L621; please de-dupe these as appropriate.
Updated. astype_str was only used once, so I've deleted it and used ensure_string_array instead.
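Roughly, the de-duplication means the old astype_str call site now goes through ensure_string_array. A hedged sketch of what that call path amounts to; the wrapper function and keyword choices here are assumptions based on this PR's diffs, not the merged code.

import numpy as np
from pandas._libs import lib

def astype_str_via_ensure(arr: np.ndarray) -> np.ndarray:
    # convert everything to str in an object array, leaving existing
    # missing values untouched instead of stringifying them
    return lib.ensure_string_array(arr.ravel(), convert_na_value=False).reshape(arr.shape)

result = astype_str_via_ensure(np.array([[1.5, np.nan], [2.0, 3.0]]))
# object array holding '1.5', nan, '2.0', '3.0'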
force-pushed from 0562d08 to 1337e7e
Can you add a memory asv for this case?
I'm not familiar with memory ASVs, only timing ones. Can you point to a file where I can find one? (I can then use that as a template for mine.)
https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/rolling.py#L24
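For reference, a minimal sketch of what such a benchmark could look like, following the pattern in the linked file; asv collects time_* methods as timing benchmarks and peakmem_* methods as peak-memory benchmarks. The class name and data below are illustrative, not the benchmark actually added in this PR.

import pandas as pd


class StringSeriesConstruction:
    def setup(self):
        # data size mirrors the example used in this PR's timings
        self.data = ["aaaaaa"] * 201368

    def time_construction(self):
        pd.Series(self.data, dtype="string")

    def peakmem_construction(self):
        pd.Series(self.data, dtype="string")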
pandas/core/dtypes/cast.py
Outdated
    subarr2 = subarr.astype(object)
    subarr2[na_values] = np.asarray(values, dtype=object)[na_values]
    subarr = subarr2
    values = np.asarray(values, dtype="object")
Don't actually need the np.asarray call here.
Ok.
force-pushed from f69e74c to bfe387e
I've added ASVs. However, I'm not able to run them on my machine (Windows 10), so I'm not sure they show what I think they should show... Maybe someone could check it?
You can just run %memit to assert that they are working, no?
@topper-123 can you rebase and show the ...
Ok, I've used %memit, didn't know about that, it's nice. Results:

>>> data = ["aaaaaa" for _ in range(201368)]
>>>
>>> %timeit pd.Series(data, dtype=str)
75.2 ms ± 725 µs per loop  # master
40.6 ms ± 389 µs per loop  # this PR
>>> %memit pd.Series(data, dtype=str)
peak memory: 100.96 MiB, increment: 18.87 MiB  # master
peak memory: 84.96 MiB, increment: 1.63 MiB  # this PR
>>>
>>> %timeit pd.Series(data, dtype="string")
70 ms ± 928 µs per loop  # master
41.9 ms ± 910 µs per loop  # this PR
>>> %memit pd.Series(data, dtype="string")
peak memory: 96.68 MiB, increment: 13.47 MiB  # master
peak memory: 85.25 MiB, increment: 1.55 MiB  # this PR

So a significant decrease in both time and memory usage. The time and memory improvements are of course related, coming from avoiding the (simplified) pattern ...
force-pushed from 52e53b1 to 47b5d69
thanks @topper-123
@meeseeksdev backport 1.1.x
…35770) Co-authored-by: Terji Petersen <contribute@tensortable.com>
    tm.assert_numpy_array_equal(a, original)

    expected = nan_arr if copy else na_arr
    tm.assert_numpy_array_equal(nan_arr, expected)
I am not sure this test is correctly changed. I think we should never mutate the input, whether copy is True or False (which is what it was testing before).
IMO the copy keyword is to indicate to simply always copy, or if False it will only copy when needed. And when you need to mutate, I would say the copy is needed.
It's also not taking a copy of the original array, so it's not even checking that it wasn't changed (because even if it was changed, it would still compare equal to itself).
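A small sketch of the point above: snapshot the input before the call, otherwise the comparison can never detect a mutation. Keyword names follow the diff in this PR; this is illustrative, not the actual test code.

import numpy as np
import pandas as pd
from pandas._libs import lib

nan_arr = np.array(["a", np.nan], dtype=object)
snapshot = nan_arr.copy()    # taken before the call, so changes are detectable

result = lib.ensure_string_array(nan_arr, na_value=pd.NA, copy=False)

# comparing nan_arr to a pre-call snapshot tells us whether the input was
# mutated; comparing nan_arr to itself would always "pass"
mutated = any(x is not y for x, y in zip(nan_arr, snapshot))
print("input mutated:", mutated)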
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
@ldacey, can you check whether this fixes your problem?
CC @simonjayhawkins