Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Remove int64 uint64 float64 index part 1 #49560

Merged

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Nov 6, 2022

This PR makes progress towards removing Int64Index, UInt64Index & FloatIndex from the code base, see #42717.

In this PR I make instantiation of Index with numpy numeric dtypes return a NumericIndex with the given dtype rather than converting to the old 64 bit only numeric index types, making all numeric dtypes available for indexes.

This is just the first part of the changes planned. In follow-ups PRs I will:

related PRs

A part of the problems I had in #49494 was a few corner case bugs. They're pulled out into seperate PRs (#49536 & #49540). They're present in this PR at present, but will be removed after those seperate PRs have been merged.

Notable changes from this PR:

As mentioned in #49494, one notable change from the PR is that because we now have all the numeric dtypes available for indexes, indexes created from DateTimeIndex (day, year etc.) will now be in int32, where previously they were forced to int64 (because that was the only index integer dtype available).

Using int32 is the more correct way, because the underlying DateTimeArray returns 32bit arrays. An example of this:

>>> import pandas as pd
>>> 
>>> x = pd.date_range(start='1/1/2018', end='1/08/2018')
>>> x.array.day
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int32)  # 32bit, current and will not be changed in this series of PRs
>>> x.day
Int64Index([1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')  # before this PR
NumericIndex([1, 2, 3, 4, 5, 6, 7, 8], dtype='int32')  # after this PR
Index([1, 2, 3, 4, 5, 6, 7, 8], dtype='int32')  # after follow-up PRs

The above will be included in the doc changes, when I do the docs

Progress towards #42717. See also #41272

@topper-123 topper-123 changed the title Remove int64 uint64 float64 index part 1 DEPR: Remove int64 uint64 float64 index part 1 Nov 6, 2022
@topper-123 topper-123 force-pushed the remove_int64_uint64_float64_index_part_1 branch from 844ab1b to 5ff5584 Compare November 6, 2022 23:42
@topper-123 topper-123 force-pushed the remove_int64_uint64_float64_index_part_1 branch from 0dc2660 to b332214 Compare November 7, 2022 05:26
@mroeschke mroeschke added the Deprecate Functionality to remove in pandas label Nov 7, 2022
@topper-123 topper-123 force-pushed the remove_int64_uint64_float64_index_part_1 branch from 527be72 to df7afae Compare November 8, 2022 12:42
@jbrockmendel
Copy link
Member

Out of curiosity: the approach I would have taken would have been to start with changing Index.__new__ to return NumericIndex instead of subclasses. Would that be counter-productive?

@topper-123
Copy link
Contributor Author

Out of curiosity: the approach I would have taken would have been to start with changing Index.__new__ to return NumericIndex instead of subclasses. Would that be counter-productive?

Yes, possibly. My main problem is that on my computer intp means int64, while on some machines intp means int32. Also there have been some smaller bugs that I've found working on this that slowed me down.

I try again tonight or tomorrow and I will try the Index.__new__ approach if I'm still having problems. Thanks for the suggestion.

@topper-123
Copy link
Contributor Author

Alright, it's all green finally! The failure in Python Dev / actions-311-dev (macOS-latest) (pull_request) says ERROR: Could not find a version that satisfies the requirement setuptools==59.2.0, so I assume unrelated to the PR).

This is quite a big PR, with lot of dtypes changes stemming from the fact that in this PR Index is able to return indexes with dtypes int8/int16/int32/uint8/uint16/uint32/float16/float32. There are some follow-ups after this as explained in the OP, but those should be much easier than this PR, because all the tricky dtype stuff is handled in this PR.

@topper-123 topper-123 force-pushed the remove_int64_uint64_float64_index_part_1 branch from 6655dfe to 4c337f6 Compare January 11, 2023 15:01
@topper-123
Copy link
Contributor Author

Updated after #50195 was merged + rebased. The remaining failure looks unrelated.

if self._is_backward_compat_public_numeric_index:
# this block is needed so e.g. NumericIndex[int8].astype("int32") returns
# NumericIndex[int32] and not Int64Index with dtype int64.
if not self._is_backward_compat_public_numeric_index and not isinstance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this chunk of code is going to be removed in a follow-up right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It will be removed in the followup #50052.

@@ -212,7 +217,15 @@ def test_series_from_coo(self, dtype, dense_index):

A = scipy.sparse.eye(3, format="coo", dtype=dtype)
result = pd.Series.sparse.from_coo(A, dense_index=dense_index)
index = pd.MultiIndex.from_tuples([(0, 0), (1, 1), (2, 2)])

index_dtype = np.int64 if dense_index else np.int32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it clear we want these to be different? if not, can you add a TODO comment to that effect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think they should be the same, but we need to decide which dtype should be used.

We have for both 32-bit and 64-bit systems:

>>> import scipy.sparse
>>> A = scipy.sparse.eye(3, format="coo", dtype=dtype)
>>> A.row, A.col
(array([0, 1, 2], dtype=int32), array([0, 1, 2], dtype=int32))

I.e. scipy.sparse always uses int32 for index, also on 64-bit systems. The pandas way OTOH is to default to 64-bit dtypes, if an index dtype hasn't been specified, also on 32-bit systems.

So the question is if we should use the pandas or the scipy convention for index dtype when calling pd.Series.sparse.from_coo.

My intuition is to follow the scipy way is this instance (i.e. use 32-bit), as pd.Series.sparse.from_coo is about converting an existing scipy data structure, so we should follow that data structure's conventions, where possible. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id add a TODO comment saying roughly what you said here and then punt on it

@@ -445,7 +445,8 @@ def DecimalArray__my_sum(self):
result = df.groupby("id")["decimals"].agg(lambda x: x.values.my_sum())
tm.assert_series_equal(result, expected, check_names=False)
s = pd.Series(DecimalArray(data))
result = s.groupby(np.array([0, 0, 0, 1, 1])).agg(lambda x: x.values.my_sum())
grouper = np.array([0, 0, 0, 1, 1], dtype=np.int64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i asked elsewhere: specifying this is benign, just affects result.index.dtype or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this just affects result.index.dtype.

@@ -388,7 +388,7 @@ def test_to_csv_dup_cols(self, nrows):

@pytest.mark.slow
def test_to_csv_empty(self):
df = DataFrame(index=np.arange(10))
df = DataFrame(index=np.arange(10, dtype=np.int64))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is bc when we do to_csv followed by read_csv it doesnt roundtrip int32?

Copy link
Contributor Author

@topper-123 topper-123 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, read_csv can't know the dtypes of its read-in columns, so it gives dtype int64 to integer columns and indexes.

# add / sum
result = df.groupby(pd.cut(df["a"], grps), observed=observed)._cython_agg_general(
"sum", alt=None, numeric_only=True
)
intervals = pd.interval_range(0, 20, freq=5)
intervals = pd.IntervalIndex.from_breaks(np.arange(0, 21, 5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought IntervalArray converted 32bit ints to 64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is true. This changed line is from before IntervalIndex was ensured to be 64-bit, so this change is not needed anymore; the old and the new line are equivalent now. I'll revert this.

# https://github.com/twosigma/pandas/issues/53
dtype = np.dtype(any_int_numpy_dtype).type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this something other than dtype, since it isnt one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can call it typ.

@@ -172,7 +172,7 @@ def test_nunique_preserves_column_level_names():
# GH 23222
test = DataFrame([1, 2, 2], columns=pd.Index(["A"], name="level_0"))
result = test.groupby([0, 0, 0]).nunique()
expected = DataFrame([2], columns=test.columns)
expected = DataFrame([2], index=np.array([0], dtype=np.int_), columns=test.columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the np.int_ here needed? ive been assuming that the index=np.array([0]) ive been seeing has been there to ensure the indexes get np.int_ dtype.

Copy link
Contributor Author

@topper-123 topper-123 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not needed anymore. This is from before #49815 was merged. I'll update it.

@topper-123
Copy link
Contributor Author

I'll wait for your response to the question about the pd.Series.sparse.from_coo question before I'll push an update for the PR.

@@ -87,7 +87,9 @@ def test_bins_from_interval_index_doc_example():
# Make sure we preserve the bins.
ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
c = cut(ages, bins=[0, 18, 35, 70])
expected = IntervalIndex.from_tuples([(0, 18), (18, 35), (35, 70)])
expected = IntervalIndex.from_arrays(
left=np.array([0, 18, 35]), right=np.array([18, 35, 70])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal but why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not needed after #50195. I'll revert.

[
([0, 1, 2, 3], [0, 0, 1, 1]),
([0], [0]),
*[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find this pattern difficult to read. is there a viable alternative?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a new param @pytest.mark.parametrize("dtype", [None] + tm.ALL_INT_NUMPY_DTYPES) and then

if dtype is not None:
    data = np.array(data, dtype=dtype)
    groups = np.array(groups, dtype=dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can do that.

(timedelta_range("1 day", periods=10), "<m8[ns]"),
]
)
def breaks_and_expected_subtype(self, request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the same as the one above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed entirely now. I'll update.

@@ -29,7 +29,8 @@ class TestIntervalRange:
@pytest.mark.parametrize("freq, periods", [(1, 100), (2.5, 40), (5, 20), (25, 4)])
def test_constructor_numeric(self, closed, name, freq, periods):
start, end = 0, 100
breaks = np.arange(101, step=freq)
dtype = np.int64 if is_integer(freq) else np.float64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as elsewhere, isnt this done automatically now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, after #50195... I'll revert.

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions, no blockers. LGTM cc @mroeschke

@topper-123
Copy link
Contributor Author

Updated. All comments have been addressed AFAIKT.

@mroeschke mroeschke merged commit d010c4a into pandas-dev:main Jan 13, 2023
@mroeschke
Copy link
Member

Awesome, thanks @topper-123

@jbrockmendel
Copy link
Member

this is huge, thanks @topper-123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants