-
Notifications
You must be signed in to change notification settings - Fork 61
Series combine #821
base: master
Are you sure you want to change the base?
Series combine #821
Conversation
|
||
len_val = max(len(self), len(other)) | ||
result = numpy.empty(len_val, self._data.dtype) | ||
for ind in range(len_val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we parallel the method based on chunks?
if fill_value is None: | ||
fill_value = numpy.nan | ||
|
||
len_val = max(len(self), len(other)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what if all indexes are different? I think we should use sdc_join_series_indexes to find len of result series
|
||
len_val = max(len(self), len(other)) | ||
result = numpy.empty(len_val, self._data.dtype) | ||
for ind in range(len_val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is case for non-indexes series. Also, it should rewrite with prange
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+
usage of chunks to predict scalability
examples/series/series_combine.py
Outdated
|
||
|
||
@njit | ||
def series_copy(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong name function
if fill_value is None: | ||
fill_value = numpy.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make fill_value type undefined at compile time. You can probably use the same approach as in operators:
sdc/sdc/sdc_function_templates.py
Line 144 in e87095a
_fill_value = numpy.nan if fill_value_is_none == True else fill_value # noqa |
fill_value = numpy.nan | ||
|
||
len_val = max(len(self), len(other)) | ||
result = numpy.empty(len_val, self._data.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually wrong, result dtype should be common dtype for result dtype of func(a, b) where a,b are series values and dtype of _fill_value. Provided tests do not cover this, but e.g. this (where fill_value is float and series are integers) won't pass:
def test_series_combine_integer_new(self):
def test_impl(S1, S2):
return S1.combine(S2, lambda a, b: 2 * a + b, 16.2)
hpat_func = self.jit(test_impl)
S1 = pd.Series([1, 2, 3, 4, 5])
S2 = pd.Series([6, 21, 3, 5])
result = hpat_func(S1, S2)
result_ref = test_impl(S1, S2)
print(f"DEBUG: result:\n{result},\nresult_ref:\n{result_ref}")
pd.testing.assert_series_equal(result, result_ref)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add deducing result dtype and add parallelization.
sdc/tests/test_series.py
Outdated
@@ -2825,7 +2817,7 @@ def test_impl(S1, S2): | |||
S2 = pd.Series([1, 2, 3, 4, 5]) | |||
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) | |||
|
|||
@skip_numba_jit | |||
@unittest.expectedFailure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment why the test is skipped.
@unittest.expectedFailure # ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa No need to skip the test if that's how impl is intended to work. Use check_dtype=False in assert_series_equal and add a comment just before this check to refer to SDC Limitation.
if self_indexes[j] == -1: | ||
val_self = _fill_value | ||
else: | ||
ind_self = self_indexes[j] | ||
val_self = self[ind_self]._data[0] | ||
|
||
if other_indexes[j] == -1: | ||
val_other = _fill_value | ||
else: | ||
ind_other = other_indexes[j] | ||
val_other = other[ind_other]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self_indexes[j] == -1: | |
val_self = _fill_value | |
else: | |
ind_self = self_indexes[j] | |
val_self = self[ind_self]._data[0] | |
if other_indexes[j] == -1: | |
val_other = _fill_value | |
else: | |
ind_other = other_indexes[j] | |
val_other = other[ind_other]._data[0] | |
self_idx = self_indexes[j] | |
if self_idx == -1: | |
val_self = _fill_value | |
else: | |
val_self = self[self_idx]._data[0] | |
other_idx = other_indexes[j] | |
if other_idx == -1: | |
val_other = _fill_value | |
else: | |
val_other = other[other_idx]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or
if self_indexes[j] == -1: | |
val_self = _fill_value | |
else: | |
ind_self = self_indexes[j] | |
val_self = self[ind_self]._data[0] | |
if other_indexes[j] == -1: | |
val_other = _fill_value | |
else: | |
ind_other = other_indexes[j] | |
val_other = other[ind_other]._data[0] | |
self_idx = self_indexes[j] | |
val_self = _fill_value if self_idx == -1 else self[self_idx]._data[0] | |
other_idx = other_indexes[j] | |
val_other = _fill_value if other_idx == -1 else other[other_idx]._data[0] |
def sdc_pandas_series_combine_impl(self, other, func, fill_value=None): | ||
|
||
if fill_value is None: | ||
fill_value = numpy.nan | ||
if fill_value is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if fill_value is not None: | |
_fill_value = numpy.nan if fill_value is None else fill_value: |
if self_idx == -1: | ||
val_self = _fill_value | ||
else: | ||
val_self = self[self_idx]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self_idx (and other_idx) is position in the Series, not the index, so instead of using getitem on a Series, that performs index lookup and returns a Series, so that you have to take _data[0] from it, you can just write:
val_self = self[self_idx]._data[0] | |
val_self = self._data[self_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK, but I would reorganize and extend existing tests a bit.
|
||
Limitations | ||
----------- | ||
- Only supports the case when data in series of the same type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is not correct - impl handles all cases. For the next line we need exact definition of difference to pandas, e.g:
- Only supports the case when data in series of the same type. | |
- Resulting series dtype may be wider than in pandas due to type-stability requirements and depends on fill_value dtype and result of series indexes alignment. |
sdc/tests/test_series.py
Outdated
@@ -2771,7 +2770,6 @@ def test_impl(S1, S2): | |||
S2 = pd.Series([6.0, 21., 3.6, 5.]) | |||
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) | |||
|
|||
@skip_numba_jit | |||
def test_series_combine_float3264(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has incorrect code, which should be corrected probably:
S1 = pd.Series([np.float64(1), np.float64(2),
np.float64(3), np.float64(4), np.float64(5)])
S2 = pd.Series([np.float32(1), np.float32(2),
np.float32(3), np.float32(4), np.float32(5)])
S2.dtype will be float64 on Win, not float32. Moreover, series dtype should be specified this way:
S1 = pd.Series([1, 2, 3, 4, 5], dtype=np.int64)
S2 = pd.Series([1, 2, 3, 4, 5], dtype=np.int32)
chunks = parallel_chunks(len_val) | ||
for i in prange(len(chunks)): | ||
chunk = chunks[i] | ||
for j in range(chunk.start, chunk.stop): | ||
self_idx = self_indexes[j] | ||
val_self = _fill_value if self_idx == -1 else self._data[self_idx] | ||
|
||
other_idx = other_indexes[j] | ||
val_other = _fill_value if other_idx == -1 else other._data[other_idx] | ||
|
||
result[j] = func(val_self, val_other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
chunks = parallel_chunks(len_val) | |
for i in prange(len(chunks)): | |
chunk = chunks[i] | |
for j in range(chunk.start, chunk.stop): | |
self_idx = self_indexes[j] | |
val_self = _fill_value if self_idx == -1 else self._data[self_idx] | |
other_idx = other_indexes[j] | |
val_other = _fill_value if other_idx == -1 else other._data[other_idx] | |
result[j] = func(val_self, val_other) | |
result = numpy.empty(len_val, res_dtype) | |
for i in prange(len(result)): | |
self_idx, other_idx = self_indexes[i], other_indexes[i] | |
val_self = _fill_value if self_idx == -1 else self._data[self_idx] | |
val_other = _fill_value if other_idx == -1 else other._data[other_idx] | |
result[i] = func(val_self, val_other) |
sdc/tests/test_series.py
Outdated
def test_series_combine_assert1(self): | ||
def test_impl(S1, S2): | ||
return S1.combine(S2, lambda a, b: 2 * a + b) | ||
hpat_func = self.jit(test_impl) | ||
|
||
S1 = pd.Series([1, 2, 3]) | ||
S2 = pd.Series([6., 21., 3., 5.]) | ||
with self.assertRaises(AssertionError): | ||
hpat_func(S1, S2) | ||
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment for tests: not all combinations of input series dtypes and fill_value are tested e.g. the one I mentioned before - where float fill_value is assigned to otherwise int series. There are no tests with series with non-default indexes (we refer to samelen, but it's not fully correct - series may have same len, but not same indexes), and no tests for checking func impact on result dtype, so it's hard to see from such tests what's really tested and what is not. So the suggestion is to organize tests in a different manner:
- product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_value_samelen - product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
with fill_value is omitted
Covers: test_series_combine_float3264, test_series_combine_integer_samelen, test_series_combine_samelen, test_series_combine_different_types - product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_integer, test_series_combine_value - product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is omitted
Covers: test_series_combine, test_series_combine_assert1, test_series_combine_assert2, test_series_combine_different_types
New test:
5. (for testing func changes dtype properly)
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value = 0
with diff functions (chaning and not chaning res dtype e.g. preserving int domain, e.g. ** and + and not, e.g. /)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, test 1 can look like this (it can also be split into two: one when we use check_dtype=False and one when we don't):
def test_series_combine_same_index_fill_value(self):
def test_impl(S1, S2):
return S1.combine(S2, lambda a, b: 2 * a + b)
hpat_func = self.jit(test_impl)
n = 11
np.random.seed(0)
A = np.random.randint(-100, 100, n)
B = np.arange(n) * 2 + 1
series_index = 1 + np.arange(n)
series_dtypes = [None, np.int64, np.float64]
fill_values = [None, np.nan, 4, 4.2]
for dtype1, dtype2, fill_value in product(series_dtypes, series_dtypes, fill_values):
S1 = pd.Series(A, index=series_index, dtype=dtype1)
S2 = pd.Series(B, index=series_index, dtype=dtype2)
with self.subTest(S1_dtype=dtype1, S2_dtype=dtype2, fill_value=fill_value):
result = hpat_func(S1, S2)
result_ref = test_impl(S1, S2)
# check_dtype=False due to difference to pandas in some cases
pd.testing.assert_series_equal(result, result_ref, check_dtype=False)
- Indixes should be strictly ascending, as inside the function | ||
they are sorted in ascending order and the answer becomes | ||
different from the result of the pandas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa What? This sounds like a bug...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kozlov-alexey This 'bug' in function sdc_join_series_indexes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa Then it should be fixed (please create a JIRA case with a reproducer) before this is merged. Adapting to a bug is no way. @AlexanderKalistratov correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very old PR. It is better to close it.
No description provided.