implement multi-column assignment in '__setitem__' (draft) #4326
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #4326      +/-   ##
==========================================
+ Coverage   86.71%   89.55%   +2.84%
==========================================
  Files         222      222
  Lines       18038    18033       -5
==========================================
+ Hits        15642    16150     +508
+ Misses       2396     1883     -513
Force-pushed from a5795cd to add8505.
modin/pandas/dataframe.py (outdated)

res = self._query_compiler.write_items(
    row_numeric_index=slice(None),
    col_numeric_index=self.columns.get_indexer_for(key),
Here we lose information about the names of the new columns. Looks like the `write_items` function doesn't fit.
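For context, a small plain-pandas illustration (not Modin code) of why a purely positional indexer loses the new column names: labels that do not exist yet all map to -1.

```python
import pandas

df = pandas.DataFrame({"a": [1, 2], "b": [3, 4]})

# Existing labels map to their positions, but a not-yet-existing column
# maps to -1, so its name is no longer recoverable from positions alone.
print(df.columns.get_indexer_for(["a", "new_col"]))  # [ 0 -1]
```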
modin/pandas/dataframe.py (outdated)

@@ -2541,6 +2541,10 @@ def __setitem__(self, key, value):
            self.insert(loc=len(self.columns), column=key, value=value)
            return

        if isinstance(key, list) and is_list_like(value):
            self.loc[slice(None), key] = value
@dchigarev have you thought about using the `loc`/`iloc` functions in `__setitem__`? Does it make sense for me to continue working in this direction?
I think using `loc`/`iloc` may be a good approach if it supports forwarding a `slice(None)` or `range(0, len(df))` row indexer right through to `qc.write_items`. If it converts the indexer to a list of positions (`[0, 1, 2, ..., len(df)]`) along the way to `qc.write_items`, it would be much less beneficial performance-wise, as the expensive `modin_frame._get_dict_of_block_index` function would have to sequentially process a `len(df)`-long indexer in the list-like case (for a slice-like indexer it has a fast path).
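A rough, self-contained sketch of that point; this is not Modin's actual `_get_dict_of_block_index`, and the function name and block layout below are made up, but it shows why a slice indexer admits a fast path while a materialized list of positions forces per-row work.

```python
def map_rows_to_blocks(indexer, block_lengths):
    """Map a row indexer to {block_id: local positions within that block}."""
    if isinstance(indexer, slice) and indexer == slice(None):
        # Fast path: every block is taken whole; work is O(number of blocks).
        return {block_id: slice(None) for block_id in range(len(block_lengths))}

    # Slow path: each position has to be located individually,
    # i.e. O(len(indexer)) even when the result is simply "all rows".
    offsets, total = [], 0
    for length in block_lengths:
        offsets.append(total)
        total += length
    blocks = {}
    for pos in indexer:
        block_id = max(i for i, off in enumerate(offsets) if off <= pos)
        blocks.setdefault(block_id, []).append(pos - offsets[block_id])
    return blocks


# slice(None) touches each block once; list(range(6)) touches each row once.
print(map_rows_to_blocks(slice(None), [3, 3]))     # {0: slice(None, ...), 1: slice(None, ...)}
print(map_rows_to_blocks(list(range(6)), [3, 3]))  # {0: [0, 1, 2], 1: [0, 1, 2]}
```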
Force-pushed from 55e522f to 500517c.
@@ -1198,6 +1198,7 @@ def compute_part_size(indexer, remote_part, part_idx, axis):
        for row_idx, row_values in enumerate(row_partitions_list):
            row_blk_idx, row_internal_idx = row_values
            col_position_counter = 0
+           row_offset = 0
In some cases this variable was used before being defined.
@@ -3066,7 +3066,7 @@ def write_items(self, row_numeric_index, col_numeric_index, broadcasted_items):
            Row positions to write value.
        col_numeric_index : list of ints
            Column positions to write value.
-       broadcasted_items : 2D-array
+       broadcasted_items : 2D-array or scalar
Example: `modin_df.iloc[[1, 2]] = 42`
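For reference, the same assignment in plain pandas (whose semantics Modin mirrors here) broadcasts the scalar to every cell selected by the row indexer, which is the case the docstring change covers:

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A scalar on the right-hand side is broadcast over all selected rows/columns.
df.iloc[[1, 2]] = 42
print(df)
#     a   b
# 0   1   4
# 1  42  42
# 2  42  42
```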
@dchigarev ready for review
Force-pushed from 10acbc0 to da167e7.
@dchigarev ping
Force-pushed from da167e7 to cab9467.
A bit worried about `reindex` when inserting new labels inside `.loc`. BTW, do we really need this feature of inserting new indices/columns in `.loc` for this PR? Maybe we should split it into two:

- Multi-column assignment in `setitem` (without covering insertions of new labels)
- Fix insertion of new labels in `.loc` (we actually already tried to implement this in FIX-#3764: Ensure df.loc with a scalar out of bounds appends to df #3765, with no success though 😄)
# `setitem` can change an existing column/row and add a new column/row;
# in the case of adding new values, we should not use numerical indices,
# but the original labels, or we should update the index separately and do `reindex`
new_qc = self.qc.setitem(axis, self.qc.get_axis(axis ^ 1)[lookup[0]], item)
What's the problem with that?

Suggested change:
- new_qc = self.qc.setitem(axis, self.qc.get_axis(axis ^ 1)[lookup[0]], item)
+ new_qc = self.qc.setitem(axis, self.qc.get_axis(axis ^ 1)[lookup], item)
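A tiny illustration of what the suggested change is getting at (the `columns`/`lookup` names below are stand-ins, not the actual Modin variables): indexing the axis labels with the whole positional lookup keeps every target label, while `lookup[0]` keeps only the first one.

```python
import numpy as np
import pandas

columns = pandas.Index(["a", "b", "c"])
lookup = np.array([1, 2])  # positional lookup computed elsewhere

print(columns[lookup])     # Index(['b', 'c'], dtype='object') -- all target labels
print(columns[lookup[0]])  # 'b' -- only the first target label
```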
if not isinstance(row_lookup, slice)
else len(self.qc.index)
Why `len(self.qc.index)` and not `len(self.qc.index[row_lookup])`?
if len(column_values) != len(item.columns):
    # New values for columns/index mean that reindex will add NaN values
    item = item.reindex(index=index_values, columns=column_values)
What's an example of this case? For me, pandas raises an error if `self.columns[col_lookup]` contains extra values in comparison with `item.columns`:
>>> df = pandas.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 4, 2]})
>>> item = pandas.DataFrame({"c": [4, 2, 1]})
>>> df.loc[: ["b", "c"]] = item
AssertionError: End slice bound is non-scalar
I have the following error in that case (for Modin and pandas): `pandas.errors.InvalidIndexError: ['b', 'c']`. `compute_lookup` will raise the exception in Modin.
However, the behavior is different for `df[["b", "c"]] = item`.
if (
    hashable(col_loc)
    and not self.qc.lazy_execution
    and len(self.qc.index) == 0
):
    # what if there are other columns?
    if is_list_like(item):
        # do we need this manipulation with item?
        if isinstance(item, (pandas.DataFrame, DataFrame)):
            item = item[item.columns[0]].values
        elif isinstance(item, np.ndarray):
            assert (
                len(item.shape) < 3
            ), "Shape of new items must be compatible with manager shape"
            item = item.T.reshape(-1)
            if len(self.qc.index) > 0:
                item = item[: len(self.qc.index)]
        if not isinstance(item, (Series, Categorical, np.ndarray)):
            # why cast np.ndarray to list?
            item = list(item)

    new_self = DataFrame({col_loc: item}, columns=self.qc.columns)
    self.df._update_inplace(new_self._query_compiler)
    return
Why have this logic here? It looks generic enough to be put into `_LocationIndexerBase.__setitem__`.
    new_self = DataFrame({col_loc: item}, columns=self.qc.columns)
    self.df._update_inplace(new_self._query_compiler)
    return
axis = 0 if is_scalar(col_loc) else None
This should be `_determine_setitem_axis`'s responsibility to compute; doesn't it handle that case?

axis = 0 if is_scalar(col_loc) else None
# do we need this manipulation with item?
if isinstance(item, (pandas.DataFrame, DataFrame)):
    item = item[item.columns[0]].values
Why do we need this manipulation? It should be enough to consider `item` to be a `new_self`, shouldn't it?
    index = self.qc.index.append(missing_labels)
    self.qc = self.qc.reindex(labels=index, axis=0)
    self.df._update_inplace(new_query_compiler=self.qc)
elif axis == 1:
    columns = self.qc.columns.append(missing_labels)
    self.qc = self.qc.reindex(labels=columns, axis=1)
    self.df._update_inplace(new_query_compiler=self.qc)
Reindexing is very heavy as it triggers full-axis concatenation, do we really need it all the time? For a distributed-like `item` we can call `query_compiler.insert_item`, so maybe we should distribute `item` first (if it isn't yet) and then use `insert_item` instead of reindexing the frame?
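A minimal plain-pandas illustration of the pattern in the quoted code, appending missing labels and reindexing, which materializes NaN-filled columns (or rows for `axis=0`); on a distributed frame this is the full-axis reshuffle the comment is worried about.

```python
import pandas

df = pandas.DataFrame({"a": [1, 2], "b": [3, 4]})
missing_labels = pandas.Index(["c"])

# Same shape as the quoted code: append the missing labels,
# then reindex the whole frame, producing NaN-filled new columns.
columns = df.columns.append(missing_labels)
df = df.reindex(columns=columns)
print(df)
#    a  b   c
# 0  1  3 NaN
# 1  2  4 NaN
```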
Force-pushed from cab9467 to c086c94.
…tem__' Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Dmitry Chigarev <62142979+dchigarev@users.noreply.github.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Force-pushed from c086c94 to 16bea00.
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
We need to add both parts at the same time, since this case currently works, albeit slowly, but without errors (defaulting to pandas). I'll try to separate out the refactoring part.
Closing as already resolved by #5142.
Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com
What do these changes do?
- Use the `loc` function for the `__setitem__` function, thus reducing the complexity of the code that performs the indexing.
- All optimizations that were in the `__setitem__` function are now also moved to the `loc` function.
- For `axis==None` (when `axis in {0, 1}`, the `qc.setitem` function is executed), the following functions are executed on the path: `compute_lookup`, `broadcast_item`, `write_items`. They are written with one fundamental limitation: adding objects is only possible if the corresponding labels (those used in `key`) are already present in the object. To get around this limitation, I have to do an explicit reindexing, which adds rows/columns with NaN values. Besides the fact that we could do this operation directly in `write_items` (I think) much more efficiently than it is done now, it also leads to problems with changing column dtypes, which most likely does not correspond to the behavior of pandas.

- passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- signed commit with `git commit -s`
- Resolves multi-column assignment in `df.__setitem__` #4325
- `docs/development/architecture.rst` is up-to-date
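For illustration, a minimal usage example of the behavior this PR targets (assuming a working Modin installation; the data is made up): multi-column assignment through `__setitem__` is routed through `.loc` instead of defaulting to pandas.

```python
import numpy as np
import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Assign several existing columns at once through df[...] = value.
df[["a", "b"]] = np.array([[10, 40], [20, 50], [30, 60]])
print(df)
#     a   b
# 0  10  40
# 1  20  50
# 2  30  60
```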