-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#5423: Add a NumPy API to Modin #5422
Conversation
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Bill Wang <billiam@ponder.io>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Bill Wang <billiam@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD is this ready for reviewing, or should this be considered a draft?
Not yet - will mark as draft! |
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
@vnlitvinov @devin-petersohn @noloerino would appreciate some insight regarding broadcast operations. Operations involving two 2-dimensional array that don't require broadcast follow pretty straightforwardly from Modin's QC code, but when we have a 1D object (object A) and a 2D object in a binary op, or a 2D object (with only one row) (object B) and a 2D object in a binary op, things get a little bit complicated, since we have to both understand the difference between object A and object B when displaying, but also when broadcasting - e.g. on add, object A and object B will broadcast the same, but on a dot product, they will not. I've included a preliminary approach to this in this PR, but would love folks' insight on this. We're also partially blocked here by #5529 . In [1]: import numpy as np
In [2]: arr = np.array([-1, 0, 1])
In [3]: matrix = np.array([[1, 2, 3], [4, 5, 6]])
In [4]: arr @ matrix
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 arr @ matrix
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)
In [5]: matrix @ arr
Out[5]: array([2, 2])
In [6]: arr * matrix
Out[6]:
array([[-1, 0, 3],
[-4, 0, 6]])
In [7]: arr = np.array([arr])
In [8]: arr @ matrix
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [8], in <cell line: 1>()
----> 1 arr @ matrix
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)
In [9]: matrix @ arr
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 matrix @ arr
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)
In [10]: arr * matrix
Out[10]:
array([[-1, 0, 3],
[-4, 0, 6]])
In [11]: arr + matrix
Out[11]:
array([[0, 2, 4],
[3, 5, 7]])
In [12]: np.array(arr[0]) + matrix
Out[12]:
array([[0, 2, 4],
[3, 5, 7]]) I guess my two questions are:
As a quick caveat, the dot broadcasting works correctly for the normal Modin QC, so in that case, we'd only need to figure out how to do the accounting to understand the difference between [1, 2, 3] and [[1, 2, 3]], and special case [[1, 2, 3]] to broadcast, since that's stored row-wise, and we store series column-wise, and broadcast correctly when it's a series. |
It seems like the actual numpy broadcasting rules (here) are not actually that complicated, especially since our arrays are never going to have more than 2 dimensions. The 2D -> 1D nested array example you give ([1, 2, 3] vs. [[1, 2, 3]]) would be covered by the case of the last dimension matching, as first array has shape That said, if the pandas query compiler implementation of broadcasting rules isn't working, then it might be worthwhile to just get the simplest cases working from within the numpy frontend prototype, before circling back to make a more robust fix. |
@RehanSD, don't we want to have a new hierarchy of the objects for NumPy API (NumPy API, NumpyQC, NumpyOnSmthDataframe, NumpyOnSmthPartitionManager, NumpyOnSmthDataframePartitions)? |
Not right now - for now, we'd like to try an API Layer that can use any backend query compiler, so we can hopefully get a bunch of NumPy functionality just based off of the existing QC functionality we have! |
@RehanSD, I think we should add the inner layers anyway as, otherwise, that would be kind of incorrect in terms of the logic. The users would use NumPy API but the storage format is pandas, for instance. It looks confusing. |
Signed-off-by: Rehan Durrani <rehan@ponder.io>
@RehanSD, if we do not want to add the inner layers right now, it seems to me that we should add some notes regarding the execution/processing to the docs. |
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Code has been heavily updated since the last review - would appreciate a review on the new code!
def check_how_broadcast_to_output(arr_in: "array", arr_out: "array"): | ||
if not isinstance(arr_out, array): | ||
raise TypeError("return arrays must be of modin.numpy.array type.") | ||
if arr_out._ndim == arr_in._ndim and arr_out.shape != arr_in.shape: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is inaccurate, since it fails for a broadcastable case like a 4x2 w/ 4x1 (unless we're not yet supporting this)
>>> A = numpy.array([[0,1,2,3], [4,5,6,7]])
>>> B = numpy.array([[8,9,10,11]])
>>> A + B
array([[ 8, 10, 12, 14],
[12, 14, 16, 18]])
This would fall into the broadcasting rules case where the rightmost dimension of A and B, and the next dimension of B is 1 so that of A doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually never mind, I think I misunderstood the purpose of this function (I thought it was comparing the dimensions of two inputs of binary operators). Based on how this function is called, is it correct to assume that arr_in
is always the result of some computation, which may have already broadcast its inputs as necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup - this is just to see how we have to broadcast our arr_in to get it to fit in arr_out. (i.e. when out
is passed in to some API like add)
return result | ||
|
||
|
||
def find_common_dtype(dtypes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use numpy.common_type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numpy.common_type does not resolve dtypes how we'd like it to. Take the following example:
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame([[1, '2'], [3, '4']])
In [6]: numpy.find_common_type(df.dtypes.values, [])
Out[6]: dtype('O')
In [7]: numpy.array([[1, '2'], [3, '4']]).dtype
Out[7]: dtype('<U21')
the find_common_dtype
method I wrote uses numpy.promote_types under the hood, and gives us the correct dtype when we our query compiler has mixed types like the df in the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we don't use promote_types
directly since it only accepts two types at a time, so this method I wrote is literally just a wrapper that tree reduces on a list of dtypes.
def _get_shape(self): | ||
if self._ndim == 1: | ||
return (len(self._query_compiler.index),) | ||
return (len(self._query_compiler.index), len(self._query_compiler.columns)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you put this assert in? I don't see it in this version of the code.
modin/numpy/array_shaping.py
Outdated
ErrorMessage.single_warning( | ||
"Array order besides 'C' is not currently supported in Modin. Defaulting to 'C' order." | ||
) | ||
if hasattr(a, "flatten"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we discussed using a private variable on modin numpy arrays to check for whether to dispatch to the modin method or default to numpy. Is there a reason we're not taking that approach here, or that we're not defaulting to numpy and instead erroring if there's no flatten
attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same applies for all other wrapper methods in array_shaping
, math
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check here is to ensure we're operating on a modin.numpy.array. If there's anything that isn't implemented in the modin.numpy.array, it won't be in the modin.numpy namespace - and if someone tries numpy.method(modin.numpy.array) that will either try array_function array_ufunc. This check is basically trying to assert that the object (a) is of type modin.numpy.array - I'll make this check more explicit.
modin/numpy/test/test_array.py
Outdated
numpy_result = numpy_arr.max(initial=0, where=False) | ||
assert modin_result == numpy_result | ||
with pytest.raises(ValueError): | ||
modin_result = modin_arr.max(out=modin_arr, keepdims=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codeql warning here, i'm assuming this is just checking for the presence of an error, but if that's the case we should remove the assign to modin_result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - will fix!
Signed-off-by: Rehan Durrani <rehan@ponder.io>
"""Module houses array creation methods for Modin's NumPy API.""" | ||
import numpy | ||
from modin.error_message import ErrorMessage | ||
from .arr import array |
Check notice
Code scanning / CodeQL
Cyclic import
out._query_compiler = where.where(result, out)._query_compiler | ||
return out | ||
elif not where: | ||
from .array_creation import zeros_like |
Check notice
Code scanning / CodeQL
Cyclic import
return array(_query_compiler=new_ufunc(*args, **kwargs), _ndim=out_ndim) | ||
|
||
def __array_function__(self, func, types, args, kwargs): | ||
from . import array_creation as creation, array_shaping as shaping, math |
Check notice
Code scanning / CodeQL
Cyclic import
import numpy | ||
|
||
from modin.error_message import ErrorMessage | ||
from .arr import array |
Check notice
Code scanning / CodeQL
Cyclic import
|
||
import numpy | ||
|
||
from .arr import array |
Check notice
Code scanning / CodeQL
Cyclic import
Signed-off-by: Rehan Durrani <rehan@ponder.io>
def _get_shape(self): | ||
if self._ndim == 1: | ||
return (len(self._query_compiler.index),) | ||
return (len(self._query_compiler.index), len(self._query_compiler.columns)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, that's fine
Signed-off-by: Rehan Durrani <rehan@ponder.io>
…x formatting issues) Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
More or less LGTM. A few unresolved comments are left from my side. Please respond on them. |
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @RehanSD !
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date