Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#5423: Add a NumPy API to Modin #5422

Merged
merged 43 commits into from
Feb 9, 2023

Conversation

RehanSD
Copy link
Collaborator

@RehanSD RehanSD commented Dec 12, 2022

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Add a NumPy API Layer to Modin #5423
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@RehanSD RehanSD requested a review from a team as a code owner December 12, 2022 19:25
@RehanSD RehanSD changed the title FEAT-XXXX: Add a NumPy API to Modin FEAT-5423: Add a NumPy API to Modin Dec 12, 2022
RehanSD and others added 8 commits December 12, 2022 11:39
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Bill Wang <billiam@ponder.io>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Bill Wang <billiam@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
modin/numpy/arr.py Fixed Show fixed Hide fixed
modin/numpy/__init__.py Fixed Show fixed Hide fixed
modin/numpy/__init__.py Fixed Show resolved Hide resolved
modin/numpy/__init__.py Fixed Show fixed Hide fixed
modin/numpy/arr.py Fixed Show fixed Hide fixed
modin/numpy/arr.py Fixed Show fixed Hide fixed
Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RehanSD is this ready for reviewing, or should this be considered a draft?

@RehanSD
Copy link
Collaborator Author

RehanSD commented Dec 12, 2022

@RehanSD is this ready for reviewing, or should this be considered a draft?

Not yet - will mark as draft!

@RehanSD RehanSD marked this pull request as draft December 12, 2022 20:29
@RehanSD RehanSD changed the title FEAT-5423: Add a NumPy API to Modin FEAT-#5423: Add a NumPy API to Modin Dec 17, 2022
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
@RehanSD
Copy link
Collaborator Author

RehanSD commented Jan 12, 2023

@vnlitvinov @devin-petersohn @noloerino would appreciate some insight regarding broadcast operations. Operations involving two 2-dimensional array that don't require broadcast follow pretty straightforwardly from Modin's QC code, but when we have a 1D object (object A) and a 2D object in a binary op, or a 2D object (with only one row) (object B) and a 2D object in a binary op, things get a little bit complicated, since we have to both understand the difference between object A and object B when displaying, but also when broadcasting - e.g. on add, object A and object B will broadcast the same, but on a dot product, they will not. I've included a preliminary approach to this in this PR, but would love folks' insight on this. We're also partially blocked here by #5529 .
Some examples:

In [1]: import numpy as np

In [2]: arr = np.array([-1, 0, 1])

In [3]: matrix = np.array([[1, 2, 3], [4, 5, 6]])

In [4]: arr @ matrix
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 arr @ matrix

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)

In [5]: matrix @ arr
Out[5]: array([2, 2])

In [6]: arr * matrix
Out[6]:
array([[-1,  0,  3],
       [-4,  0,  6]])

In [7]: arr = np.array([arr])

In [8]: arr @ matrix
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [8], in <cell line: 1>()
----> 1 arr @ matrix

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)

In [9]: matrix @ arr
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 matrix @ arr

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)

In [10]: arr * matrix
Out[10]:
array([[-1,  0,  3],
       [-4,  0,  6]])

In [11]: arr + matrix
Out[11]:
array([[0, 2, 4],
       [3, 5, 7]])

In [12]: np.array(arr[0]) + matrix
Out[12]:
array([[0, 2, 4],
       [3, 5, 7]])

I guess my two questions are:

  1. Is this something we want to tackle now, or should the prototype just assume we're broadcasting with nice shapes (i.e. only 1D objects and 2D objects, and no nested 1D objects)
  2. What do folks' think of this approach? This is a modified version of @noloerino's approach here for context.

As a quick caveat, the dot broadcasting works correctly for the normal Modin QC, so in that case, we'd only need to figure out how to do the accounting to understand the difference between [1, 2, 3] and [[1, 2, 3]], and special case [[1, 2, 3]] to broadcast, since that's stored row-wise, and we store series column-wise, and broadcast correctly when it's a series.

@noloerino
Copy link
Collaborator

It seems like the actual numpy broadcasting rules (here) are not actually that complicated, especially since our arrays are never going to have more than 2 dimensions. The 2D -> 1D nested array example you give ([1, 2, 3] vs. [[1, 2, 3]]) would be covered by the case of the last dimension matching, as first array has shape (3,) and the second has shape (1, 3). I imagine we wouldn't support any further nested arrays.

That said, if the pandas query compiler implementation of broadcasting rules isn't working, then it might be worthwhile to just get the simplest cases working from within the numpy frontend prototype, before circling back to make a more robust fix.

@YarShev
Copy link
Collaborator

YarShev commented Jan 12, 2023

@RehanSD, don't we want to have a new hierarchy of the objects for NumPy API (NumPy API, NumpyQC, NumpyOnSmthDataframe, NumpyOnSmthPartitionManager, NumpyOnSmthDataframePartitions)?

@RehanSD
Copy link
Collaborator Author

RehanSD commented Jan 12, 2023

@RehanSD, don't we want to have a new hierarchy of the objects for NumPy API (NumPy API, NumpyQC, NumpyOnSmthDataframe, NumpyOnSmthPartitionManager, NumpyOnSmthDataframePartitions)?

Not right now - for now, we'd like to try an API Layer that can use any backend query compiler, so we can hopefully get a bunch of NumPy functionality just based off of the existing QC functionality we have!

@YarShev
Copy link
Collaborator

YarShev commented Jan 12, 2023

@RehanSD, I think we should add the inner layers anyway as, otherwise, that would be kind of incorrect in terms of the logic. The users would use NumPy API but the storage format is pandas, for instance. It looks confusing.

Signed-off-by: Rehan Durrani <rehan@ponder.io>
@YarShev
Copy link
Collaborator

YarShev commented Jan 12, 2023

@RehanSD, if we do not want to add the inner layers right now, it seems to me that we should add some notes regarding the execution/processing to the docs.

Signed-off-by: Rehan Durrani <rehan@ponder.io>
@RehanSD RehanSD marked this pull request as ready for review January 12, 2023 23:39
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
@RehanSD RehanSD dismissed vnlitvinov’s stale review February 6, 2023 23:45

Code has been heavily updated since the last review - would appreciate a review on the new code!

modin/numpy/__init__.py Outdated Show resolved Hide resolved
modin/numpy/arr.py Outdated Show resolved Hide resolved
def check_how_broadcast_to_output(arr_in: "array", arr_out: "array"):
if not isinstance(arr_out, array):
raise TypeError("return arrays must be of modin.numpy.array type.")
if arr_out._ndim == arr_in._ndim and arr_out.shape != arr_in.shape:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is inaccurate, since it fails for a broadcastable case like a 4x2 w/ 4x1 (unless we're not yet supporting this)

>>> A = numpy.array([[0,1,2,3], [4,5,6,7]])
>>> B = numpy.array([[8,9,10,11]])
>>> A + B
array([[ 8, 10, 12, 14],
       [12, 14, 16, 18]])

This would fall into the broadcasting rules case where the rightmost dimension of A and B, and the next dimension of B is 1 so that of A doesn't matter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually never mind, I think I misunderstood the purpose of this function (I thought it was comparing the dimensions of two inputs of binary operators). Based on how this function is called, is it correct to assume that arr_in is always the result of some computation, which may have already broadcast its inputs as necessary?

Copy link
Collaborator Author

@RehanSD RehanSD Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup - this is just to see how we have to broadcast our arr_in to get it to fit in arr_out. (i.e. when out is passed in to some API like add)

modin/numpy/arr.py Outdated Show resolved Hide resolved
return result


def find_common_dtype(dtypes):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use numpy.common_type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy.common_type does not resolve dtypes how we'd like it to. Take the following example:

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame([[1, '2'], [3, '4']])

In [6]: numpy.find_common_type(df.dtypes.values, [])
Out[6]: dtype('O')

In [7]: numpy.array([[1, '2'], [3, '4']]).dtype
Out[7]: dtype('<U21')

the find_common_dtype method I wrote uses numpy.promote_types under the hood, and gives us the correct dtype when we our query compiler has mixed types like the df in the example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we don't use promote_types directly since it only accepts two types at a time, so this method I wrote is literally just a wrapper that tree reduces on a list of dtypes.

def _get_shape(self):
if self._ndim == 1:
return (len(self._query_compiler.index),)
return (len(self._query_compiler.index), len(self._query_compiler.columns))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you put this assert in? I don't see it in this version of the code.

ErrorMessage.single_warning(
"Array order besides 'C' is not currently supported in Modin. Defaulting to 'C' order."
)
if hasattr(a, "flatten"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we discussed using a private variable on modin numpy arrays to check for whether to dispatch to the modin method or default to numpy. Is there a reason we're not taking that approach here, or that we're not defaulting to numpy and instead erroring if there's no flatten attribute?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies for all other wrapper methods in array_shaping, math, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check here is to ensure we're operating on a modin.numpy.array. If there's anything that isn't implemented in the modin.numpy.array, it won't be in the modin.numpy namespace - and if someone tries numpy.method(modin.numpy.array) that will either try array_function array_ufunc. This check is basically trying to assert that the object (a) is of type modin.numpy.array - I'll make this check more explicit.

modin/numpy/constants.py Show resolved Hide resolved
modin/numpy/math.py Show resolved Hide resolved
numpy_result = numpy_arr.max(initial=0, where=False)
assert modin_result == numpy_result
with pytest.raises(ValueError):
modin_result = modin_arr.max(out=modin_arr, keepdims=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codeql warning here, i'm assuming this is just checking for the presence of an error, but if that's the case we should remove the assign to modin_result

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - will fix!

Signed-off-by: Rehan Durrani <rehan@ponder.io>
modin/numpy/arr.py Fixed Show resolved Hide resolved
"""Module houses array creation methods for Modin's NumPy API."""
import numpy
from modin.error_message import ErrorMessage
from .arr import array

Check notice

Code scanning / CodeQL

Cyclic import

Import of module [modin.numpy.arr](1) begins an import cycle.
out._query_compiler = where.where(result, out)._query_compiler
return out
elif not where:
from .array_creation import zeros_like

Check notice

Code scanning / CodeQL

Cyclic import

Import of module [modin.numpy.array_creation](1) begins an import cycle.
return array(_query_compiler=new_ufunc(*args, **kwargs), _ndim=out_ndim)

def __array_function__(self, func, types, args, kwargs):
from . import array_creation as creation, array_shaping as shaping, math

Check notice

Code scanning / CodeQL

Cyclic import

Import of module [modin.numpy.array_creation](1) begins an import cycle. Import of module [modin.numpy.array_shaping](2) begins an import cycle. Import of module [modin.numpy.math](3) begins an import cycle.
import numpy

from modin.error_message import ErrorMessage
from .arr import array

Check notice

Code scanning / CodeQL

Cyclic import

Import of module [modin.numpy.arr](1) begins an import cycle.

import numpy

from .arr import array

Check notice

Code scanning / CodeQL

Cyclic import

Import of module [modin.numpy.arr](1) begins an import cycle.
Signed-off-by: Rehan Durrani <rehan@ponder.io>
modin/numpy/arr.py Fixed Show resolved Hide resolved
def _get_shape(self):
if self._ndim == 1:
return (len(self._query_compiler.index),)
return (len(self._query_compiler.index), len(self._query_compiler.columns))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that's fine

Signed-off-by: Rehan Durrani <rehan@ponder.io>
noloerino
noloerino previously approved these changes Feb 7, 2023
modin/pandas/series.py Show resolved Hide resolved
modin/numpy/test/test_array.py Show resolved Hide resolved
modin/numpy/test/test_array.py Outdated Show resolved Hide resolved
modin/numpy/test/test_array.py Outdated Show resolved Hide resolved
modin/numpy/test/test_array.py Outdated Show resolved Hide resolved
modin/numpy/array_shaping.py Outdated Show resolved Hide resolved
modin/numpy/array_creation.py Outdated Show resolved Hide resolved
modin/numpy/arr.py Show resolved Hide resolved
modin/numpy/arr.py Outdated Show resolved Hide resolved
modin/numpy/arr.py Show resolved Hide resolved
…x formatting issues)

Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
@YarShev
Copy link
Collaborator

YarShev commented Feb 8, 2023

More or less LGTM. A few unresolved comments are left from my side. Please respond on them.

#5422 (comment)

#5422 (comment)

#5422 (comment)

#5422 (comment)

Signed-off-by: Rehan Durrani <rehan@ponder.io>
modin/utils.py Fixed Show fixed Hide fixed
modin/utils.py Fixed Show fixed Hide fixed
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Copy link
Collaborator

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @RehanSD !

modin/utils.py Show resolved Hide resolved
@YarShev YarShev merged commit e73fb4f into modin-project:master Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a NumPy API Layer to Modin
9 participants