FEAT-#5423: Add a NumPy API to Modin #5422

RehanSD · 2022-12-12T19:25:33Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #5423 (Add a NumPy API Layer to Modin)
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

vnlitvinov

@RehanSD is this ready for reviewing, or should this be considered a draft?

RehanSD · 2022-12-12T20:28:37Z

@RehanSD is this ready for reviewing, or should this be considered a draft?

Not yet - will mark as draft!

RehanSD · 2023-01-12T03:37:50Z

@vnlitvinov @devin-petersohn @noloerino I would appreciate some insight regarding broadcast operations. Operations on two 2-dimensional arrays that don't require broadcasting follow fairly directly from Modin's QC code. Things get more complicated when a binary op combines a 2D object with either a 1D object (object A) or a 2D object with only one row (object B). We have to distinguish object A from object B both when displaying and when broadcasting: on add, for example, object A and object B broadcast the same way, but on a dot product they do not. I've included a preliminary approach to this in this PR, but would love folks' insight on it. We're also partially blocked here by #5529.
Some examples:

In [1]: import numpy as np

In [2]: arr = np.array([-1, 0, 1])

In [3]: matrix = np.array([[1, 2, 3], [4, 5, 6]])

In [4]: arr @ matrix
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 arr @ matrix

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)

In [5]: matrix @ arr
Out[5]: array([2, 2])

In [6]: arr * matrix
Out[6]:
array([[-1,  0,  3],
       [-4,  0,  6]])

In [7]: arr = np.array([arr])

In [8]: arr @ matrix
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [8], in <cell line: 1>()
----> 1 arr @ matrix

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)

In [9]: matrix @ arr
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 matrix @ arr

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)

In [10]: arr * matrix
Out[10]:
array([[-1,  0,  3],
       [-4,  0,  6]])

In [11]: arr + matrix
Out[11]:
array([[0, 2, 4],
       [3, 5, 7]])

In [12]: np.array(arr[0]) + matrix
Out[12]:
array([[0, 2, 4],
       [3, 5, 7]])

I guess my two questions are:

1. Is this something we want to tackle now, or should the prototype just assume we're broadcasting with nice shapes (i.e. only 1D objects and 2D objects, and no nested 1D objects)?
2. What do folks think of this approach? For context, this is a modified version of @noloerino's approach here.

As a quick caveat, the dot broadcasting works correctly for the normal Modin QC. In that case we'd only need to figure out the accounting that distinguishes [1, 2, 3] from [[1, 2, 3]], and special-case [[1, 2, 3]] to broadcast: it is stored row-wise, whereas we store a series column-wise and already broadcast correctly when the operand is a series.
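To make the distinction above concrete, here is a minimal sketch in plain NumPy (not Modin code) of how a 1D array and a single-row 2D array agree under elementwise ops but diverge under matmul. The variable names `vec`, `row`, and `row_matmul_failed` are illustrative only.

```python
import numpy as np

vec = np.array([-1, 0, 1])                 # "object A": shape (3,)
row = vec.reshape(1, 3)                    # "object B": shape (1, 3)
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

# Elementwise ops broadcast both forms to (2, 3) with identical values.
same_under_add = np.array_equal(vec + matrix, row + matrix)

# matmul treats the 1D operand as a vector, so (2, 3) @ (3,) is valid.
vec_product = matrix @ vec

# The single-row 2D operand is a real (1, 3) matrix, so (2, 3) @ (1, 3)
# has mismatched core dimensions and raises ValueError.
try:
    matrix @ row
    row_matmul_failed = False
except ValueError:
    row_matmul_failed = True

# Transposing the row to (3, 1) makes the product valid again,
# but the result keeps two dimensions.
row_product = matrix @ row.T
```

Under these assumptions, tracking the number of dimensions alongside the data is enough to tell the two cases apart: the values are the same, only the shape differs.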

noloerino · 2023-01-12T05:03:14Z

It seems like the actual numpy broadcasting rules (here) are not that complicated, especially since our arrays are never going to have more than 2 dimensions. The 1D vs. nested 2D example you give ([1, 2, 3] vs. [[1, 2, 3]]) would be covered by the case of the last dimension matching, as the first array has shape (3,) and the second has shape (1, 3). I imagine we wouldn't support any further nested arrays.
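As a rough illustration of how small the rules become when arrays have at most two dimensions, the sketch below computes the broadcast result shape by hand. The function name `broadcast_shape_2d` is hypothetical and is not part of this PR; it is only meant to mirror NumPy's documented rule (align shapes on the right, and treat a dimension of size 1 as stretchable).

```python
def broadcast_shape_2d(shape_a, shape_b):
    """Return the broadcast shape of two shapes with at most 2 dimensions.

    Hypothetical helper for illustration; raises ValueError when the
    shapes are not broadcastable.
    """
    if len(shape_a) > 2 or len(shape_b) > 2:
        raise ValueError("only 1D and 2D shapes are handled in this sketch")
    ndim = max(len(shape_a), len(shape_b))
    # Align on the rightmost dimension by left-padding with 1s,
    # so (3,) is treated like (1, 3).
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} do not broadcast")
    return tuple(result)


# Both the 1D and the single-row 2D operand broadcast against (2, 3).
shape_from_1d = broadcast_shape_2d((3,), (2, 3))
shape_from_row = broadcast_shape_2d((1, 3), (2, 3))
```

A shape such as (2,) against (2, 3) is rejected, because alignment starts from the last dimension and 2 does not match 3.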

That said, if the pandas query compiler implementation of broadcasting rules isn't working, then it might be worthwhile to just get the simplest cases working from within the numpy frontend prototype, before circling back to make a more robust fix.

YarShev · 2023-01-12T09:12:22Z

@RehanSD, don't we want to have a new hierarchy of the objects for NumPy API (NumPy API, NumpyQC, NumpyOnSmthDataframe, NumpyOnSmthPartitionManager, NumpyOnSmthDataframePartitions)?

RehanSD · 2023-01-12T20:32:18Z

@RehanSD, don't we want to have a new hierarchy of the objects for NumPy API (NumPy API, NumpyQC, NumpyOnSmthDataframe, NumpyOnSmthPartitionManager, NumpyOnSmthDataframePartitions)?

Not right now - for now, we'd like to try an API Layer that can use any backend query compiler, so we can hopefully get a bunch of NumPy functionality just based off of the existing QC functionality we have!

YarShev · 2023-01-12T23:23:36Z

@RehanSD, I think we should add the inner layers anyway as, otherwise, that would be kind of incorrect in terms of the logic. The users would use NumPy API but the storage format is pandas, for instance. It looks confusing.

YarShev · 2023-01-12T23:27:33Z

@RehanSD, if we do not want to add the inner layers right now, it seems to me that we should add some notes regarding the execution/processing to the docs.

Code has been heavily updated since the last review - would appreciate a review on the new code!

modin/numpy/__init__.py

modin/numpy/arr.py

noloerino · 2023-02-06T23:22:08Z

modin/numpy/arr.py

+def check_how_broadcast_to_output(arr_in: "array", arr_out: "array"):
+    if not isinstance(arr_out, array):
+        raise TypeError("return arrays must be of modin.numpy.array type.")
+    if arr_out._ndim == arr_in._ndim and arr_out.shape != arr_in.shape:


This check is inaccurate, since it fails for a broadcastable case like a 2x4 with a 1x4 (unless we're not yet supporting this):

>>> A = numpy.array([[0,1,2,3], [4,5,6,7]])
>>> B = numpy.array([[8,9,10,11]])
>>> A + B
array([[ 8, 10, 12, 14],
       [12, 14, 16, 18]])

This would fall into the broadcasting rules case where the rightmost dimensions of A and B match, and the next dimension of B is 1, so that of A doesn't matter.

Actually never mind, I think I misunderstood the purpose of this function (I thought it was comparing the dimensions of two inputs of binary operators). Based on how this function is called, is it correct to assume that arr_in is always the result of some computation, which may have already broadcast its inputs as necessary?

Yup - this is just to see how we have to broadcast our arr_in to get it to fit in arr_out. (i.e. when out is passed in to some API like add)
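For readers following the thread, the idea of fitting an already-computed result into a caller-supplied `out` array can be sketched in plain NumPy as below. This is not the PR's `check_how_broadcast_to_output`; the helper name `write_result_to_out` is hypothetical, and it leans on `numpy.broadcast_to` to do the shape check.

```python
import numpy as np


def write_result_to_out(result, out):
    """Copy ``result`` into ``out``, stretching ``result`` to ``out.shape``.

    Hypothetical helper for illustration only.
    """
    if not isinstance(out, np.ndarray):
        raise TypeError("return arrays must be of numpy.ndarray type.")
    # broadcast_to raises ValueError when result cannot be stretched
    # to the shape of out (for example (2,) into (2, 3)).
    out[...] = np.broadcast_to(result, out.shape)
    return out


out = np.zeros((2, 3))
write_result_to_out(np.array([1, 2, 3]), out)  # each row of out becomes [1, 2, 3]
```

The direction matters here: only the result is stretched to match `out`, never the other way round, which is why a same-dimension shape mismatch is a reasonable thing for such a check to inspect.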

modin/numpy/arr.py

noloerino · 2023-02-06T23:34:06Z

modin/numpy/arr.py

+    return result
+
+
+def find_common_dtype(dtypes):


Why not use numpy.common_type?

numpy.common_type does not resolve dtypes how we'd like it to. Take the following example:

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame([[1, '2'], [3, '4']])

In [6]: numpy.find_common_type(df.dtypes.values, [])
Out[6]: dtype('O')

In [7]: numpy.array([[1, '2'], [3, '4']]).dtype
Out[7]: dtype('<U21')

The find_common_dtype method I wrote uses numpy.promote_types under the hood, and gives us the correct dtype when our query compiler has mixed types like the df in the example.

And we don't use promote_types directly since it only accepts two types at a time, so this method I wrote is literally just a wrapper that tree reduces on a list of dtypes.
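A wrapper of the kind described above might look roughly like the following. This is a sketch, not the code from the PR: the name `find_common_dtype_sketch` is made up, and the pairwise reduction is just one way to fold `numpy.promote_types` over a list.

```python
import numpy


def find_common_dtype_sketch(dtypes):
    """Reduce a list of dtypes to one dtype with numpy.promote_types.

    Illustrative only; promote_types accepts exactly two types, so the
    list is reduced pairwise, level by level.
    """
    level = [numpy.dtype(d) for d in dtypes]
    if not level:
        raise ValueError("at least one dtype is required")
    while len(level) > 1:
        next_level = [
            numpy.promote_types(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        # An odd element at the end is carried over to the next level.
        if len(level) % 2:
            next_level.append(level[-1])
        level = next_level
    return level[0]


common = find_common_dtype_sketch([numpy.int32, numpy.int64, numpy.float32])
```

With these numeric inputs the first level promotes int32 and int64 to int64, and the second promotes int64 and float32 to float64.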

noloerino · 2023-02-06T23:41:35Z

modin/numpy/arr.py

+    def _get_shape(self):
+        if self._ndim == 1:
+            return (len(self._query_compiler.index),)
+        return (len(self._query_compiler.index), len(self._query_compiler.columns))


Did you put this assert in? I don't see it in this version of the code.

noloerino · 2023-02-06T23:44:28Z

modin/numpy/array_shaping.py

+        ErrorMessage.single_warning(
+            "Array order besides 'C' is not currently supported in Modin. Defaulting to 'C' order."
+        )
+    if hasattr(a, "flatten"):


I believe we discussed using a private variable on modin numpy arrays to check for whether to dispatch to the modin method or default to numpy. Is there a reason we're not taking that approach here, or that we're not defaulting to numpy and instead erroring if there's no flatten attribute?

Same applies for all other wrapper methods in array_shaping, math, etc.

The check here is to ensure we're operating on a modin.numpy.array. Anything that isn't implemented in modin.numpy.array won't be in the modin.numpy namespace, and if someone tries numpy.method(modin.numpy.array), that will go through either __array_function__ or __array_ufunc__. This check is basically trying to assert that the object (a) is of type modin.numpy.array. I'll make this check more explicit.
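The two entry points discussed here can be sketched with a toy class. `SketchArray` and the module-level `flatten` below are hypothetical stand-ins, not Modin's implementation; the only real API relied on is NumPy's `__array_function__` protocol.

```python
import numpy


class SketchArray:
    """Hypothetical stand-in for an array type such as modin.numpy.array."""

    def __init__(self, data):
        self._data = numpy.asarray(data)

    def flatten(self):
        return SketchArray(self._data.flatten())

    def __array_function__(self, func, types, args, kwargs):
        # Calls like numpy.ravel(SketchArray(...)) are routed here by NumPy.
        if func is numpy.ravel:
            return args[0].flatten()
        # Anything unhandled makes NumPy raise TypeError for the caller.
        return NotImplemented


def flatten(a):
    """Namespace-level wrapper with an explicit type check instead of hasattr."""
    if not isinstance(a, SketchArray):
        raise TypeError(f"expected SketchArray, got {type(a).__name__}")
    return a.flatten()


arr = SketchArray([[1, 2], [3, 4]])
via_wrapper = flatten(arr)        # explicit wrapper path
via_numpy = numpy.ravel(arr)      # dispatched through __array_function__
```

An `isinstance` check fails fast on plain NumPy arrays, whereas `hasattr(a, "flatten")` would accept them, since `numpy.ndarray` also has a `flatten` method.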

modin/numpy/constants.py

modin/numpy/math.py

noloerino · 2023-02-06T23:46:52Z

modin/numpy/test/test_array.py

+    numpy_result = numpy_arr.max(initial=0, where=False)
+    assert modin_result == numpy_result
+    with pytest.raises(ValueError):
+        modin_result = modin_arr.max(out=modin_arr, keepdims=True)


CodeQL warning here. I'm assuming this is just checking for the presence of an error, but if that's the case we should remove the assignment to modin_result.

Makes sense - will fix!

modin/numpy/arr.py

modin/numpy/array_creation.py

+"""Module houses array creation methods for Modin's NumPy API."""
+import numpy
+from modin.error_message import ErrorMessage
+from .arr import array


modin/numpy/arr.py

+        out._query_compiler = where.where(result, out)._query_compiler
+        return out
+    elif not where:
+        from .array_creation import zeros_like


modin/numpy/arr.py

+        return array(_query_compiler=new_ufunc(*args, **kwargs), _ndim=out_ndim)
+
+    def __array_function__(self, func, types, args, kwargs):
+        from . import array_creation as creation, array_shaping as shaping, math


modin/numpy/array_shaping.py

+import numpy
+
+from modin.error_message import ErrorMessage
+from .arr import array


modin/numpy/math.py

+
+import numpy
+
+from .arr import array


noloerino · 2023-02-07T00:50:58Z

modin/numpy/arr.py

+    def _get_shape(self):
+        if self._ndim == 1:
+            return (len(self._query_compiler.index),)
+        return (len(self._query_compiler.index), len(self._query_compiler.columns))


sure, that's fine

YarShev · 2023-02-08T13:24:15Z

More or less LGTM. A few unresolved comments are left from my side. Please respond on them.

#5422 (comment)

devin-petersohn

LGTM, thanks @RehanSD !

RehanSD requested a review from a team as a code owner December 12, 2022 19:25

RehanSD changed the title ~~FEAT-XXXX: Add a NumPy API to Modin~~ FEAT-5423: Add a NumPy API to Modin Dec 12, 2022

RehanSD and others added 8 commits December 12, 2022 11:39

FEAT-modin-project#5423: Begin implementing NumPy API Layer

6f2a6d7

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Start

7a4fa99

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Next

2a08cf0

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Added absolute, abs, add, all, subtract to modin.numpy

4b68f50

Signed-off-by: Bill Wang <billiam@ponder.io>

Add changes

0b915b4

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Add shape + reshape

9c7a66b

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Added additional math functions for numpy

1c6d708

Signed-off-by: Bill Wang <billiam@ponder.io>

Add list constructor

30171d2

Signed-off-by: Rehan Durrani <rehan@ponder.io>

RehanSD force-pushed the numpy/init branch from d342e36 to 30171d2 Compare December 12, 2022 19:41

lint

ab0ecdb

Signed-off-by: Rehan Durrani <rehan@ponder.io>

github-advanced-security bot found potential problems Dec 12, 2022

vnlitvinov reviewed Dec 12, 2022

RehanSD marked this pull request as draft December 12, 2022 20:29

RehanSD changed the title ~~FEAT-5423: Add a NumPy API to Modin~~ FEAT-#5423: Add a NumPy API to Modin Dec 17, 2022

RehanSD added 3 commits January 11, 2023 19:15

Add dimension handling

25510bc

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Merge remote-tracking branch 'upstream/master' into numpy/init

4ef400f

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Fix partial broadcasting issues

43e3bb5

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Add testing

4301b9d

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Add tests to CI

5ceca02

Signed-off-by: Rehan Durrani <rehan@ponder.io>

RehanSD marked this pull request as ready for review January 12, 2023 23:39

RehanSD requested a review from vnlitvinov January 12, 2023 23:39

RehanSD added 3 commits February 6, 2023 10:40

Add defensive dimension check

52f0928

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Fix auto-cast issue

cfaa066

Signed-off-by: Rehan Durrani <rehan@ponder.io>

Fix CI bug

f9be32d

Signed-off-by: Rehan Durrani <rehan@ponder.io>

noloerino suggested changes Feb 6, 2023

Address review comments

48967d8

Signed-off-by: Rehan Durrani <rehan@ponder.io>

github-advanced-security bot found potential problems Feb 7, 2023

Fix type computation and add check for where

5b1da61

Signed-off-by: Rehan Durrani <rehan@ponder.io>

noloerino suggested changes Feb 7, 2023

Fix auto broadcast of out variable

74ed3a2

Signed-off-by: Rehan Durrani <rehan@ponder.io>

RehanSD requested a review from noloerino February 7, 2023 00:58

noloerino previously approved these changes Feb 7, 2023

YarShev reviewed Feb 7, 2023

Address review comments (break up testing into multiple files, and fix formatting issues)

3244b79

Signed-off-by: Rehan Durrani <rehan@ponder.io>

RehanSD dismissed noloerino’s stale review via 3244b79 February 7, 2023 23:11

Fix naming

a3d57fe

Signed-off-by: Rehan Durrani <rehan@ponder.io>

YarShev reviewed Feb 8, 2023

Add to_numpy

18588b8

Signed-off-by: Rehan Durrani <rehan@ponder.io>

github-advanced-security bot found potential problems Feb 8, 2023

Add warning about numpy api and fix lint

19acf64

Signed-off-by: Rehan Durrani <rehan@ponder.io>

RehanSD requested review from devin-petersohn, noloerino and YarShev February 8, 2023 20:29

Fix lint

1bf6c00

Signed-off-by: Rehan Durrani <rehan@ponder.io>

noloerino approved these changes Feb 8, 2023

devin-petersohn approved these changes Feb 9, 2023

YarShev reviewed Feb 9, 2023

YarShev approved these changes Feb 9, 2023

YarShev merged commit e73fb4f into modin-project:master Feb 9, 2023
