FEAT-#4931: Create a query compiler that can connect to a service #4932

devin-petersohn · 2022-09-06T14:20:37Z

Signed-off-by: Devin Petersohn devin.petersohn@gmail.com

What do these changes do?

commit message follows format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Create a Query Compiler that can connect to a service #4931
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date
added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

codecov · 2022-09-06T20:06:11Z

Codecov Report

Merging #4932 (aa31871) into master (0a2c0de) will decrease coverage by 16.27%.
The diff coverage is 36.48%.

❗ Current head aa31871 differs from pull request most recent head 6477955. Consider uploading reports for the commit 6477955 to get more accurate results

@@             Coverage Diff             @@
##           master    #4932       +/-   ##
===========================================
- Coverage   84.98%   68.71%   -16.28%     
===========================================
  Files         253      256        +3     
  Lines       19113    19841      +728     
===========================================
- Hits        16243    13633     -2610     
- Misses       2870     6208     +3338

Impacted Files	Coverage Δ
modin/core/execution/client/io.py	`0.00% <0.00%> (ø)`
modin/core/execution/client/query_compiler.py	`36.06% <36.06%> (ø)`
modin/config/envvars.py	`78.77% <66.66%> (-6.54%)`	⬇️
modin/pandas/indexing.py	`90.45% <66.66%> (-0.97%)`	⬇️
.../core/execution/dispatching/factories/factories.py	`86.66% <71.42%> (-1.24%)`	⬇️
modin/pandas/base.py	`95.06% <93.75%> (-0.25%)`	⬇️
modin/pandas/series.py	`93.79% <100.00%> (-0.24%)`	⬇️
...odin/experimental/core/storage_formats/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
...din/experimental/core/execution/native/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
.../experimental/core/storage_formats/hdk/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
... and 64 more

vnlitvinov · 2022-09-07T06:01:14Z

@devin-petersohn @pyrito CI is still red

lgtm-com · 2022-09-16T03:28:53Z

This pull request introduces 1 alert when merging b21b1fd into 170e5de - view on LGTM.com

new alerts:

1 for Unused import

pyrito · 2022-09-16T13:51:10Z

modin/core/execution/client/io.py

+            if filepath_or_buffer.startswith("file://"):
+                # We will do this so that the backend can know whether this
+                # is a path or a URL.
+                filepath_or_buffer = filepath_or_buffer[7:]


You can keep the result of fsspec.open(filepath_or_buffer) and read its path attribute to get the location within the scheme. That is probably the cleaner solution here.

Instead of dealing with this here, why not have the server handle fsspec paths?

@pyrito for local paths, we don't want to include the file://, but for non-local paths like s3:// we do want the prefix
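
The rule described in this thread (drop file:// for local paths, keep the prefix for remote ones such as s3://) could be sketched with plain string handling. The helper name normalize_path_for_backend is hypothetical and does not appear in this PR; this is only an illustration of the rule, not the PR's implementation.

```python
LOCAL_SCHEME = "file://"


def normalize_path_for_backend(filepath_or_buffer):
    """Strip the scheme from local paths; leave remote URLs untouched.

    Hypothetical helper: non-string inputs (e.g. open buffers) pass through.
    """
    if not isinstance(filepath_or_buffer, str):
        return filepath_or_buffer
    if filepath_or_buffer.startswith(LOCAL_SCHEME):
        # "file:///tmp/data.csv" becomes "/tmp/data.csv"
        return filepath_or_buffer[len(LOCAL_SCHEME):]
    # "s3://bucket/key.csv" keeps its prefix so the backend can tell it is a URL.
    return filepath_or_buffer
```

Using len(LOCAL_SCHEME) instead of the literal 7 keeps the slice correct if the prefix constant ever changes.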

lgtm-com · 2022-09-16T17:50:32Z

This pull request introduces 1 alert when merging 8ba13c7 into f727c04 - view on LGTM.com

new alerts:

1 for Unused import

pyrito

This is looking really good, @mvashishtha. Thanks for all the hard work on this. I've added a couple of small nit fixes and asked a couple of questions.

pyrito · 2022-10-29T18:31:40Z

modin/core/execution/client/container.py

+def _set_forwarding_groupby_method(method_name: str):
+    """
+    Define a groupby method that forwards arguments to an inner query compiler.
+
+    Parameters
+    ----------
+    method_name : str
+    """
+
+    def forwarding_method(self, id, by_is_qc, by, *args, **kwargs):
+        if by_is_qc:
+            by = self._qc[by]
+        new_id = self._generate_id()
+        self._qc[new_id] = getattr(self._qc[id], method_name)(by, *args, **kwargs)
+        return new_id
+
+    setattr(ForwardingQueryCompilerContainer, method_name, forwarding_method)


Just wanted to say that this is a super clever way of forwarding methods using getattr and setattr.

+1. Saves many lines of code!
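
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of method forwarding with getattr and setattr. FakeCompiler and Container are stand-ins invented for this example; they are not Modin classes, and the real container in the PR takes additional arguments such as by_is_qc.

```python
import uuid


class FakeCompiler:
    """Stand-in for a query compiler; not a Modin class."""

    def __init__(self, values):
        self.values = list(values)

    def head(self, n):
        return FakeCompiler(self.values[:n])

    def tail(self, n):
        return FakeCompiler(self.values[-n:])


class Container:
    """Stores compilers under generated IDs and forwards calls to them."""

    def __init__(self):
        self._qc = {}

    def _generate_id(self):
        return uuid.uuid4()

    def add(self, compiler):
        new_id = self._generate_id()
        self._qc[new_id] = compiler
        return new_id


def _set_forwarding_method(method_name):
    """Attach a method to Container that forwards to the inner compiler."""

    def forwarding_method(self, id, *args, **kwargs):
        new_id = self._generate_id()
        # Look the method up by name on the stored compiler and call it.
        self._qc[new_id] = getattr(self._qc[id], method_name)(*args, **kwargs)
        return new_id

    setattr(Container, method_name, forwarding_method)


# One loop replaces a hand-written wrapper per method.
for _name in ("head", "tail"):
    _set_forwarding_method(_name)
```

Because method_name is a parameter of the factory function, each generated wrapper closes over its own name, which avoids the usual late-binding pitfall of defining closures directly inside a loop.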

helmeleegy · 2022-11-01T01:08:55Z

LGTM overall! Thanks, Mahesh, for all the hard work!

lgtm-com · 2022-11-01T23:58:29Z

This pull request introduces 1 alert when merging b28b83d4f7c18865d39ca64200cc1e5ae5aca5de into a93399c - view on LGTM.com

new alerts:

1 for Wrong number of arguments in a call

anmyachev · 2022-11-10T14:22:53Z

modin/experimental/core/storage_formats/hdk/query_compiler.py

@@ -687,7 +691,7 @@ def reset_index(self, **kwargs):
            self._modin_frame.reset_index(drop), shape_hint=shape_hint
        )

-    def astype(self, col_dtypes, **kwargs):
+    def astype(self, col_dtypes, errors: str):


Why has it changed?

The way the new client query compiler converts types, we can't tell whether things will error until we actually try to cast types within the query compiler. So the query compiler has to know what to do with errors. The other query compilers, including base, pandas, and HDK, can ignore the errors. I did the same with drop.

I would rather this were extracted into a separate PR.

AndreyPavlenko · 2022-11-10T21:40:19Z

modin/core/execution/client/io.py

+        cls._data_conn = conn
+
+    @classmethod
+    def read_csv(cls, filepath_or_buffer, **kwargs):


If my understanding is correct, this is a network service that can read arbitrary files in the file system. That looks like a security issue. I think absolute file paths must be restricted here: only paths relative to user-defined directories must be allowed.

@AndreyPavlenko that's a good question to raise. In my opinion, this implementation doesn't directly imply a network service (any query compiler can be used in the service model). Restricting which files can be read is perhaps out of scope for this PR, but it is something we should consider in the future.

Sure, I don't insist on implementing it in this PR. I think some configuration capability could be added to restrict arbitrary file reading.
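
One way such a restriction might look is sketched below. The function name resolve_within_allowed_roots and the idea of a configurable list of allowed directories are assumptions made for illustration; nothing like this exists in the PR.

```python
import os


def resolve_within_allowed_roots(path, allowed_roots):
    """Return the real path if it lies under one of allowed_roots, else raise.

    Hypothetical helper: allowed_roots would come from some user-defined
    configuration that this PR does not provide.
    """
    # realpath collapses ".." segments and follows symlinks before the check.
    real = os.path.realpath(path)
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        # commonpath compares whole path components, so "/data2" does not
        # count as being inside "/data".
        if os.path.commonpath([real, real_root]) == real_root:
            return real
    raise PermissionError(f"{path!r} is outside the allowed directories")
```

Resolving the path before comparing is the important step: a plain startswith check on the raw string would accept inputs such as "/data/../etc/passwd".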

vnlitvinov

I went through some of the files and still have to cover the others, but I would rather publish part of the feedback early.

vnlitvinov · 2022-11-11T16:33:59Z

modin/config/envvars.py



 class StorageFormat(EnvironmentVariable, type=str):
    """Engine to run on a single node of distribution."""

    varname = "MODIN_STORAGE_FORMAT"
    default = "Pandas"
-    choices = ("Pandas", "Hdk", "Pyarrow", "Cudf")
+    choices = ("Pandas", "Hdk", "Pyarrow", "Cudf", "")


I would rather we not use an empty value, as it's impossible in some shells to even set such a variable.

vnlitvinov · 2022-11-11T17:37:22Z

modin/core/execution/client/container.py

+    class DefaultToPandasResult(NamedTuple):
+        """
+        The result of ``default_to_pandas``.
+
+        Parameters
+        ----------
+        result : Any
+            The result of the operation.
+        result_is_qc_id : bool
+            Whether the result is a query compiler ID.
+        """
+
+        result: Any
+        result_is_qc_id: bool


is there a reason for it to be enclosed in a class?

why not @dataclass?
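
To make the two options in this comment concrete, here is a small sketch of the same result type written as a module-level NamedTuple and as a frozen dataclass. The name DefaultToPandasResultDC is made up for the comparison; only DefaultToPandasResult and its two fields come from the diff above.

```python
from dataclasses import dataclass
from typing import Any, NamedTuple


class DefaultToPandasResult(NamedTuple):
    """Module-level NamedTuple: immutable and unpackable like a tuple."""

    result: Any
    result_is_qc_id: bool


@dataclass(frozen=True)
class DefaultToPandasResultDC:
    """Frozen dataclass alternative: attribute access only, no unpacking."""

    result: Any
    result_is_qc_id: bool
```

The practical difference is that the NamedTuple still behaves as a tuple (it supports unpacking and compares equal to a plain tuple), while the dataclass is only accessed by attribute name.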

vnlitvinov · 2022-11-11T17:38:37Z

modin/core/execution/client/container.py

+        result_is_qc_id = isinstance(result, self._query_compiler_class)
+        if result_is_qc_id:
+            new_id = self._generate_id()
+            self._qc[new_id] = result


When are compilers removed? As far as I can see, this is a permanent memory leak right now.
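
One possible remedy, sketched under the assumption that the client can tell the container when it no longer needs an ID, is an explicit release operation. CompilerRegistry and its release method are hypothetical names; the PR as shown has no equivalent.

```python
import uuid


class CompilerRegistry:
    """Maps generated IDs to compilers and lets callers drop them again."""

    def __init__(self):
        self._qc = {}

    def register(self, compiler):
        new_id = uuid.uuid4()
        self._qc[new_id] = compiler
        return new_id

    def release(self, id):
        # pop with a default makes release idempotent: a second call is a no-op.
        self._qc.pop(id, None)

    def __len__(self):
        return len(self._qc)
```

A client-side wrapper could call release when its own object is garbage collected, but how that would be wired up is left open here.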

vnlitvinov · 2022-11-11T17:40:52Z

modin/core/execution/client/container.py

+        id,
+        to_replace_is_qc: bool,
+        regex_is_qc: bool,
+        to_replace,
+        value,
+        inplace,
+        limit,
+        regex,
+        method,


Y u no type hints?

vnlitvinov · 2022-11-11T17:47:57Z

modin/core/execution/client/container.py

+        if by_is_qc:
+            by = self._qc[by]


I see this all the time here... is there any reason to pass an explicit bool instead of doing stuff like

if isinstance(by, UUID): by = self._qc[by]

?
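
The alternative proposed here could look like the sketch below. It assumes that compiler IDs are uuid.UUID instances and that user-supplied arguments are never UUIDs themselves; neither assumption is confirmed by the diff. resolve_argument is a hypothetical helper name.

```python
from uuid import UUID, uuid4


def resolve_argument(store, value):
    """Swap a UUID key for the stored object; return other values unchanged."""
    if isinstance(value, UUID):
        return store[value]
    return value
```

If IDs were sent across the service boundary as strings, the type check could no longer tell an ID from an ordinary string label, which may be one reason to keep an explicit flag such as by_is_qc.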

vnlitvinov · 2022-11-11T18:02:07Z

modin/core/io/file_dispatcher.py

+                },
+                kwargs.get("errors", "raise"),


this feels like a sneak fix for something unrelated to the PR

vnlitvinov · 2022-11-11T18:03:12Z

modin/core/storage_formats/base/query_compiler.py

+    @doc_utils.doc_binary_method(
+        operation="multiplication", sign="*", self_on_right=True
+    )
+    def rmul(self, other, **kwargs):  # noqa: PR02
+        return BinaryDefault.register(pandas.DataFrame.rmul)(
+            self, other=other, **kwargs
+        )
+


sneak feature, unrelated to the PR?

vnlitvinov · 2022-11-11T18:03:27Z

modin/core/storage_formats/base/query_compiler.py

-    def astype(self, col_dtypes, **kwargs):  # noqa: PR02
+    def astype(self, col_dtypes, errors: str):
        """
        Convert columns dtypes to given dtypes.

        Parameters
        ----------
        col_dtypes : dict
            Map for column names and new dtypes.
-        **kwargs : dict
-            Serves the compatibility purpose. Does not affect the result.
+        errors : {"raise", "ignore"}
+            Control raising of exceptions on invalid data for provided dtype.

        Returns
        -------
        BaseQueryCompiler
            New QueryCompiler with updated dtypes.
        """
        return DataFrameDefault.register(pandas.DataFrame.astype)(
-            self, dtype=col_dtypes, **kwargs
+            self, dtype=col_dtypes, errors=errors


sneak fix 🙃 here and below

vnlitvinov · 2022-11-11T18:04:21Z

modin/core/storage_formats/base/query_compiler.py

+    def take_2d_labels(
+        self,
+        index,
+        columns,
+    ):


why is this function added in this PR?

vnlitvinov · 2022-11-11T18:06:07Z

modin/experimental/core/storage_formats/hdk/query_compiler.py

@@ -687,7 +691,7 @@ def reset_index(self, **kwargs):
            self._modin_frame.reset_index(drop), shape_hint=shape_hint
        )

-    def astype(self, col_dtypes, **kwargs):
+    def astype(self, col_dtypes, errors: str):


I would rather this were extracted into a separate PR.

vnlitvinov · 2022-11-15T16:25:42Z

modin/pandas/base.py

+                # In case of lazy execution we should bypass these error checking components
+                # because they can force the materialization of the row or column labels.
+                if self._query_compiler.lazy_execution:
+                    continue


will we raise errors later on then?

vnlitvinov · 2022-11-15T16:27:02Z

modin/pandas/indexing.py

+
+        return self.qc.take_2d(row_lookup, col_lookup)
+
+    def _get_pandas_object_from_qc_view(


why is this refactored here? seems like mixing different changes in single PR

vnlitvinov · 2022-11-15T16:27:20Z

modin/pandas/indexing.py

+        # If not every element of the key is a scalar, e.g. the key is
+        # (slice(None), 0), then the key isn't a full key-lookup, and the
+        # entire key behaves more like a slice than like a scalar.
+        return (
+            isinstance(key, tuple)
+            and len(key) == len(multiindex.levels)
+            and all(is_scalar(k) for k in key)
+        )


this feels like a sneak fix of a bug unrelated to this PR, let's not mix things

vnlitvinov · 2022-11-15T16:29:34Z

modin/pandas/test/dataframe/test_indexing.py

+    condition=get_current_execution() == "Client",
+    reason=(
+        "client query compiler uses lazy execution, so we don't default "
+        + "to pandas for the empty frame because we don't check whether the frame is empty. we can't do the insertion correctly right now without defaulting to pandas."


Suggested change

-        + "to pandas for the empty frame because we don't check whether the frame is empty. we can't do the insertion correctly right now without defaulting to pandas."
+        + "to pandas for the empty frame because we don't check whether the frame is empty. "
+        + "We can't do the insertion correctly right now without defaulting to pandas."

vnlitvinov · 2022-11-15T16:31:49Z

modin/pandas/test/dataframe/test_map_metadata.py

+    eval_general(
+        modin_simple,
+        simple,
+        lambda df: df.drop(5),
+        check_exception_type=check_exception_type,
+    )
+    eval_general(
+        modin_simple,
+        simple,
+        lambda df: df.drop("C", axis=1),
+        check_exception_type=check_exception_type,
+    )
+    eval_general(
+        modin_simple,
+        simple,
+        lambda df: df.drop([1, 5], axis=1),
+        check_exception_type=check_exception_type,
+    )
+    eval_general(
+        modin_simple,
+        simple,
+        lambda df: df.drop(["A", "C"], axis=1),
+        check_exception_type=check_exception_type,
+    )


Suggested change

-    eval_general(
-        modin_simple,
-        simple,
-        lambda df: df.drop(5),
-        check_exception_type=check_exception_type,
-    )
-    eval_general(
-        modin_simple,
-        simple,
-        lambda df: df.drop("C", axis=1),
-        check_exception_type=check_exception_type,
-    )
-    eval_general(
-        modin_simple,
-        simple,
-        lambda df: df.drop([1, 5], axis=1),
-        check_exception_type=check_exception_type,
-    )
-    eval_general(
-        modin_simple,
-        simple,
-        lambda df: df.drop(["A", "C"], axis=1),
-        check_exception_type=check_exception_type,
-    )
+    for func in [lambda df: df.drop(5), lambda df: df.drop("C", axis=1), lambda df: df.drop([1, 5], axis=1), lambda df: df.drop(["A", "C"], axis=1)]:
+        eval_general(
+            modin_simple,
+            simple,
+            func,
+            check_exception_type=check_exception_type,
+        )

probably needs black-formatting, though

vnlitvinov · 2022-11-15T16:32:08Z

modin/pandas/test/dataframe/test_map_metadata.py

-    # errors = 'ignore'
+    # test errors = 'ignore'


vnlitvinov · 2022-11-23T11:57:02Z

@devin-petersohn @mvashishtha do we still need this PR? I think it can be closed for now.

devin-petersohn requested a review from a team as a code owner September 6, 2022 14:20

pyrito reviewed Sep 16, 2022

pyrito force-pushed the service/init branch 3 times, most recently from 9a13b94 to d74cdf0 Compare September 23, 2022 17:40

mvashishtha reviewed Sep 26, 2022

mvashishtha force-pushed the service/init branch 2 times, most recently from dfce918 to a4ccc7e Compare October 29, 2022 02:28

mvashishtha requested a review from a team as a code owner October 29, 2022 03:52

mvashishtha force-pushed the service/init branch from c467ec2 to 8d19d4b Compare October 29, 2022 05:38

pyrito reviewed Oct 29, 2022

helmeleegy reviewed Nov 1, 2022

mvashishtha force-pushed the service/init branch from 48ca937 to 93e066b Compare November 8, 2022 19:22

devin-petersohn and others added 5 commits November 8, 2022 16:32

FEAT-modin-project#4931: Create a query compiler that can connect to …

1392493

…a service Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Fixes to pass CI + docs for io.py

3797403

Update implementation

dd0e7a5

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Fix some things

026a91c

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Lint fixes

ea0ac1d

mvashishtha and others added 13 commits November 8, 2022 16:32

Apply suggestions from code review

8826c54

Co-authored-by: Karthik Velayutham <karthik.velayutham@gmail.com>

Address comments.

584ef10

Signed-off-by: mvashishtha <mahesh@ponder.io>

Respond to comments.

1871154

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix fuzzydata by making getitem_row_array use numeric=True everywhere.

1707390

Signed-off-by: mvashishtha <mahesh@ponder.io>

Pass errors through astype.

e7af275

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix astype errors.

edf99f8

Signed-off-by: mvashishtha <mahesh@ponder.io>

Use new take_2d_labels for most insertion. test_indexing passes excep…

e74db7c

…t one multiindexing case Signed-off-by: mvashishtha <mahesh@ponder.io>

Actually use client query compiler.

b616c28

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix multiindex and fix doc_checker.

832556d

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix IO astype bug.

51fd254

Signed-off-by: mvashishtha <mahesh@ponder.io>

Make ClientIO use ClientQueryCompiler by default.

ebe2719

Signed-off-by: mvashishtha <mahesh@ponder.io>

Debug read_sql.

f205801

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix getitem_row_array.

92ad6dd

Signed-off-by: mvashishtha <mahesh@ponder.io>

mvashishtha force-pushed the service/init branch from f7c8a91 to 92ad6dd Compare November 8, 2022 22:32

mvashishtha added 5 commits November 9, 2022 10:14

Fix black and flake8, and add a comment.

1d7e494

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix getitem_row_array.

8be834a

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix getitem_row_array again.

61a7aad

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix bugs that showed up in CI.

07ad5c5

Signed-off-by: mvashishtha <mahesh@ponder.io>

Fix a multiindex Client bug, and fix an hdk astype bug.

38ef127

Signed-off-by: mvashishtha <mahesh@ponder.io>

anmyachev reviewed Nov 10, 2022

AndreyPavlenko reviewed Nov 10, 2022

Respond to comments.

ead877e

Signed-off-by: mvashishtha <mahesh@ponder.io>

mvashishtha approved these changes Nov 11, 2022

vnlitvinov reviewed Nov 11, 2022

vnlitvinov reviewed Nov 15, 2022

mvashishtha marked this pull request as draft November 16, 2022 17:23

mvashishtha closed this Nov 23, 2022

mvashishtha mentioned this pull request Dec 2, 2022

REFACTOR: Get rid of some hasattr(obj, "columns") checks #5310

Closed

