FEAT-#4931: Create a query compiler that can connect to a service #4932

Closed

Conversation

devin-petersohn
Collaborator

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Create a Query Compiler that can connect to a service #4931
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@devin-petersohn devin-petersohn requested a review from a team as a code owner September 6, 2022 14:20
@codecov

codecov bot commented Sep 6, 2022

Codecov Report

Merging #4932 (aa31871) into master (0a2c0de) will decrease coverage by 16.27%.
The diff coverage is 36.48%.

❗ Current head aa31871 differs from pull request most recent head 6477955. Consider uploading reports for the commit 6477955 to get more accurate results

@@             Coverage Diff             @@
##           master    #4932       +/-   ##
===========================================
- Coverage   84.98%   68.71%   -16.28%     
===========================================
  Files         253      256        +3     
  Lines       19113    19841      +728     
===========================================
- Hits        16243    13633     -2610     
- Misses       2870     6208     +3338     
Impacted Files Coverage Δ
modin/core/execution/client/io.py 0.00% <0.00%> (ø)
modin/core/execution/client/query_compiler.py 36.06% <36.06%> (ø)
modin/config/envvars.py 78.77% <66.66%> (-6.54%) ⬇️
modin/pandas/indexing.py 90.45% <66.66%> (-0.97%) ⬇️
.../core/execution/dispatching/factories/factories.py 86.66% <71.42%> (-1.24%) ⬇️
modin/pandas/base.py 95.06% <93.75%> (-0.25%) ⬇️
modin/pandas/series.py 93.79% <100.00%> (-0.24%) ⬇️
...odin/experimental/core/storage_formats/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...din/experimental/core/execution/native/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
.../experimental/core/storage_formats/hdk/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
... and 64 more

@vnlitvinov
Collaborator

vnlitvinov commented Sep 7, 2022

@devin-petersohn @pyrito CI is still red

@lgtm-com

lgtm-com bot commented Sep 16, 2022

This pull request introduces 1 alert when merging b21b1fd into 170e5de - view on LGTM.com

new alerts:

  • 1 for Unused import

if filepath_or_buffer.startswith("file://"):
    # We will do this so that the backend can know whether this
    # is a path or a URL.
    filepath_or_buffer = filepath_or_buffer[7:]
Collaborator

You can save the result of fsspec.open(filepath_or_buffer) and use its path attribute to get the location within the scheme. That is probably the cleaner solution here.

Collaborator

Instead of dealing with this here, why not have the server handle fsspec paths?

Collaborator

@pyrito for local paths, we don't want to include the file://, but for non-local paths like s3:// we do want the prefix
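
The rule described in this thread (drop `file://` for local paths, keep prefixes such as `s3://`) can be sketched in a few lines. This is only an illustration: `strip_local_scheme` and `LOCAL_PREFIX` are hypothetical names, not part of the PR.

```python
LOCAL_PREFIX = "file://"  # hypothetical constant, used only in this sketch


def strip_local_scheme(filepath: str) -> str:
    """Drop a leading file:// prefix; leave every other path or URL untouched."""
    if filepath.startswith(LOCAL_PREFIX):
        # Local file: the backend should receive a plain filesystem path.
        return filepath[len(LOCAL_PREFIX):]
    # Remote URLs (s3://, https://, ...) and plain paths pass through as-is.
    return filepath


print(strip_local_scheme("file:///tmp/data.csv"))
print(strip_local_scheme("s3://bucket/data.csv"))
```

Using `len(LOCAL_PREFIX)` instead of a literal `7` keeps the slice in sync with the prefix if it ever changes.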

@lgtm-com

lgtm-com bot commented Sep 16, 2022

This pull request introduces 1 alert when merging 8ba13c7 into f727c04 - view on LGTM.com

new alerts:

  • 1 for Unused import

@pyrito pyrito force-pushed the service/init branch 3 times, most recently from 9a13b94 to d74cdf0 Compare September 23, 2022 17:40
Collaborator

@pyrito pyrito left a comment

This is looking really good @mvashishtha . Thanks for all the hard work on this. I've added a couple of small nit fixes and asked a couple questions.

Comment on lines +540 to +556
def _set_forwarding_groupby_method(method_name: str):
    """
    Define a groupby method that forwards arguments to an inner query compiler.

    Parameters
    ----------
    method_name : str
    """

    def forwarding_method(self, id, by_is_qc, by, *args, **kwargs):
        if by_is_qc:
            by = self._qc[by]
        new_id = self._generate_id()
        self._qc[new_id] = getattr(self._qc[id], method_name)(by, *args, **kwargs)
        return new_id

    setattr(ForwardingQueryCompilerContainer, method_name, forwarding_method)
Collaborator

Just wanted to say that this is a super clever way of forwarding methods using getattr and setattr.

Collaborator

@helmeleegy helmeleegy Nov 1, 2022

+1. Saves many lines of code!
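
For readers unfamiliar with the pattern praised above, here is a minimal, self-contained sketch of forwarding methods with `getattr` and `setattr`. The `Inner` and `Container` classes are hypothetical stand-ins for the query compiler and its container; they are not the PR's classes.

```python
import uuid


class Inner:
    """Hypothetical stand-in for an inner query compiler."""

    def __init__(self, values):
        self.values = list(values)

    def doubled(self):
        return Inner(v * 2 for v in self.values)

    def plus(self, n):
        return Inner(v + n for v in self.values)


class Container:
    """Holds inner objects by ID and hands out IDs instead of objects."""

    def __init__(self):
        self._objects = {}

    def add(self, obj):
        new_id = uuid.uuid4()
        self._objects[new_id] = obj
        return new_id

    def get(self, obj_id):
        return self._objects[obj_id]


def _set_forwarding_method(method_name):
    def forwarding_method(self, obj_id, *args, **kwargs):
        # Look up the inner object, call the same-named method on it,
        # store the result, and return only the new ID.
        result = getattr(self._objects[obj_id], method_name)(*args, **kwargs)
        return self.add(result)

    forwarding_method.__name__ = method_name
    setattr(Container, method_name, forwarding_method)


# One loop replaces a hand-written wrapper per method.
for _name in ("doubled", "plus"):
    _set_forwarding_method(_name)

container = Container()
first = container.add(Inner([1, 2, 3]))
second = container.plus(container.doubled(first), 1)
print(container.get(second).values)
```

Because `method_name` is a parameter of the outer function, each generated wrapper closes over its own name, which avoids the usual late-binding pitfall of defining closures directly inside a loop.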

@helmeleegy
Collaborator

LGTM overall! Thanks, Mahesh for all the hard work!

@lgtm-com

lgtm-com bot commented Nov 1, 2022

This pull request introduces 1 alert when merging b28b83d4f7c18865d39ca64200cc1e5ae5aca5de into a93399c - view on LGTM.com

new alerts:

  • 1 for Wrong number of arguments in a call

devin-petersohn and others added 5 commits November 8, 2022 16:32
…a service

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
mvashishtha and others added 13 commits November 8, 2022 16:32
Co-authored-by: Karthik Velayutham <karthik.velayutham@gmail.com>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
…t one multiindexing case

Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
mvashishtha added 5 commits November 9, 2022 10:14
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
@@ -687,7 +691,7 @@ def reset_index(self, **kwargs):
self._modin_frame.reset_index(drop), shape_hint=shape_hint
)

-    def astype(self, col_dtypes, **kwargs):
+    def astype(self, col_dtypes, errors: str):
Collaborator

Why has it changed?

Collaborator

Because of the way the new client query compiler converts types, we can't tell whether things will error until we actually try to cast types within the query compiler. So the query compiler has to know what to do with errors. The other query compilers, including base, pandas, and HDK, can ignore the errors. I did the same with drop.

Collaborator

I would rather this be extracted to a separate PR.

cls._data_conn = conn

@classmethod
def read_csv(cls, filepath_or_buffer, **kwargs):
Collaborator

If my understanding is correct, this is a network service that can read an arbitrary file in the file system. That seems like a security issue. I think absolute file paths must be restricted here: only paths relative to user-defined directories must be allowed.

Collaborator Author

@AndreyPavlenko that's a good question to raise. This implementation doesn't directly imply a network service in my opinion (any query compiler can be used in the service model). Restrictions on what types of files can be used are perhaps out of scope for this PR, but something we should potentially consider in the future.

Collaborator

Sure, I don't insist on implementing it in this PR. I think some configuration capability could be added to restrict arbitrary file reading.
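
As a rough illustration of the restriction discussed in this thread, the check could look something like the sketch below. It assumes a configured list of allowed root directories; `resolve_allowed_path` and `allowed_roots` are hypothetical names and nothing like this exists in the PR.

```python
from pathlib import Path


def resolve_allowed_path(user_path, allowed_roots):
    """Resolve a relative path against the allowed roots, or refuse it."""
    candidate = Path(user_path)
    if candidate.is_absolute():
        raise PermissionError(f"absolute paths are not allowed: {user_path}")
    for root in allowed_roots:
        root = Path(root).resolve()
        # resolve() collapses ".." segments and follows symlinks, so a path
        # that escapes the root no longer sits underneath it afterwards.
        resolved = (root / candidate).resolve()
        try:
            resolved.relative_to(root)
        except ValueError:
            continue
        return resolved
    raise PermissionError(f"path is outside the allowed directories: {user_path}")
```

Resolving before comparing matters: a purely textual prefix check would accept something like `data/../../etc/passwd`.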

Signed-off-by: mvashishtha <mahesh@ponder.io>
Collaborator

@vnlitvinov vnlitvinov left a comment

I went through some of the files and still have to cover the others, but I would rather publish part of the feedback early.



 class StorageFormat(EnvironmentVariable, type=str):
     """Engine to run on a single node of distribution."""

     varname = "MODIN_STORAGE_FORMAT"
     default = "Pandas"
-    choices = ("Pandas", "Hdk", "Pyarrow", "Cudf")
+    choices = ("Pandas", "Hdk", "Pyarrow", "Cudf", "")
Collaborator

I would rather we not use an empty value, as it's impossible in some shells to even set such a variable.

Comment on lines +89 to +102
class DefaultToPandasResult(NamedTuple):
    """
    The result of ``default_to_pandas``.

    Parameters
    ----------
    result : Any
        The result of the operation.
    result_is_qc_id : bool
        Whether the result is a query compiler ID.
    """

    result: Any
    result_is_qc_id: bool
Collaborator

  1. is there a reason for it to be enclosed in a class?
  2. why not @dataclass?
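
For comparison, the `@dataclass` alternative raised in the second question might look like the following. This is a sketch of the reviewer's suggestion, not code from the PR; the choice of `frozen=True` is an assumption made to mirror the immutability of a `NamedTuple`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DefaultToPandasResult:
    """The result of ``default_to_pandas``."""

    result: Any  # the result of the operation
    result_is_qc_id: bool  # whether ``result`` is a query compiler ID


outcome = DefaultToPandasResult(result=42, result_is_qc_id=False)
print(outcome)
```

One practical difference: a `NamedTuple` supports tuple unpacking (`result, is_id = outcome`), whereas a dataclass instance does not unless it is converted first.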

result_is_qc_id = isinstance(result, self._query_compiler_class)
if result_is_qc_id:
    new_id = self._generate_id()
    self._qc[new_id] = result
Collaborator

When are compilers removed? As far as I can see, this is a permanent memory leak right now.
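
One way the leak described above could be addressed is to give the container an explicit release operation and tie it to the lifetime of the client-side handle. The sketch below is a hypothetical design, not something proposed in the PR: `Registry`, `Handle`, and `release` are illustrative names.

```python
import uuid
import weakref


class Registry:
    """Hypothetical ID-to-object store with an explicit release operation."""

    def __init__(self):
        self._objects = {}

    def add(self, obj):
        obj_id = uuid.uuid4()
        self._objects[obj_id] = obj
        return obj_id

    def release(self, obj_id):
        # pop() with a default makes releasing the same ID twice harmless.
        self._objects.pop(obj_id, None)

    def __len__(self):
        return len(self._objects)


class Handle:
    """Client-side handle that releases its registry entry when it goes away."""

    def __init__(self, registry, obj_id):
        self.obj_id = obj_id
        # The finalizer runs at most once: on close() or on garbage collection.
        self._finalizer = weakref.finalize(self, registry.release, obj_id)

    def close(self):
        self._finalizer()


registry = Registry()
handle = Handle(registry, registry.add(object()))
print(len(registry))
handle.close()
print(len(registry))
```

In a real client/service split the release call would have to travel over the connection, so garbage-collection timing on the client would determine when memory is freed on the service.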

Comment on lines +305 to +313
id,
to_replace_is_qc: bool,
regex_is_qc: bool,
to_replace,
value,
inplace,
limit,
regex,
method,
Collaborator

Y u no type hints?

Comment on lines +550 to +551
if by_is_qc:
    by = self._qc[by]
Collaborator

I see this all the time here... is there any reason to pass an explicit bool instead of doing stuff like

if isinstance(by, UUID):
    by = self._qc[by]

?
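
The `isinstance` approach suggested here can be written as a small helper. This is a sketch under the assumption that query compiler IDs are `uuid.UUID` instances, as the reviewer's snippet implies; `resolve_argument` is a hypothetical name.

```python
from uuid import UUID, uuid4


def resolve_argument(arg, registry):
    """Replace an ID with the stored object; return any other value unchanged."""
    if isinstance(arg, UUID):
        return registry[arg]
    return arg


stored_id = uuid4()
registry = {stored_id: "stored query compiler"}
print(resolve_argument(stored_id, registry))
print(resolve_argument("column_name", registry))
```

One trade-off to weigh: a `UUID` passed as an ordinary data value would be mistaken for an ID, which an explicit flag avoids.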

Comment on lines +173 to +174
},
kwargs.get("errors", "raise"),
Collaborator

this feels like a sneak fix for something unrelated to the PR

Comment on lines +532 to +539
@doc_utils.doc_binary_method(
    operation="multiplication", sign="*", self_on_right=True
)
def rmul(self, other, **kwargs):  # noqa: PR02
    return BinaryDefault.register(pandas.DataFrame.rmul)(
        self, other=other, **kwargs
    )

Collaborator

sneak feature, unrelated to the PR?

Comment on lines -1411 to +1436
-def astype(self, col_dtypes, **kwargs):  # noqa: PR02
+def astype(self, col_dtypes, errors: str):
     """
     Convert columns dtypes to given dtypes.

     Parameters
     ----------
     col_dtypes : dict
         Map for column names and new dtypes.
-    **kwargs : dict
-        Serves the compatibility purpose. Does not affect the result.
+    errors : {"raise", "ignore"}
+        Control raising of exceptions on invalid data for provided dtype.

     Returns
     -------
     BaseQueryCompiler
         New QueryCompiler with updated dtypes.
     """
     return DataFrameDefault.register(pandas.DataFrame.astype)(
-        self, dtype=col_dtypes, **kwargs
+        self, dtype=col_dtypes, errors=errors
Collaborator

sneak fix 🙃 here and below

Comment on lines +3121 to +3125
def take_2d_labels(
    self,
    index,
    columns,
):
Collaborator

why is this function added in this PR?

Comment on lines +1230 to +1233
# In case of lazy execution we should bypass these error checking components
# because they can force the materialization of the row or column labels.
if self._query_compiler.lazy_execution:
    continue
Collaborator

will we raise errors later on then?


return self.qc.take_2d(row_lookup, col_lookup)

def _get_pandas_object_from_qc_view(
Collaborator

why is this refactored here? seems like mixing different changes in single PR

Comment on lines +681 to +688
# If not every element of the key is a scalar, e.g. the key is
# (slice(None), 0), then the key isn't a full key-lookup, and the
# entire key behaves more like a slice than like a scalar.
return (
    isinstance(key, tuple)
    and len(key) == len(multiindex.levels)
    and all(is_scalar(k) for k in key)
)
Collaborator

this feels like a sneak fix of a bug unrelated to this PR, let's not mix things

condition=get_current_execution() == "Client",
reason=(
"client query compiler uses lazy execution, so we don't default "
+ "to pandas for the empty frame because we don't check whether the frame is empty. we can't do the insertion correctly right now without defaulting to pandas."
Collaborator

Suggested change
+ "to pandas for the empty frame because we don't check whether the frame is empty. we can't do the insertion correctly right now without defaulting to pandas."
+ "to pandas for the empty frame because we don't check whether the frame is empty. "
+ "We can't do the insertion correctly right now without defaulting to pandas."

Comment on lines +720 to +743
eval_general(
    modin_simple,
    simple,
    lambda df: df.drop(5),
    check_exception_type=check_exception_type,
)
eval_general(
    modin_simple,
    simple,
    lambda df: df.drop("C", axis=1),
    check_exception_type=check_exception_type,
)
eval_general(
    modin_simple,
    simple,
    lambda df: df.drop([1, 5], axis=1),
    check_exception_type=check_exception_type,
)
eval_general(
    modin_simple,
    simple,
    lambda df: df.drop(["A", "C"], axis=1),
    check_exception_type=check_exception_type,
)
Collaborator

Suggested change
-eval_general(
-    modin_simple,
-    simple,
-    lambda df: df.drop(5),
-    check_exception_type=check_exception_type,
-)
-eval_general(
-    modin_simple,
-    simple,
-    lambda df: df.drop("C", axis=1),
-    check_exception_type=check_exception_type,
-)
-eval_general(
-    modin_simple,
-    simple,
-    lambda df: df.drop([1, 5], axis=1),
-    check_exception_type=check_exception_type,
-)
-eval_general(
-    modin_simple,
-    simple,
-    lambda df: df.drop(["A", "C"], axis=1),
-    check_exception_type=check_exception_type,
-)
+for func in [lambda df: df.drop(5), lambda df: df.drop("C", axis=1), lambda df: df.drop([1, 5], axis=1), lambda df: df.drop(["A", "C"], axis=1)]:
+    eval_general(
+        modin_simple,
+        simple,
+        func,
+        check_exception_type=check_exception_type,
+    )

probably needs black-formatting, though

Comment on lines -716 to +745
-# errors = 'ignore'
+# test errors = 'ignore'
Collaborator

huh?

@mvashishtha mvashishtha marked this pull request as draft November 16, 2022 17:23
@vnlitvinov
Collaborator

@devin-petersohn @mvashishtha do we still need this PR? I think it can be closed for now.

Development

Successfully merging this pull request may close these issues.

Create a Query Compiler that can connect to a service
8 participants