-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#4931: Create a query compiler that can connect to a service #4932
FEAT-#4931: Create a query compiler that can connect to a service #4932
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4932 +/- ##
===========================================
- Coverage 84.98% 68.71% -16.28%
===========================================
Files 253 256 +3
Lines 19113 19841 +728
===========================================
- Hits 16243 13633 -2610
- Misses 2870 6208 +3338
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@devin-petersohn @pyrito CI is still red |
This pull request introduces 1 alert when merging b21b1fd into 170e5de - view on LGTM.com new alerts:
|
modin/core/execution/client/io.py
Outdated
if filepath_or_buffer.startswith("file://"): | ||
# We will do this so that the backend can know whether this | ||
# is a path or a URL. | ||
filepath_or_buffer = filepath_or_buffer[7:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can save fsspec.open(filepath_or_buffer)
and call path
to get the location within the schema. That is probably the cleaner solution here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of dealing with this here, why not have the server handle fsspec
paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pyrito for local paths, we don't want to include the file://
, but for non-local paths like s3://
we do want the prefix
This pull request introduces 1 alert when merging 8ba13c7 into f727c04 - view on LGTM.com new alerts:
|
9a13b94
to
d74cdf0
Compare
dfce918
to
a4ccc7e
Compare
c467ec2
to
8d19d4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good @mvashishtha . Thanks for all the hard work on this. I've added a couple of small nit fixes and asked a couple questions.
def _set_forwarding_groupby_method(method_name: str): | ||
""" | ||
Define a groupby method that forwards arguments to an inner query compiler. | ||
|
||
Parameters | ||
---------- | ||
method_name : str | ||
""" | ||
|
||
def forwarding_method(self, id, by_is_qc, by, *args, **kwargs): | ||
if by_is_qc: | ||
by = self._qc[by] | ||
new_id = self._generate_id() | ||
self._qc[new_id] = getattr(self._qc[id], method_name)(by, *args, **kwargs) | ||
return new_id | ||
|
||
setattr(ForwardingQueryCompilerContainer, method_name, forwarding_method) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to say that this is a super clever way of forwarding methods using getattr
and setattr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Saves many lines of code!
LGTM overall! Thanks, Mahesh for all the hard work! |
This pull request introduces 1 alert when merging b28b83d4f7c18865d39ca64200cc1e5ae5aca5de into a93399c - view on LGTM.com new alerts:
|
48ca937
to
93e066b
Compare
…a service Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Co-authored-by: Karthik Velayutham <karthik.velayutham@gmail.com>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
…t one multiindexing case Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
f7c8a91
to
92ad6dd
Compare
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
@@ -687,7 +691,7 @@ def reset_index(self, **kwargs): | |||
self._modin_frame.reset_index(drop), shape_hint=shape_hint | |||
) | |||
|
|||
def astype(self, col_dtypes, **kwargs): | |||
def astype(self, col_dtypes, errors: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has it changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way the new client query compiler converts types, we can't tell whether things will error until we actually try to cast types within the query compiler. So the query compiler has to know what do with errors. The other query compilers, including base, pandas, and HDK, can ignore the errors
. I did the same with drop
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather one would extract this to a separate PR
cls._data_conn = conn | ||
|
||
@classmethod | ||
def read_csv(cls, filepath_or_buffer, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding is correct, this is a network service, that can read arbitrary file in the file system. It seems like a security issue. I think, absolute file paths must be restricted here. Only paths, relative to the user defined directories must be allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AndreyPavlenko that's a good question to raise. This implementation doesn't directly imply a network service in my opinion (any query compiler can be used in the service model). Restrictions on what types of files used is perhaps out of scope for this PR, but something we should potentially consider in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I don't insist on implementing it in this PR. I think, some configuration capability could be added, to restrict the arbitrary file reading.
Signed-off-by: mvashishtha <mahesh@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through some of the files, still have to cover others, but I would rather publish part of feedback earlier.
|
||
|
||
class StorageFormat(EnvironmentVariable, type=str): | ||
"""Engine to run on a single node of distribution.""" | ||
|
||
varname = "MODIN_STORAGE_FORMAT" | ||
default = "Pandas" | ||
choices = ("Pandas", "Hdk", "Pyarrow", "Cudf") | ||
choices = ("Pandas", "Hdk", "Pyarrow", "Cudf", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather we not use an empty value, as it's impossible in some shells to even set such a variable.
class DefaultToPandasResult(NamedTuple): | ||
""" | ||
The result of ``default_to_pandas``. | ||
|
||
Parameters | ||
---------- | ||
result : Any | ||
The result of the operation. | ||
result_is_qc_id : bool | ||
Whether the result is a query compiler ID. | ||
""" | ||
|
||
result: Any | ||
result_is_qc_id: bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- is there a reason for it to be enclosed in a class?
- why not
@dataclass
?
result_is_qc_id = isinstance(result, self._query_compiler_class) | ||
if result_is_qc_id: | ||
new_id = self._generate_id() | ||
self._qc[new_id] = result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when compilers are removed? as far as I see, this is an ever-lasting memory leak right now
id, | ||
to_replace_is_qc: bool, | ||
regex_is_qc: bool, | ||
to_replace, | ||
value, | ||
inplace, | ||
limit, | ||
regex, | ||
method, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Y u no type hints?
if by_is_qc: | ||
by = self._qc[by] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this all the time here... is there any reason to pass an explicit bool
instead of doing stuff like
if isinstance(by, UUID):
by = self._qc[by]
?
}, | ||
kwargs.get("errors", "raise"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like a sneak fix for something unrelated to the PR
@doc_utils.doc_binary_method( | ||
operation="multiplication", sign="*", self_on_right=True | ||
) | ||
def rmul(self, other, **kwargs): # noqa: PR02 | ||
return BinaryDefault.register(pandas.DataFrame.rmul)( | ||
self, other=other, **kwargs | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sneak feature, unrelated to the PR?
def astype(self, col_dtypes, **kwargs): # noqa: PR02 | ||
def astype(self, col_dtypes, errors: str): | ||
""" | ||
Convert columns dtypes to given dtypes. | ||
|
||
Parameters | ||
---------- | ||
col_dtypes : dict | ||
Map for column names and new dtypes. | ||
**kwargs : dict | ||
Serves the compatibility purpose. Does not affect the result. | ||
errors : {"raise", "ignore"} | ||
Control raising of exceptions on invalid data for provided dtype. | ||
|
||
Returns | ||
------- | ||
BaseQueryCompiler | ||
New QueryCompiler with updated dtypes. | ||
""" | ||
return DataFrameDefault.register(pandas.DataFrame.astype)( | ||
self, dtype=col_dtypes, **kwargs | ||
self, dtype=col_dtypes, errors=errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sneak fix 🙃 here and below
def take_2d_labels( | ||
self, | ||
index, | ||
columns, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this function added in this PR?
@@ -687,7 +691,7 @@ def reset_index(self, **kwargs): | |||
self._modin_frame.reset_index(drop), shape_hint=shape_hint | |||
) | |||
|
|||
def astype(self, col_dtypes, **kwargs): | |||
def astype(self, col_dtypes, errors: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather one would extract this to a separate PR
# In case of lazy execution we should bypass these error checking components | ||
# because they can force the materialization of the row or column labels. | ||
if self._query_compiler.lazy_execution: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will we raise errors later on then?
|
||
return self.qc.take_2d(row_lookup, col_lookup) | ||
|
||
def _get_pandas_object_from_qc_view( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this refactored here? seems like mixing different changes in single PR
# If not every element of the key is a scalar, e.g. the key is | ||
# (slice(None), 0), then the key isn't a full key-lookup, and the | ||
# entire key behaves more like a slice than like a scalar. | ||
return ( | ||
isinstance(key, tuple) | ||
and len(key) == len(multiindex.levels) | ||
and all(is_scalar(k) for k in key) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like a sneak fix of a bug unrelated to this PR, let's not mix things
condition=get_current_execution() == "Client", | ||
reason=( | ||
"client query compiler uses lazy execution, so we don't default " | ||
+ "to pandas for the empty frame because we don't check whether the frame is empty. we can't do the insertion correctly right now without defaulting to pandas." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ "to pandas for the empty frame because we don't check whether the frame is empty. we can't do the insertion correctly right now without defaulting to pandas." | |
+ "to pandas for the empty frame because we don't check whether the frame is empty. " | |
+ "We can't do the insertion correctly right now without defaulting to pandas." |
eval_general( | ||
modin_simple, | ||
simple, | ||
lambda df: df.drop(5), | ||
check_exception_type=check_exception_type, | ||
) | ||
eval_general( | ||
modin_simple, | ||
simple, | ||
lambda df: df.drop("C", axis=1), | ||
check_exception_type=check_exception_type, | ||
) | ||
eval_general( | ||
modin_simple, | ||
simple, | ||
lambda df: df.drop([1, 5], axis=1), | ||
check_exception_type=check_exception_type, | ||
) | ||
eval_general( | ||
modin_simple, | ||
simple, | ||
lambda df: df.drop(["A", "C"], axis=1), | ||
check_exception_type=check_exception_type, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eval_general( | |
modin_simple, | |
simple, | |
lambda df: df.drop(5), | |
check_exception_type=check_exception_type, | |
) | |
eval_general( | |
modin_simple, | |
simple, | |
lambda df: df.drop("C", axis=1), | |
check_exception_type=check_exception_type, | |
) | |
eval_general( | |
modin_simple, | |
simple, | |
lambda df: df.drop([1, 5], axis=1), | |
check_exception_type=check_exception_type, | |
) | |
eval_general( | |
modin_simple, | |
simple, | |
lambda df: df.drop(["A", "C"], axis=1), | |
check_exception_type=check_exception_type, | |
) | |
for func in [lambda df: df.drop(5), lambda df: df.drop("C", axis=1), lambda df: df.drop([1, 5], axis=1), lambda df: df.drop(["A", "C"], axis=1)]: | |
eval_general( | |
modin_simple, | |
simple, | |
func, | |
check_exception_type=check_exception_type, | |
) |
probably needs black
-formatting, though
# errors = 'ignore' | ||
# test errors = 'ignore' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
@devin-petersohn @mvashishtha do we still need this PR? I think it can be closed for now. |
Signed-off-by: Devin Petersohn devin.petersohn@gmail.com
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date