FEAT-#4931: Create a query compiler that can connect to a service #4932

Closed
77 commits
1392493
FEAT-#4931: Create a query compiler that can connect to a service
devin-petersohn Sep 6, 2022
3797403
Fixes to pass CI + docs for io.py
Sep 6, 2022
dd0e7a5
Update implementation
devin-petersohn Sep 6, 2022
026a91c
Fix some things
devin-petersohn Sep 6, 2022
ea0ac1d
Lint fixes
Sep 7, 2022
c18342e
Fix put
devin-petersohn Sep 7, 2022
711c819
Clean up and add new details
devin-petersohn Sep 15, 2022
e5c5f61
Use fsspec to get full path and allow URLs
devin-petersohn Sep 16, 2022
538dd54
Add lazy loc
devin-petersohn Sep 16, 2022
4c3dec6
fixes for tests
batur-ponder Sep 20, 2022
1f9797c
porting more tests
batur-ponder Sep 21, 2022
26d0ddc
more fixes
batur-ponder Sep 21, 2022
2489b33
moar fixes
batur-ponder Sep 21, 2022
3699df4
Raise exception
devin-petersohn Sep 22, 2022
c399ce2
Lint fixes
Sep 22, 2022
c785810
Return Python as the default modin engine
Sep 23, 2022
3e09a7f
Handle indexing case for client qc
Sep 23, 2022
ad0bc7b
Call fast path for __getitem__ if not lazy
Sep 23, 2022
2f4fbf0
Remove user warning for Python-engine fall back
Sep 26, 2022
4b61374
Add init
devin-petersohn Sep 24, 2022
485793c
Implement free as a no-op
devin-petersohn Sep 26, 2022
5d5a617
Add support for replace - client side
helmeleegy Sep 24, 2022
8b16988
Fix a couple of issues with Client
devin-petersohn Sep 26, 2022
4485cc8
Throw errors on to_pandas
devin-petersohn Sep 26, 2022
7fd51b2
Do not default to pandas for str_repeat
helmeleegy Sep 27, 2022
a12fb00
Add support for 18 datetime functions/properties
helmeleegy Sep 30, 2022
613ba25
Fix columns caching when renaming columns
helmeleegy Oct 5, 2022
0450f7c
Fix test_query: put backticks back for col names
helmeleegy Oct 5, 2022
679813c
Add support for astype -- client side
helmeleegy Oct 17, 2022
dff3d54
Make client query compiler consistent with other query compiler. cons…
Oct 25, 2022
18cf725
Fix black.
Oct 25, 2022
ea5dc77
Fix black and flake8.
Oct 25, 2022
773eff0
Hook up IO and test query compiler, but service missing methods that …
Oct 26, 2022
87699b8
Fix up the service and test_general passes with execution 'client'.
Oct 26, 2022
d56db3f
got test_indexing.py to pass, going in order through test-defaults.
Oct 27, 2022
3f9bf8c
ci.yml tests pass through test_map_metadata.
Oct 27, 2022
86d489a
Tests pass through test_reduce.
Oct 27, 2022
6946ae8
pass through test_udf.py and enable another skipped test.
Oct 27, 2022
4f4831b
Pass through test_series, skipping pickle.
Oct 27, 2022
577c989
Tests pass through test_general.
Oct 28, 2022
b74a95c
TestCsv and TestSql pass.
Oct 28, 2022
81583d2
Fix pydocstyle for qc and io.
Oct 28, 2022
7dc093d
REFACTOR: Dedupe single ID service methods.
Oct 28, 2022
7405550
REFACTOR: Dedupe binary code and refactor some is_qc.
Oct 28, 2022
89ba4b0
Fix query compiler refactoring.
Oct 28, 2022
0a3240f
Add a newline for black
Oct 28, 2022
df5b2a5
Make doc_checker work for all new files except container groupby.
Oct 29, 2022
7d2751a
Fix all docstrings and add ci.yml and push.yml.
Oct 29, 2022
aa5be58
Add binary methods from hazem's dfce9189226190bddf6aacab35cbcf44e1a74…
Oct 29, 2022
1e3bdc6
Fix CI falures.
Oct 29, 2022
f9e0605
Fix more tests.
Oct 29, 2022
09c07f9
Fix flake8.
Oct 29, 2022
0f53343
Fix some tests.
Oct 29, 2022
ff94782
Update modin/core/execution/client/io.py
mvashishtha Oct 29, 2022
d4fbf0a
Fix black.
Oct 29, 2022
6119b8e
Fix omnisci by restoring lazy execution check.
Oct 29, 2022
7db25b7
Try fixing Client io yml.
Oct 29, 2022
86bbc75
Make test dataset size normal so I/O tests pass.
Oct 29, 2022
8826c54
Apply suggestions from code review
mvashishtha Oct 31, 2022
584ef10
Address comments.
Oct 31, 2022
1871154
Respond to comments.
Nov 1, 2022
1707390
Fix fuzzydata by making getitem_row_array use numeric=True everywhere.
Nov 1, 2022
e7af275
Pass errors through astype.
Nov 1, 2022
edf99f8
Fix astype errors.
Nov 2, 2022
e74db7c
Use new take_2d_labels for most insertion. test_indexing passes excep…
Nov 7, 2022
b616c28
Actually use client query compiler.
Nov 8, 2022
832556d
Fix multiindex and fix doc_checker.
Nov 8, 2022
51fd254
Fix IO astype bug.
Nov 8, 2022
ebe2719
Make ClientIO use ClientQueryCompiler by default.
Nov 8, 2022
f205801
Debug read_sql.
Nov 8, 2022
92ad6dd
Fix getitem_row_array.
Nov 8, 2022
1d7e494
Fix black and flake8, and add a comment.
Nov 9, 2022
8be834a
Fix getitem_row_array.
Nov 9, 2022
61a7aad
Fix getitem_row_array again.
Nov 9, 2022
07ad5c5
Fix bugs that showed up in CI.
Nov 9, 2022
38ef127
Fix a multiindex Client bug, and fix an hdk astype bug.
Nov 9, 2022
ead877e
Respond to comments.
Nov 10, 2022
18 changes: 17 additions & 1 deletion .github/workflows/ci.yml
@@ -129,6 +129,10 @@ jobs:
- run: python scripts/doc_checker.py modin/core/storage_formats/base
- run: python scripts/doc_checker.py modin/experimental/core/storage_formats/pyarrow
- run: python scripts/doc_checker.py modin/core/storage_formats/pandas
- run: |
python scripts/doc_checker.py modin/core/execution/client/container.py
python scripts/doc_checker.py modin/core/execution/client/io.py
python scripts/doc_checker.py modin/core/execution/client/query_compiler.py
- run: |
python scripts/doc_checker.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/dataframe \
@@ -335,7 +339,7 @@ jobs:
shell: bash -l {0}
strategy:
matrix:
execution: [BaseOnPython]
execution: [BaseOnPython, Client]
env:
MODIN_TEST_DATASET_SIZE: "small"
name: Test ${{ matrix.execution }} execution, Python 3.8
@@ -368,6 +372,8 @@ jobs:
- name: Install HDF5
run: sudo apt update && sudo apt install -y libhdf5-dev
- run: pytest modin/experimental/xgboost/test/test_default.py --execution=${{ matrix.execution }}
# Client execution doesn't need to work with xgboost
if: matrix.execution != 'Client'
- run: python -m pytest -n 2 modin/test/storage_formats/base/test_internals.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_binary.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_default.py --execution=${{ matrix.execution }}
@@ -379,12 +385,22 @@ jobs:
- run: pytest -n 2 modin/pandas/test/dataframe/test_udf.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_window.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_pickle.py --execution=${{ matrix.execution }}
# Client execution doesn't need to pickle modin.pandas objects.
if: matrix.execution != 'Client'
- run: python -m pytest -n 2 modin/pandas/test/test_series.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_rolling.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_concat.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_groupby.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_reshape.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_general.py --execution=${{ matrix.execution }}
- name: Test I/O
# note that if test dataset size is small like for the other tests in
# this job, the tests fail.
run: |
MODIN_TEST_DATASET_SIZE=NORMAL python -m pytest modin/pandas/test/test_io.py::TestCsv --execution=${{ matrix.execution }}
MODIN_TEST_DATASET_SIZE=NORMAL python -m pytest modin/pandas/test/test_io.py::TestSql --execution=${{ matrix.execution }}
# Client has to be able to do CSV and SQL I/O.
if: matrix.execution == 'Client'
- uses: codecov/codecov-action@v2

test-hdk:
10 changes: 10 additions & 0 deletions .github/workflows/push.yml
@@ -84,6 +84,8 @@ jobs:
- name: Install HDF5
run: sudo apt update && sudo apt install -y libhdf5-dev
- run: pytest -n 2 modin/experimental/xgboost/test/test_default.py --execution=${{ matrix.execution }}
# Client execution doesn't need to work with xgboost
if: matrix.execution != 'Client'
- run: pytest -n 2 modin/pandas/test/dataframe/test_binary.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_default.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_indexing.py --execution=${{ matrix.execution }}
@@ -94,12 +96,20 @@ jobs:
- run: pytest -n 2 modin/pandas/test/dataframe/test_udf.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_window.py --execution=${{ matrix.execution }}
- run: pytest -n 2 modin/pandas/test/dataframe/test_pickle.py --execution=${{ matrix.execution }}
# Client execution doesn't need to pickle modin.pandas objects.
if: matrix.execution != 'Client'
- run: python -m pytest -n 2 modin/pandas/test/test_series.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_rolling.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_concat.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_groupby.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_reshape.py --execution=${{ matrix.execution }}
- run: python -m pytest -n 2 modin/pandas/test/test_general.py --execution=${{ matrix.execution }}
- name: I/O tests
run: |
python -m pytest modin/pandas/test/test_io.py::TestCsv --execution=${{ matrix.execution }}
python -m pytest modin/pandas/test/test_io.py::TestSql --execution=${{ matrix.execution }}
# Client has to be able to do CSV and SQL I/O.
if: matrix.execution == 'Client'
- uses: codecov/codecov-action@v2

test-hdk:
11 changes: 6 additions & 5 deletions modin/config/envvars.py
@@ -75,7 +75,7 @@ class Engine(EnvironmentVariable, type=str):
"""Distribution engine to run queries by."""

varname = "MODIN_ENGINE"
choices = ("Ray", "Dask", "Python", "Native")
choices = ("Ray", "Dask", "Python", "Native", "Client")

@classmethod
def _get_default(cls) -> str:
@@ -131,17 +131,18 @@ def _get_default(cls) -> str:
pass
else:
return "Native"
raise ImportError(
"Please refer to installation documentation page to install an engine"
)

# If no other engine can be imported, fall back to Python as the default
# backend engine.
return "Python"
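
The fall-through behaviour in this hunk (return "Python" instead of raising `ImportError` when no other engine is importable) can be sketched in isolation. This is a simplified illustration, not the code in `modin/config/envvars.py`: the helper name `pick_default_engine` and its `candidates` parameter are assumptions made for the example, and it only checks that a package can be found, nothing more.

```python
import importlib.util


def pick_default_engine(candidates=(("ray", "Ray"), ("dask", "Dask"))):
    """Return the first engine whose package is importable, else "Python".

    Hypothetical helper: `candidates` pairs a top-level module name with
    the engine name to report when that module is installed.
    """
    for module_name, engine_name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is not None:
            return engine_name
    # Nothing else is available: fall back instead of raising ImportError.
    return "Python"
```

With this shape, a bare environment still gets a working (if slow) engine rather than failing at import time.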


class StorageFormat(EnvironmentVariable, type=str):
"""Engine to run on a single node of distribution."""

varname = "MODIN_STORAGE_FORMAT"
default = "Pandas"
choices = ("Pandas", "Hdk", "Pyarrow", "Cudf")
choices = ("Pandas", "Hdk", "Pyarrow", "Cudf", "")
Review comment from a collaborator (conversation marked as resolved by mvashishtha):

I would rather we not use an empty value, as it's impossible in some shells to even set such a variable.
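
To make the trade-off in this comment concrete, here is a minimal sketch of validating an environment value against a tuple of choices. The helper `resolve_choice` and the placeholder name `"NoFormat"` are invented for illustration; neither is claimed to exist in Modin.

```python
def resolve_choice(environ, name, choices, default):
    """Hypothetical helper: read `name` from `environ` and validate it."""
    raw = environ.get(name)
    if raw is None:
        return default  # the variable is not set at all
    if raw not in choices:
        raise ValueError(f"{name}={raw!r} is not one of {choices}")
    return raw


NAME = "MODIN_STORAGE_FORMAT"

# Variant from the diff: the empty string is an accepted choice, so the
# variable has to be exported with an empty value to select it.
with_empty = ("Pandas", "Hdk", "Pyarrow", "Cudf", "")
empty_selected = resolve_choice({NAME: ""}, NAME, with_empty, "Pandas")

# Hypothetical alternative: a named placeholder ("NoFormat" is made up)
# that is written the same way as every other choice.
with_named = ("Pandas", "Hdk", "Pyarrow", "Cudf", "NoFormat")
named_selected = resolve_choice({NAME: "NoFormat"}, NAME, with_named, "Pandas")
```

A named placeholder avoids depending on whether the calling shell can export a variable with an empty value, which is the concern raised above.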



class IsExperimental(EnvironmentVariable, type=bool):
72 changes: 59 additions & 13 deletions modin/conftest.py
@@ -22,6 +22,7 @@
import numpy as np
import shutil
from typing import Optional
from uuid import uuid4

assert (
"modin.utils" not in sys.modules
@@ -46,12 +47,17 @@ def _saving_make_api_url(token, _make_api_url=modin.utils._make_api_url):
import modin # noqa: E402
import modin.config # noqa: E402
from modin.config import IsExperimental, TestRayClient # noqa: E402
import uuid # noqa: E402

from modin.core.storage_formats import ( # noqa: E402
PandasQueryCompiler,
BaseQueryCompiler,
)
from modin.core.execution.client.container import ( # noqa: E402
ForwardingQueryCompilerContainer,
)
from modin.core.execution.python.implementations.pandas_on_python.dataframe.dataframe import ( # noqa: E402
PandasOnPythonDataframe,
)
from modin.core.execution.python.implementations.pandas_on_python.io import ( # noqa: E402
PandasOnPythonIO,
)
@@ -63,6 +69,7 @@ def _saving_make_api_url(token, _make_api_url=modin.utils._make_api_url):
make_default_file,
teardown_test_files,
NROWS,
default_to_pandas_ignore_string,
)


@@ -223,9 +230,6 @@ def __iter__(self):
os.environ = orig_env


BASE_EXECUTION_NAME = "BaseOnPython"


class TestQC(BaseQueryCompiler):
def __init__(self, modin_frame):
self._modin_frame = modin_frame
@@ -269,16 +273,57 @@ def prepare(cls):
cls.io_cls = BaseOnPythonIO


def set_base_execution(name=BASE_EXECUTION_NAME):
setattr(factories, f"{name}Factory", BaseOnPythonFactory)
modin.set_execution(engine="python", storage_format=name.split("On")[0])
def set_base_on_python_execution():
factories.BaseOnPythonFactory = BaseOnPythonFactory
modin.set_execution(engine="python", storage_format="Base")


class ClientFactory(factories.BaseFactory):
@classmethod
def prepare(cls):
# Can't always import ClientIO, because it uses NoDefault, which
# is not available on older pandas.
from modin.core.execution.client.io import ClientIO

cls.io_cls = ClientIO


def set_client_execution():
# Can't always import ClientQueryCompiler, because it uses NoDefault, which
# is not available on older pandas. ClientIO also uses ClientQueryCompiler.

from modin.core.execution.client.query_compiler import ClientQueryCompiler
from modin.core.execution.client.io import ClientIO

class TestClientQueryCompiler(ClientQueryCompiler):
@classmethod
def from_pandas(cls, df, data_cls):
return cls(
cls._service.add_query_compiler(TestQC.from_pandas(df, data_cls))
)

def default_to_pandas(self, pandas_op, *args, **kwargs):
result = self._service.default_to_pandas(
self._id, pandas_op, *args, **kwargs
)
if result.result_is_qc_id:
return self.__constructor__(result.result)
return result.result

service = ForwardingQueryCompilerContainer(BaseQueryCompiler, BaseOnPythonIO)
ClientQueryCompiler.set_server_connection(service)
ClientIO.query_compiler_cls = TestClientQueryCompiler
ClientIO.set_server_connection(service)
ClientIO.frame_cls = PandasOnPythonDataframe
factories.ClientFactory = ClientFactory
modin.set_execution(engine="Client", storage_format="")
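
The test setup above wires a client query compiler to an in-process service: the client holds only an ID, and each call either returns a plain value or the ID of a new object kept on the service side (compare `result.result_is_qc_id` in `default_to_pandas`). A toy version of that pattern, using made-up `ToyService` and `ToyClient` classes rather than the real `ForwardingQueryCompilerContainer` API, might look like this:

```python
from uuid import uuid4


class ToyService:
    """Hypothetical stand-in for a service that owns the real objects."""

    def __init__(self):
        self._objects = {}

    def add(self, obj):
        # Store the object and hand back an opaque ID for it.
        obj_id = uuid4()
        self._objects[obj_id] = obj
        return obj_id

    def call(self, obj_id, method_name, *args, **kwargs):
        target = self._objects[obj_id]
        result = getattr(target, method_name)(*args, **kwargs)
        # Results of the same type stay on the service; only an ID goes back.
        if isinstance(result, type(target)):
            return True, self.add(result)
        return False, result


class ToyClient:
    """Holds an ID and forwards every call to the service."""

    def __init__(self, service, obj_id):
        self._service = service
        self._id = obj_id

    def call(self, method_name, *args, **kwargs):
        is_id, result = self._service.call(self._id, method_name, *args, **kwargs)
        if is_id:
            return ToyClient(self._service, result)
        return result


service = ToyService()
client = ToyClient(service, service.add("abc"))
upper = client.call("upper")              # str result: stays on the service
starts = upper.call("startswith", "AB")   # bool result: returned directly
```

The strings here merely stand in for query compilers; the point is that the client never touches the underlying object, only its ID.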


@pytest.fixture(scope="function")
def get_unique_base_execution():
"""Setup unique execution for a single function and yield its QueryCompiler that's suitable for inplace modifications."""
# It's better to use decimal IDs rather than hex ones due to factory names formatting
execution_id = int(uuid.uuid4().hex, 16)
execution_id = int(uuid4().hex, 16)
format_name = f"Base{execution_id}"
engine_name = "Python"
execution_name = f"{format_name}On{engine_name}"
@@ -319,11 +364,12 @@ def pytest_configure(config):
if execution is None:
return

if execution == BASE_EXECUTION_NAME:
set_base_execution(BASE_EXECUTION_NAME)
config.addinivalue_line(
"filterwarnings", "default:.*defaulting to pandas.*:UserWarning"
)
if execution == "BaseOnPython":
set_base_on_python_execution()
config.addinivalue_line("filterwarnings", default_to_pandas_ignore_string)
elif execution == "Client":
set_client_execution()
config.addinivalue_line("filterwarnings", default_to_pandas_ignore_string)
else:
partition, engine = execution.split("On")
modin.set_execution(engine=engine, storage_format=partition)
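
The final `else` branch relies on execution names following a `<StorageFormat>On<Engine>` pattern, which is why `"BaseOnPython"` and `"Client"` are handled before it. A small sketch of that dispatch, with `parse_execution`, `describe`, and `SPECIAL_EXECUTIONS` as illustrative names that are not part of the PR:

```python
# Hypothetical set of names that get a dedicated setup function.
SPECIAL_EXECUTIONS = {"BaseOnPython", "Client"}


def parse_execution(execution):
    """Split a "<StorageFormat>On<Engine>" name into its two parts."""
    parts = execution.split("On")
    if len(parts) != 2:
        raise ValueError(
            f"expected '<StorageFormat>On<Engine>', got {execution!r}"
        )
    storage_format, engine = parts
    return storage_format, engine


def describe(execution):
    """Report whether a name takes a special path or the generic one."""
    if execution in SPECIAL_EXECUTIONS:
        return ("special", execution)
    return ("generic",) + parse_execution(execution)
```

A name such as `"Client"` contains no `"On"` separator, so a bare two-way unpack of `split("On")` would fail for it; handling it in its own branch avoids that.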
Empty file.