REFACTOR-#6012: move experimental dispatchers under modin/experimental/... folder #6011

Merged
merged 6 commits into from
Apr 18, 2023
21 changes: 15 additions & 6 deletions docs/development/architecture.rst
Original file line number Diff line number Diff line change
Expand Up @@ -232,6 +232,10 @@ documentation page on :doc:`contributing </development/contributing>`.
- Uses the Unidist_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`experimental pandas on Unidist </flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index>` page.
- pandas on Dask (experimental)
- Uses the Dask_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`experimental pandas on Dask </flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/index>` page.
- :doc:`HDK on Native </development/using_hdk>` (experimental)
- Uses HDK as an engine.
- The storage format is `hdk` and the in-memory partition type is a pyarrow Table. When defaulting to pandas, the pandas DataFrame is used.
Expand Down Expand Up @@ -344,12 +348,16 @@ details. The documentation covers most modules, with more docs being added every
│ │ │ │ │ └───implementations
│ │ │ │ │ ├─── :doc:`pandas_on_ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>`
│ │ │ │ │ └─── :doc:`pyarrow_on_ray </flow/modin/experimental/core/execution/ray/implementations/pyarrow_on_ray>`
│ │ │ │ └───unidist
│ │ │ │ └───implementations
│ │ │ │ └─── :doc:`pandas_on_unidist </flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index>`
│ │ │ └─── :doc:`storage_formats </flow/modin/experimental/core/storage_formats/index>`
| │ │ ├─── :doc:`hdk </flow/modin/experimental/core/storage_formats/hdk/index>`
│ │ │ └─── :doc:`pyarrow </flow/modin/experimental/core/storage_formats/pyarrow/index>`
│ │ │ │ ├───unidist
│ │ │ │ | └───implementations
│ │ │ │ | └─── :doc:`pandas_on_unidist </flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index>`
| │ | | └───dask
| | | | └───implementations
│ │ │ │ └─── :doc:`pandas_on_dask </flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/index>`
│ │ │ ├─── :doc:`storage_formats </flow/modin/experimental/core/storage_formats/index>`
| │ │ | ├─── :doc:`hdk </flow/modin/experimental/core/storage_formats/hdk/index>`
│ │ │ | └─── :doc:`pyarrow </flow/modin/experimental/core/storage_formats/pyarrow/index>`
| | | └─── :doc:`io </flow/modin/experimental/core/io/index>`
│ │ ├─── :doc:`pandas </flow/modin/experimental/pandas>`
│ │ ├─── :doc:`sklearn </flow/modin/experimental/sklearn>`
│ │ ├───spreadsheet
Expand All @@ -368,6 +376,7 @@ details. The documentation covers most modules, with more docs being added every
.. _Ray: https://github.com/ray-project/ray
.. _Unidist: https://github.com/modin-project/unidist
.. _code: https://github.com/modin-project/modin/blob/master/modin/core/dataframe
.. _Dask: https://github.com/dask/dask
.. _Dask Futures: https://docs.dask.org/en/latest/futures.html
.. _issue: https://github.com/modin-project/modin/issues
.. _Discourse: https://discuss.modin.org
Expand Down
3 changes: 1 addition & 2 deletions docs/flow/modin/core/io/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,7 @@ classes for reading files of different formats.
``_read_rows`` function for moving file offset at the specified amount of rows
and many other functions.

* format/feature specific dispatchers: ``csv_dispatcher.py``, ``csv_glob_dispatcher.py``
(reading multiple files simultaneously, experimental feature), ``excel_dispatcher.py``,
* format/feature specific dispatchers: ``csv_dispatcher.py``, ``excel_dispatcher.py``,
``fwf_dispatcher.py`` and ``json_dispatcher.py``.

* column_stores - directory for storing all columnar store file format dispatcher classes
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
:orphan:

ExperimentalPandasOnDask Execution
==================================

`ExperimentalPandasOnDask` execution keeps the underlying mechanisms of :doc:`PandasOnDask </flow/modin/core/execution/dask/implementations/pandas_on_dask/index>`
execution architecturally unchanged and adds experimental features of ``Data Transformation``, ``Data Ingress`` and ``Data Egress`` (e.g. :py:func:`~modin.experimental.pandas.read_pickle_distributed`).

PandasOnDask and ExperimentalPandasOnDask differences
-----------------------------------------------------

- another Factory ``PandasOnDaskFactory`` -> ``ExperimentalPandasOnDaskFactory``
- another IO class ``PandasOnDaskIO`` -> ``ExperimentalPandasOnDaskIO``

ExperimentalPandasOnDaskIO classes and modules
----------------------------------------------

- :py:class:`~modin.experimental.core.execution.dask.implementations.pandas_on_dask.io.io.ExperimentalPandasOnDaskIO`
- :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnDaskFactory`
- :py:class:`~modin.experimental.core.io.text.csv_glob_dispatcher.ExperimentalCSVGlobDispatcher`
- :py:class:`~modin.experimental.core.io.sql.sql_dispatcher.ExperimentalSQLDispatcher`
- :py:class:`~modin.experimental.core.io.pickle.pickle_dispatcher.ExperimentalPickleDispatcher`
- :py:class:`~modin.experimental.core.io.text.custom_text_dispatcher.ExperimentalCustomTextDispatcher`
- :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser`
- :doc:`ExperimentalPandasOnDask IO module </flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/index>`
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
:orphan:

IO module Description For ExperimentalPandasOnDask Execution
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

High-Level Module Overview
''''''''''''''''''''''''''

This module houses experimental functionality for the pandas storage format and the Dask
engine. The functionality is concentrated in the :py:class:`~modin.experimental.core.execution.dask.implementations.pandas_on_dask.io.io.ExperimentalPandasOnDaskIO`
class, whose methods extend the typical pandas API to give users
more flexibility with IO operations.

Usage Guide
'''''''''''

To use the experimental features, modify the standard Modin import
statement as follows:

.. code-block:: python

    # import modin.pandas as pd
    import modin.experimental.pandas as pd

Submodules Description
''''''''''''''''''''''

The ``modin.experimental.core.execution.dask.implementations.pandas_on_dask`` module primarily houses utils and
functions for the experimental IO class:

* ``io.py`` - submodule containing IO class and parse functions, which are responsible
for data processing on the workers.

Public API
''''''''''

.. autoclass:: modin.experimental.core.execution.dask.implementations.pandas_on_dask.io.io.ExperimentalPandasOnDaskIO
    :members:
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,9 @@ ExperimentalPandasOnRayIO classes and modules

- :py:class:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO`
- :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnRayFactory`
- :py:class:`~modin.core.io.text.csv_glob_dispatcher.CSVGlobDispatcher`
- :py:class:`~modin.experimental.core.io.text.csv_glob_dispatcher.ExperimentalCSVGlobDispatcher`
- :py:class:`~modin.experimental.core.io.sql.sql_dispatcher.ExperimentalSQLDispatcher`
- :py:class:`~modin.experimental.core.io.pickle.pickle_dispatcher.ExperimentalPickleDispatcher`
- :py:class:`~modin.experimental.core.io.text.custom_text_dispatcher.ExperimentalCustomTextDispatcher`
- :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser`
- :doc:`ExperimentalPandasOnRay IO module </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index>`
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
:orphan:

IO module Description For Pandas-on-Ray Execution
"""""""""""""""""""""""""""""""""""""""""""""""""
IO module Description For ExperimentalPandasOnRay Execution
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

High-Level Module Overview
''''''''''''''''''''''''''
Expand Down Expand Up @@ -31,8 +31,6 @@ functions for the experimental IO class:
* ``io.py`` - submodule containing IO class and parse functions, which are responsible
for data processing on the workers.

* ``sql.py`` - submodule with util functions for experimental ``read_sql`` function.

Public API
''''''''''

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,9 @@ ExperimentalPandasOnUnidistIO classes and modules

- :py:class:`~modin.experimental.core.execution.unidist.implementations.pandas_on_unidist.io.io.ExperimentalPandasOnUnidistIO`
- :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnUnidistFactory`
- :py:class:`~modin.core.io.text.csv_glob_dispatcher.CSVGlobDispatcher`
- :py:class:`~modin.experimental.core.io.text.csv_glob_dispatcher.ExperimentalCSVGlobDispatcher`
- :py:class:`~modin.experimental.core.io.sql.sql_dispatcher.ExperimentalSQLDispatcher`
- :py:class:`~modin.experimental.core.io.pickle.pickle_dispatcher.ExperimentalPickleDispatcher`
- :py:class:`~modin.experimental.core.io.text.custom_text_dispatcher.ExperimentalCustomTextDispatcher`
- :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser`
- :doc:`ExperimentalPandasOnUnidist IO module </flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/index>`
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
:orphan:

IO module Description For Pandas-on-Unidist Execution
"""""""""""""""""""""""""""""""""""""""""""""""""""""
IO module Description For ExperimentalPandasOnUnidist Execution
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

High-Level Module Overview
''''''''''''''''''''''''''
Expand Down Expand Up @@ -31,8 +31,6 @@ functions for the experimental IO class:
* ``io.py`` - submodule containing IO class and parse functions, which are responsible
for data processing on the workers.

* ``sql.py`` - submodule with util functions for experimental ``read_sql`` function.

Public API
''''''''''

Expand Down
29 changes: 29 additions & 0 deletions docs/flow/modin/experimental/core/io/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
:orphan:

Experimental IO Module Description
""""""""""""""""""""""""""""""""""

The module is used mostly for storing experimental utils and
dispatcher classes for reading/writing files of different formats.

Submodules Description
''''''''''''''''''''''

* text - directory for storing all text file format dispatcher classes

  * format/feature specific dispatchers: ``csv_glob_dispatcher.py``,
    ``custom_text_dispatcher.py``.

* sql - directory for storing SQL dispatcher class

  * format/feature specific dispatchers: ``sql_dispatcher.py``

* pickle - directory for storing Pickle dispatcher class

  * format/feature specific dispatchers: ``pickle_dispatcher.py``

Public API
''''''''''

.. automodule:: modin.experimental.core.io
    :members:
6 changes: 0 additions & 6 deletions modin/core/io/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,8 @@

from .io import BaseIO
from .text.csv_dispatcher import CSVDispatcher
from .text.csv_glob_dispatcher import CSVGlobDispatcher
from .text.fwf_dispatcher import FWFDispatcher
from .text.json_dispatcher import JSONDispatcher
from .text.custom_text_dispatcher import (
    ExperimentalCustomTextDispatcher,
)
from .text.excel_dispatcher import ExcelDispatcher
from .file_dispatcher import FileDispatcher
from .text.text_file_dispatcher import TextFileDispatcher
Expand All @@ -32,7 +28,6 @@
__all__ = [
"BaseIO",
"CSVDispatcher",
"CSVGlobDispatcher",
"FWFDispatcher",
"JSONDispatcher",
"FileDispatcher",
Expand All @@ -42,5 +37,4 @@
"FeatherDispatcher",
"SQLDispatcher",
"ExcelDispatcher",
"ExperimentalCustomTextDispatcher",
]
Original file line number Diff line number Diff line change
Expand Up @@ -25,13 +25,11 @@
)
from modin.core.storage_formats.pandas.query_compiler import PandasQueryCompiler
from modin.core.execution.dask.implementations.pandas_on_dask.io import PandasOnDaskIO
from modin.core.io import (
    CSVGlobDispatcher,
    ExperimentalCustomTextDispatcher,
)
from modin.experimental.core.io import (
    ExperimentalCSVGlobDispatcher,
    ExperimentalSQLDispatcher,
    ExperimentalPickleDispatcher,
    ExperimentalCustomTextDispatcher,
)

from modin.core.execution.dask.implementations.pandas_on_dask.dataframe import (
Expand Down Expand Up @@ -66,7 +64,7 @@ def __make_write(*classes, build_args=build_args):
        # used to reduce code duplication
        return type("", (DaskWrapper, *classes), build_args).write

    read_csv_glob = __make_read(PandasCSVGlobParser, CSVGlobDispatcher)
    read_csv_glob = __make_read(PandasCSVGlobParser, ExperimentalCSVGlobDispatcher)
    read_pickle_distributed = __make_read(
        ExperimentalPandasPickleParser, ExperimentalPickleDispatcher
    )
Expand Down
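The `__make_read` helper in the hunk above composes an engine wrapper, a parser and a dispatcher into one anonymous class via `type()` and returns its `read` classmethod. A minimal sketch of that composition pattern follows. `ToyWrapper`, `ToyParser`, `ToyDispatcher` and `make_read` are hypothetical stand-ins invented for illustration; they are not Modin classes, and the real wrapper schedules work on the engine rather than running it inline.

```python
class ToyWrapper:
    """Hypothetical stand-in for an engine wrapper such as DaskWrapper."""

    @classmethod
    def deploy(cls, func, *args):
        # A real engine would schedule this remotely; here it runs inline.
        return func(*args)


class ToyParser:
    """Hypothetical parser: turns one line of text into a list of fields."""

    @staticmethod
    def parse(line):
        return line.strip().split(",")


class ToyDispatcher:
    """Hypothetical dispatcher: splits the input and fans out parse tasks."""

    @classmethod
    def read(cls, text):
        # `deploy` and `parse` are not defined here; they are supplied by
        # the other base classes once everything is composed.
        rows = [cls.deploy(cls.parse, line) for line in text.splitlines()]
        return cls.frame_cls(rows)


build_args = dict(frame_cls=list)


def make_read(*classes, build_args=build_args):
    # Compose wrapper + parser + dispatcher into an anonymous class and
    # return its bound `read` classmethod.
    return type("", (ToyWrapper, *classes), build_args).read


read_toy_csv = make_read(ToyParser, ToyDispatcher)
result = read_toy_csv("a,b\n1,2\n")
print(result)
```

Under this pattern, swapping `CSVGlobDispatcher` for `ExperimentalCSVGlobDispatcher` only changes one base class of the composed type, which is why the refactor touches a single line per engine.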
Original file line number Diff line number Diff line change
Expand Up @@ -25,13 +25,11 @@
)
from modin.core.storage_formats.pandas.query_compiler import PandasQueryCompiler
from modin.core.execution.ray.implementations.pandas_on_ray.io import PandasOnRayIO
from modin.core.io import (
    CSVGlobDispatcher,
    ExperimentalCustomTextDispatcher,
)
from modin.experimental.core.io import (
    ExperimentalCSVGlobDispatcher,
    ExperimentalSQLDispatcher,
    ExperimentalPickleDispatcher,
    ExperimentalCustomTextDispatcher,
)
from modin.core.execution.ray.implementations.pandas_on_ray.dataframe import (
    PandasOnRayDataframe,
Expand Down Expand Up @@ -65,7 +63,7 @@ def __make_write(*classes, build_args=build_args):
        # used to reduce code duplication
        return type("", (RayWrapper, *classes), build_args).write

    read_csv_glob = __make_read(PandasCSVGlobParser, CSVGlobDispatcher)
    read_csv_glob = __make_read(PandasCSVGlobParser, ExperimentalCSVGlobDispatcher)
    read_pickle_distributed = __make_read(
        ExperimentalPandasPickleParser, ExperimentalPickleDispatcher
    )
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,13 +27,11 @@
from modin.core.execution.unidist.implementations.pandas_on_unidist.io import (
    PandasOnUnidistIO,
)
from modin.core.io import (
    CSVGlobDispatcher,
    ExperimentalCustomTextDispatcher,
)
from modin.experimental.core.io import (
    ExperimentalCSVGlobDispatcher,
    ExperimentalSQLDispatcher,
    ExperimentalPickleDispatcher,
    ExperimentalCustomTextDispatcher,
)
from modin.core.execution.unidist.implementations.pandas_on_unidist.dataframe import (
    PandasOnUnidistDataframe,
Expand Down Expand Up @@ -67,7 +65,7 @@ def __make_write(*classes, build_args=build_args):
        # used to reduce code duplication
        return type("", (UnidistWrapper, *classes), build_args).write

    read_csv_glob = __make_read(PandasCSVGlobParser, CSVGlobDispatcher)
    read_csv_glob = __make_read(PandasCSVGlobParser, ExperimentalCSVGlobDispatcher)
    read_pickle_distributed = __make_read(
        ExperimentalPandasPickleParser, ExperimentalPickleDispatcher
    )
Expand Down
4 changes: 4 additions & 0 deletions modin/experimental/core/io/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,14 @@

"""Experimental IO functions implementations."""

from .text.csv_glob_dispatcher import ExperimentalCSVGlobDispatcher
from .sql.sql_dispatcher import ExperimentalSQLDispatcher
from .pickle.pickle_dispatcher import ExperimentalPickleDispatcher
from .text.custom_text_dispatcher import ExperimentalCustomTextDispatcher

__all__ = [
    "ExperimentalCSVGlobDispatcher",
    "ExperimentalSQLDispatcher",
    "ExperimentalPickleDispatcher",
    "ExperimentalCustomTextDispatcher",
]
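Because this PR removes `CSVGlobDispatcher` from `modin.core.io` and exposes `ExperimentalCSVGlobDispatcher` from `modin.experimental.core.io`, downstream code that imports the dispatcher directly may need to cope with both layouts. The following is one possible compatibility sketch, not something the PR itself provides; `resolve_first` and `CSV_GLOB_CANDIDATES` are hypothetical names, and only the two module paths and class names come from the diff.

```python
import importlib


def resolve_first(candidates):
    """Return the first importable attribute from (module_path, attr_name) pairs."""
    errors = []
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{module_path}.{attr_name}: {exc!r}")
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


# Post-refactor location first, pre-refactor location as a fallback.
CSV_GLOB_CANDIDATES = [
    ("modin.experimental.core.io", "ExperimentalCSVGlobDispatcher"),
    ("modin.core.io", "CSVGlobDispatcher"),
]

# Usage, only meaningful where Modin is installed:
# dispatcher_cls = resolve_first(CSV_GLOB_CANDIDATES)
```

Listing the new path first means the fallback is only consulted on installations that predate the move.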
14 changes: 14 additions & 0 deletions modin/experimental/core/io/text/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Licensed to Modin Development Team under one or more contributor license agreements.
# See the NOTICE file distributed with this work for additional information regarding
# copyright ownership. The Modin Development Team licenses this file to you under the
# Apache License, Version 2.0 (the "License"); you may not use this file except in
# compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

"""Experimental text format type IO functions implementations."""
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

"""Module houses `CSVGlobDispatcher` class, that is used for reading multiple `.csv` files simultaneously."""
"""Module houses `ExperimentalCSVGlobDispatcher` class, that is used for reading multiple `.csv` files simultaneously."""

from contextlib import ExitStack
import csv
Expand All @@ -30,7 +30,7 @@
from modin.core.io.text.csv_dispatcher import CSVDispatcher


class CSVGlobDispatcher(CSVDispatcher):
class ExperimentalCSVGlobDispatcher(CSVDispatcher):
    """Class contains utils for reading multiple `.csv` files simultaneously."""

    @classmethod
Expand Down
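For readers unfamiliar with what "reading multiple `.csv` files simultaneously" means, here is a conceptual sketch using only the standard library. It is not Modin's implementation: the real dispatcher distributes work through the configured engine, whereas this sketch uses a thread pool, and `read_one` and `read_csv_glob_sketch` are hypothetical names.

```python
import csv
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_one(path):
    """Read one CSV file into (header, rows)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)


def read_csv_glob_sketch(pattern):
    """Read every file matching `pattern` and concatenate the rows."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Files are read concurrently; results come back in path order.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(read_one, paths))
    header = parts[0][0]
    rows = []
    for part_header, part_rows in parts:
        if part_header != header:
            raise ValueError("all files must share the same header")
        rows.extend(part_rows)
    return header, rows


# Build two small files and read them back through one glob pattern.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(os.path.join(tmp, f"part_{i}.csv"), "w", newline="") as f:
            csv.writer(f).writerows([["x", "y"], [i, i * 10]])
    header, rows = read_csv_glob_sketch(os.path.join(tmp, "part_*.csv"))

print(header, rows)
```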