diff --git a/docs/development/architecture.rst b/docs/development/architecture.rst index 023bd3db412..457895e0033 100644 --- a/docs/development/architecture.rst +++ b/docs/development/architecture.rst @@ -232,6 +232,10 @@ documentation page on :doc:`contributing `. - Uses the Unidist_ execution framework. - The storage format is `pandas` and the in-memory partition type is a pandas DataFrame. - For more information on the execution path, see the :doc:`experimental pandas on Unidist ` page. +- pandas on Dask (experimental) + - Uses the Dask_ execution framework. + - The storage format is `pandas` and the in-memory partition type is a pandas DataFrame. + - For more information on the execution path, see the :doc:`experimental pandas on Dask ` page. - :doc:`HDK on Native ` (experimental) - Uses HDK as an engine. - The storage format is `hdk` and the in-memory partition type is a pyarrow Table. When defaulting to pandas, the pandas DataFrame is used. @@ -344,12 +348,16 @@ details. The documentation covers most modules, with more docs being added every │ │ │ │ │ └───implementations │ │ │ │ │ ├─── :doc:`pandas_on_ray ` │ │ │ │ │ └─── :doc:`pyarrow_on_ray ` - │ │ │ │ └───unidist - │ │ │ │ └───implementations - │ │ │ │ └─── :doc:`pandas_on_unidist ` - │ │ │ └─── :doc:`storage_formats ` - | │ │ ├─── :doc:`hdk ` - │ │ │ └─── :doc:`pyarrow ` + │ │ │ │ ├───unidist + │ │ │ │ | └───implementations + │ │ │ │ | └─── :doc:`pandas_on_unidist ` + | │ | | └───dask + | | | | └───implementations + │ │ │ │ └─── :doc:`pandas_on_dask ` + │ │ │ ├─── :doc:`storage_formats ` + | │ │ | ├─── :doc:`hdk ` + │ │ │ | └─── :doc:`pyarrow ` + | | | └─── :doc:`io ` │ │ ├─── :doc:`pandas ` │ │ ├─── :doc:`sklearn ` │ │ ├───spreadsheet @@ -368,6 +376,7 @@ details. The documentation covers most modules, with more docs being added every .. _Ray: https://github.com/ray-project/ray .. _Unidist: https://github.com/modin-project/unidist .. 
_code: https://github.com/modin-project/modin/blob/master/modin/core/dataframe +.. _Dask: https://github.com/dask/dask .. _Dask Futures: https://docs.dask.org/en/latest/futures.html .. _issue: https://github.com/modin-project/modin/issues .. _Discourse: https://discuss.modin.org diff --git a/docs/flow/modin/core/io/index.rst b/docs/flow/modin/core/io/index.rst index 5ec92d4ea96..88054da6142 100644 --- a/docs/flow/modin/core/io/index.rst +++ b/docs/flow/modin/core/io/index.rst @@ -59,8 +59,7 @@ classes for reading files of different formats. ``_read_rows`` function for moving file offset at the specified amount of rows and many other functions. - * format/feature specific dispatchers: ``csv_dispatcher.py``, ``csv_glob_dispatcher.py`` - (reading multiple files simultaneously, experimental feature), ``excel_dispatcher.py``, + * format/feature specific dispatchers: ``csv_dispatcher.py``, ``excel_dispatcher.py``, ``fwf_dispatcher.py`` and ``json_dispatcher.py``. * column_stores - directory for storing all columnar store file format dispatcher classes diff --git a/docs/flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/index.rst b/docs/flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/index.rst new file mode 100644 index 00000000000..543b3381695 --- /dev/null +++ b/docs/flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/index.rst @@ -0,0 +1,25 @@ +:orphan: + +ExperimentalPandasOnDask Execution +================================== + +`ExperimentalPandasOnDask` execution keeps the underlying mechanisms of :doc:`PandasOnDask ` +execution architecturally unchanged and adds experimental features of ``Data Transformation``, ``Data Ingress`` and ``Data Egress`` (e.g. :py:func:`~modin.experimental.pandas.read_pickle_distributed`). 
+ +PandasOnDask and ExperimentalPandasOnDask differences +----------------------------------------------------- + +- another Factory ``PandasOnDaskFactory`` -> ``ExperimentalPandasOnDaskFactory`` +- another IO class ``PandasOnDaskIO`` -> ``ExperimentalPandasOnDaskIO`` + +ExperimentalPandasOnDaskIO classes and modules +---------------------------------------------- + +- :py:class:`~modin.experimental.core.execution.dask.implementations.pandas_on_dask.io.io.ExperimentalPandasOnDaskIO` +- :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnDaskFactory` +- :py:class:`~modin.experimental.core.io.text.csv_glob_dispatcher.ExperimentalCSVGlobDispatcher` +- :py:class:`~modin.experimental.core.io.sql.sql_dispatcher.ExperimentalSQLDispatcher` +- :py:class:`~modin.experimental.core.io.pickle.pickle_dispatcher.ExperimentalPickleDispatcher` +- :py:class:`~modin.experimental.core.io.text.custom_text_dispatcher.ExperimentalCustomTextDispatcher` +- :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser` +- :doc:`ExperimentalPandasOnDask IO module ` diff --git a/docs/flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/index.rst b/docs/flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/index.rst new file mode 100644 index 00000000000..58edd6b58f3 --- /dev/null +++ b/docs/flow/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/index.rst @@ -0,0 +1,38 @@ +:orphan: + +IO module Description For ExperimentalPandasOnDask Execution +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +High-Level Module Overview +'''''''''''''''''''''''''' + +This module houses experimental functionality with pandas storage format and Dask +engine. 
This functionality is concentrated in the :py:class:`~modin.experimental.core.execution.dask.implementations.pandas_on_dask.io.io.ExperimentalPandasOnDaskIO` +class, that contains methods, which extend typical pandas API to give user +more flexibility with IO operations. + +Usage Guide +''''''''''' + +In order to use the experimental features, just modify standard Modin import +statement as follows: + +.. code-block:: python + + # import modin.pandas as pd + import modin.experimental.pandas as pd + +Submodules Description +'''''''''''''''''''''' + +The ``modin.experimental.core.execution.dask.implementations.pandas_on_dask`` module primarily houses utils and +functions for the experimental IO class: + +* ``io.py`` - submodule containing IO class and parse functions, which are responsible + for data processing on the workers. + +Public API +'''''''''' + +.. autoclass:: modin.experimental.core.execution.dask.implementations.pandas_on_dask.io.io.ExperimentalPandasOnDaskIO + :members: diff --git a/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index.rst b/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index.rst index 4e9ddedacb1..c9b17d29045 100644 --- a/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index.rst +++ b/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index.rst @@ -17,6 +17,9 @@ ExperimentalPandasOnRayIO classes and modules - :py:class:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO` - :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnRayFactory` -- :py:class:`~modin.core.io.text.csv_glob_dispatcher.CSVGlobDispatcher` +- :py:class:`~modin.experimental.core.io.text.csv_glob_dispatcher.ExperimentalCSVGlobDispatcher` +- :py:class:`~modin.experimental.core.io.sql.sql_dispatcher.ExperimentalSQLDispatcher` +- 
:py:class:`~modin.experimental.core.io.pickle.pickle_dispatcher.ExperimentalPickleDispatcher` +- :py:class:`~modin.experimental.core.io.text.custom_text_dispatcher.ExperimentalCustomTextDispatcher` - :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser` - :doc:`ExperimentalPandasOnRay IO module ` diff --git a/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index.rst b/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index.rst index 7de322ce4db..357ceacda76 100644 --- a/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index.rst +++ b/docs/flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index.rst @@ -1,7 +1,7 @@ :orphan: -IO module Description For Pandas-on-Ray Execution -""""""""""""""""""""""""""""""""""""""""""""""""" +IO module Description For ExperimentalPandasOnRay Execution +""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" High-Level Module Overview '''''''''''''''''''''''''' @@ -31,8 +31,6 @@ functions for the experimental IO class: * ``io.py`` - submodule containing IO class and parse functions, which are responsible for data processing on the workers. -* ``sql.py`` - submodule with util functions for experimental ``read_sql`` function. 
- Public API '''''''''' diff --git a/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index.rst b/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index.rst index 1c504a3520f..52043231ecc 100644 --- a/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index.rst +++ b/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index.rst @@ -17,6 +17,9 @@ ExperimentalPandasOnUnidistIO classes and modules - :py:class:`~modin.experimental.core.execution.unidist.implementations.pandas_on_unidist.io.io.ExperimentalPandasOnUnidistIO` - :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnUnidistFactory` -- :py:class:`~modin.core.io.text.csv_glob_dispatcher.CSVGlobDispatcher` +- :py:class:`~modin.experimental.core.io.text.csv_glob_dispatcher.ExperimentalCSVGlobDispatcher` +- :py:class:`~modin.experimental.core.io.sql.sql_dispatcher.ExperimentalSQLDispatcher` +- :py:class:`~modin.experimental.core.io.pickle.pickle_dispatcher.ExperimentalPickleDispatcher` +- :py:class:`~modin.experimental.core.io.text.custom_text_dispatcher.ExperimentalCustomTextDispatcher` - :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser` - :doc:`ExperimentalPandasOnUnidist IO module ` diff --git a/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/index.rst b/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/index.rst index ce1cbe4327d..b1fcbaff026 100644 --- a/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/index.rst +++ b/docs/flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/index.rst @@ -1,7 +1,7 @@ :orphan: -IO module Description For Pandas-on-Unidist Execution -""""""""""""""""""""""""""""""""""""""""""""""""""""" +IO module Description For 
ExperimentalPandasOnUnidist Execution +""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" High-Level Module Overview '''''''''''''''''''''''''' @@ -31,8 +31,6 @@ functions for the experimental IO class: * ``io.py`` - submodule containing IO class and parse functions, which are responsible for data processing on the workers. -* ``sql.py`` - submodule with util functions for experimental ``read_sql`` function. - Public API '''''''''' diff --git a/docs/flow/modin/experimental/core/io/index.rst b/docs/flow/modin/experimental/core/io/index.rst new file mode 100644 index 00000000000..820de0b1827 --- /dev/null +++ b/docs/flow/modin/experimental/core/io/index.rst @@ -0,0 +1,29 @@ +:orphan: + +Experimental IO Module Description +"""""""""""""""""""""""""""""""""" + +The module is used mostly for storing experimental utils and +dispatcher classes for reading/writing files of different formats. + +Submodules Description +'''''''''''''''''''''' + +* text - directory for storing all text file format dispatcher classes + + * format/feature specific dispatchers: ``csv_glob_dispatcher.py``, + ``custom_text_dispatcher.py``. + +* sql - directory for storing SQL dispatcher class + + * format/feature specific dispatchers: ``sql_dispatcher.py`` + +* pickle - directory for storing Pickle dispatcher class + + * format/feature specific dispatchers: ``pickle_dispatcher.py`` + +Public API +'''''''''' + +.. 
automodule:: modin.experimental.core.io + :members: diff --git a/modin/core/io/__init__.py b/modin/core/io/__init__.py index f258d0960aa..e542753e474 100644 --- a/modin/core/io/__init__.py +++ b/modin/core/io/__init__.py @@ -15,12 +15,8 @@ from .io import BaseIO from .text.csv_dispatcher import CSVDispatcher -from .text.csv_glob_dispatcher import CSVGlobDispatcher from .text.fwf_dispatcher import FWFDispatcher from .text.json_dispatcher import JSONDispatcher -from .text.custom_text_dispatcher import ( - ExperimentalCustomTextDispatcher, -) from .text.excel_dispatcher import ExcelDispatcher from .file_dispatcher import FileDispatcher from .text.text_file_dispatcher import TextFileDispatcher @@ -32,7 +28,6 @@ __all__ = [ "BaseIO", "CSVDispatcher", - "CSVGlobDispatcher", "FWFDispatcher", "JSONDispatcher", "FileDispatcher", @@ -42,5 +37,4 @@ "FeatherDispatcher", "SQLDispatcher", "ExcelDispatcher", - "ExperimentalCustomTextDispatcher", ] diff --git a/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/io.py b/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/io.py index cde2cc9f959..91367927bd2 100644 --- a/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/io.py +++ b/modin/experimental/core/execution/dask/implementations/pandas_on_dask/io/io.py @@ -25,13 +25,11 @@ ) from modin.core.storage_formats.pandas.query_compiler import PandasQueryCompiler from modin.core.execution.dask.implementations.pandas_on_dask.io import PandasOnDaskIO -from modin.core.io import ( - CSVGlobDispatcher, - ExperimentalCustomTextDispatcher, -) from modin.experimental.core.io import ( + ExperimentalCSVGlobDispatcher, ExperimentalSQLDispatcher, ExperimentalPickleDispatcher, + ExperimentalCustomTextDispatcher, ) from modin.core.execution.dask.implementations.pandas_on_dask.dataframe import ( @@ -66,7 +64,7 @@ def __make_write(*classes, build_args=build_args): # used to reduce code duplication return type("", (DaskWrapper, 
*classes), build_args).write - read_csv_glob = __make_read(PandasCSVGlobParser, CSVGlobDispatcher) + read_csv_glob = __make_read(PandasCSVGlobParser, ExperimentalCSVGlobDispatcher) read_pickle_distributed = __make_read( ExperimentalPandasPickleParser, ExperimentalPickleDispatcher ) diff --git a/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/io.py b/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/io.py index fb9c742384e..a4767eed276 100644 --- a/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/io.py +++ b/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/io.py @@ -25,13 +25,11 @@ ) from modin.core.storage_formats.pandas.query_compiler import PandasQueryCompiler from modin.core.execution.ray.implementations.pandas_on_ray.io import PandasOnRayIO -from modin.core.io import ( - CSVGlobDispatcher, - ExperimentalCustomTextDispatcher, -) from modin.experimental.core.io import ( + ExperimentalCSVGlobDispatcher, ExperimentalSQLDispatcher, ExperimentalPickleDispatcher, + ExperimentalCustomTextDispatcher, ) from modin.core.execution.ray.implementations.pandas_on_ray.dataframe import ( PandasOnRayDataframe, @@ -65,7 +63,7 @@ def __make_write(*classes, build_args=build_args): # used to reduce code duplication return type("", (RayWrapper, *classes), build_args).write - read_csv_glob = __make_read(PandasCSVGlobParser, CSVGlobDispatcher) + read_csv_glob = __make_read(PandasCSVGlobParser, ExperimentalCSVGlobDispatcher) read_pickle_distributed = __make_read( ExperimentalPandasPickleParser, ExperimentalPickleDispatcher ) diff --git a/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/io.py b/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/io.py index 62d98b3880b..16e394bfd2f 100644 --- a/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/io.py +++ 
b/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/io/io.py @@ -27,13 +27,11 @@ from modin.core.execution.unidist.implementations.pandas_on_unidist.io import ( PandasOnUnidistIO, ) -from modin.core.io import ( - CSVGlobDispatcher, - ExperimentalCustomTextDispatcher, -) from modin.experimental.core.io import ( + ExperimentalCSVGlobDispatcher, ExperimentalSQLDispatcher, ExperimentalPickleDispatcher, + ExperimentalCustomTextDispatcher, ) from modin.core.execution.unidist.implementations.pandas_on_unidist.dataframe import ( PandasOnUnidistDataframe, @@ -67,7 +65,7 @@ def __make_write(*classes, build_args=build_args): # used to reduce code duplication return type("", (UnidistWrapper, *classes), build_args).write - read_csv_glob = __make_read(PandasCSVGlobParser, CSVGlobDispatcher) + read_csv_glob = __make_read(PandasCSVGlobParser, ExperimentalCSVGlobDispatcher) read_pickle_distributed = __make_read( ExperimentalPandasPickleParser, ExperimentalPickleDispatcher ) diff --git a/modin/experimental/core/io/__init__.py b/modin/experimental/core/io/__init__.py index a566a220a60..6e281649bd0 100644 --- a/modin/experimental/core/io/__init__.py +++ b/modin/experimental/core/io/__init__.py @@ -13,10 +13,14 @@ """Experimental IO functions implementations.""" +from .text.csv_glob_dispatcher import ExperimentalCSVGlobDispatcher from .sql.sql_dispatcher import ExperimentalSQLDispatcher from .pickle.pickle_dispatcher import ExperimentalPickleDispatcher +from .text.custom_text_dispatcher import ExperimentalCustomTextDispatcher __all__ = [ + "ExperimentalCSVGlobDispatcher", "ExperimentalSQLDispatcher", "ExperimentalPickleDispatcher", + "ExperimentalCustomTextDispatcher", ] diff --git a/modin/experimental/core/io/text/__init__.py b/modin/experimental/core/io/text/__init__.py new file mode 100644 index 00000000000..a7b36189ad0 --- /dev/null +++ b/modin/experimental/core/io/text/__init__.py @@ -0,0 +1,14 @@ +# Licensed to Modin Development Team under one or more 
contributor license agreements. +# See the NOTICE file distributed with this work for additional information regarding +# copyright ownership. The Modin Development Team licenses this file to you under the +# Apache License, Version 2.0 (the "License"); you may not use this file except in +# compliance with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software distributed under +# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF +# ANY KIND, either express or implied. See the License for the specific language +# governing permissions and limitations under the License. + +"""Experimental text format type IO functions implementations.""" diff --git a/modin/core/io/text/csv_glob_dispatcher.py b/modin/experimental/core/io/text/csv_glob_dispatcher.py similarity index 99% rename from modin/core/io/text/csv_glob_dispatcher.py rename to modin/experimental/core/io/text/csv_glob_dispatcher.py index c4f80007b3a..267938ff31d 100644 --- a/modin/core/io/text/csv_glob_dispatcher.py +++ b/modin/experimental/core/io/text/csv_glob_dispatcher.py @@ -11,7 +11,7 @@ # ANY KIND, either express or implied. See the License for the specific language # governing permissions and limitations under the License. 
-"""Module houses `CSVGlobDispatcher` class, that is used for reading multiple `.csv` files simultaneously.""" +"""Module houses `ExperimentalCSVGlobDispatcher` class, that is used for reading multiple `.csv` files simultaneously.""" from contextlib import ExitStack import csv @@ -30,7 +30,7 @@ from modin.core.io.text.csv_dispatcher import CSVDispatcher -class CSVGlobDispatcher(CSVDispatcher): +class ExperimentalCSVGlobDispatcher(CSVDispatcher): """Class contains utils for reading multiple `.csv` files simultaneously.""" @classmethod diff --git a/modin/core/io/text/custom_text_dispatcher.py b/modin/experimental/core/io/text/custom_text_dispatcher.py similarity index 100% rename from modin/core/io/text/custom_text_dispatcher.py rename to modin/experimental/core/io/text/custom_text_dispatcher.py
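Note on the `__make_read` / `__make_write` helpers touched above: the diff only swaps which dispatcher class is passed in (`ExperimentalCSVGlobDispatcher` instead of `CSVGlobDispatcher`), so the composition pattern itself is easy to miss. Below is a minimal, stand-alone sketch of that pattern, modeled on the `__make_write` body visible in the diff. It assumes `__make_read` is the symmetric `.read` counterpart, and every class in it (`ToyWrapper`, `ToyParser`, `ToyDispatcher`) and the `frame_cls` entry in `build_args` are hypothetical stand-ins, not Modin classes.

```python
# Stand-alone sketch: every class below is a hypothetical stand-in, not Modin code.


class ToyWrapper:
    """Stand-in for an engine wrapper such as DaskWrapper."""

    @classmethod
    def deploy(cls, func, *args):
        # A real engine would schedule this remotely; here it runs inline.
        return func(*args)


class ToyParser:
    """Stand-in for a parser class such as PandasCSVGlobParser."""

    @staticmethod
    def parse(chunk):
        # Split one comma-separated chunk into stripped fields.
        return [field.strip() for field in chunk.split(",")]


class ToyDispatcher:
    """Stand-in for a dispatcher class such as ExperimentalCSVGlobDispatcher."""

    @classmethod
    def read(cls, chunks):
        # `deploy`, `parse` and `frame_cls` are resolved on the composed class,
        # not on ToyDispatcher itself.
        rows = [cls.deploy(cls.parse, chunk) for chunk in chunks]
        return cls.frame_cls(rows)


# Hypothetical build arguments; they become class attributes of the composed class.
build_args = dict(frame_cls=tuple)


def make_read(*classes, build_args=build_args):
    # Compose an anonymous class from the wrapper plus the given mixins,
    # then hand back its bound `read` classmethod.
    return type("", (ToyWrapper, *classes), build_args).read


read_toy = make_read(ToyParser, ToyDispatcher)
result = read_toy(["a, b", "c, d"])
print(result)
```

Because the dispatcher is only a mixin in the composed class, moving it to another package (as this change does) should only require updating the import and the argument passed to the helper, which matches the one-line changes in the three `io.py` files.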