PERF-#5740: allow `read_csv`, `read_fwf`, `read_table`, `read_custom_text` functions be executed fully asynchronous; introduce `ModinDtypes` #5713

anmyachev · 2023-02-27T16:30:53Z

What do these changes do?

The PR introduce new type that can be stored in dataframe._dtypes: callable() -> pandas.Series. Together with PR#5677, it makes the execution of read_csv function completely asynchronous. Unlike an index, a dtypes can be restored from internal partitions, however, if zero pickling does not work, this will lead to additional copy operations.
As with the index cache in PR#5677, this PR offers additional functions to make working with the dtypes cache easier: has_dtypes_cache, copy_dtypes_cache.
For backends that do not support this new behavior, it is enough to trigger the dtypes calculation in the constructor, as is done for hdk.
First side effect of this pull request is that ._dtypes field is now always copied if possible. Same reasoning as for the index cache.
Second side effect is if reading occurs in a context that deletes the files from which we are reading, then in some situations, given the asynchronous behavior, this will lead to the fact that by the time reading actually starts, the file may already be deleted. This, for example, affects our tests, which had to be rewritten.
Also, one of the debatable changes in the PR is the removal of the section of code that performs astype function in the case of categorical data types. Turns out we don't have tests for that. However, the reproducer that was used for this change no longer works, and the error is not reproduced. Based on this, I decided to delete this section. Obviously, this is not a sufficient condition for deleting this section, only a necessary one. However, if we really need this code, then it is not clear why it is called only after the read functions, and not after all. As far as I understand, we are not sure that for one column we will have the same type among all partitions.

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves PERF: make dtypes computation lazy for read_csv function #5740
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

This reverts commit 8492f1b. Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

This reverts commit 96dbc39.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev · 2023-02-28T17:45:58Z

modin/core/execution/ray/implementations/pandas_on_ray/io/io.py

-        RayWrapper.materialize(
-            [partition.list_of_blocks[0] for partition in result.flatten()]
-        )
+        qc._modin_frame._partition_mgr_cls.get_objects_from_partitions(result.flatten())


This is mostly a refactoring change, but it's also safer to use since the function triggers the materialization via force_materialization.

modin/pandas/test/test_io.py

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

…defer-dtypes

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

YarShev · 2023-03-14T15:42:56Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+        """
+        dtypes_cache = self._dtypes
+        if dtypes_cache is not None and not callable(dtypes_cache):
+            dtypes_cache = dtypes_cache.copy()


Why do we want to call copy now?

This is a mutable object, we can accidentally change for multiple dataframes when we planned to only for one.

Why have we not come across the issue till this moment?

I think we've already come across this. @dchigarev Have you already fixed something like this?

For example this one. Although it was only about copying, the problem is the same.

modin/core/io/file_dispatcher.py

modin/experimental/core/execution/native/implementations/hdk_on_native/dataframe/dataframe.py

modin/experimental/pandas/test/test_io_exp.py

modin/pandas/test/test_io.py

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

dchigarev · 2023-03-20T11:30:48Z

I think we should introduce the same class wrapper around a callable as was suggested in #5677 (comment).

Operations with dtypes cache have a lot of potential to be optimized (like a delayed cast, partially known dtypes, lazy type-proxies, etc), so I think it's crucial to add a simple class-wrapper at the beginning that could then easily be extended for our future needs.

modin/experimental/pandas/test/test_io_exp.py

modin/pandas/test/test_io.py

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>

…defer-dtypes

anmyachev · 2023-04-05T13:51:35Z

Second side effect is if reading occurs in a context that deletes the files from which we are reading, then in some situations, given the asynchronous behavior, this will lead to the fact that by the time reading actually starts, the file may already be deleted. This, for example, affects our tests, which had to be rewritten.

Should we emphasize this in docs somewhere?

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

dchigarev · 2023-04-05T16:19:35Z

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build:
https://github.com/modin-project/modin/blame/master/docs/flow/modin/config.rst#L26-L28

modin/docs/conf.py

Lines 46 to 50 in 4633639

    
           configs_file_path = os.path.abspath( 
        
               os.path.join(os.path.dirname(__file__), "flow/modin/configs_help.csv") 
        
           ) 
        
           # Export configs help to create configs table in the docs/flow/modin/config.rst 
        
           export_config_help(configs_file_path)

So no additional steps are required to add the variable to the docs.

dchigarev

LGTM!

The only thing that bothers me is the way we materialize in a non-async mode.

modin/core/dataframe/pandas/metadata/dtypes.py

dchigarev · 2023-04-05T16:46:15Z

modin/core/io/file_dispatcher.py

+        if not ExperimentalAsyncReadMode.get() and hasattr(query_compiler, "dtypes"):
+            # at the moment it is not possible to use `wait_partitions` function;
+            # in a situation where the reading function is called in a row with the
+            # same parameters, `wait_partitions` considers that we have waited for
+            # the end of remote calculations, however, when trying to materialize the
+            # received data, it is clear that the calculations have not yet ended.
+            # for example, `test_io_exp.py::test_read_evaluated_dict` is failed because of that
+            _ = query_compiler.dtypes


Interesting, can you please raise an issue regarding this so we can investigate this behavior further? Maybe the first call somehow caches the result and then returns it when we being called for the second time...

…defer-dtypes

modin/core/dataframe/pandas/metadata/dtypes.py

modin/core/io/file_dispatcher.py

YarShev · 2023-04-05T17:17:09Z

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build: https://github.com/modin-project/modin/blame/master/docs/flow/modin/config.rst#L26-L28

modin/docs/conf.py

Lines 46 to 50 in 4633639

configs_file_path = os.path.abspath(

os.path.join(os.path.dirname(__file__), "flow/modin/configs_help.csv")

)

# Export configs help to create configs table in the docs/flow/modin/config.rst

export_config_help(configs_file_path)

So no additional steps are required to add the variable to the docs.

I am still thinking of a page where we could put some notes regarding async read. Maybe here https://modin.readthedocs.io/en/stable/supported_apis/io_supported.html?

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev · 2023-04-05T17:44:14Z

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build: https://github.com/modin-project/modin/blame/master/docs/flow/modin/config.rst#L26-L28

modin/docs/conf.py

Lines 46 to 50 in 4633639

configs_file_path = os.path.abspath(

os.path.join(os.path.dirname(__file__), "flow/modin/configs_help.csv")

)

# Export configs help to create configs table in the docs/flow/modin/config.rst

export_config_help(configs_file_path)

So no additional steps are required to add the variable to the docs.

I am still thinking of a page where we could put some notes regarding async read. Maybe here https://modin.readthedocs.io/en/stable/supported_apis/io_supported.html?

Added

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

YarShev · 2023-04-05T19:52:43Z

@dchigarev, any comments?

anmyachev · 2023-04-05T21:43:24Z

@YarShev @dchigarev can we merge it?

PERF-#0000: make dtypes computation lazy for read_csv function

7727890

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev force-pushed the defer-dtypes branch from 59d4af2 to 7727890 Compare February 27, 2023 16:36

anmyachev added 6 commits February 27, 2023 19:47

remove astype call

8492f1b

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

Revert "remove astype call"

96dbc39

This reverts commit 8492f1b. Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

Revert "Revert "remove astype call""

10638f6

This reverts commit 96dbc39.

try get_objects_from_partitions

b6c5166

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

fixes

3b97976

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

just try

a9b85fd

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev force-pushed the defer-dtypes branch from 3888c41 to a9b85fd Compare February 28, 2023 00:40

anmyachev added 2 commits February 28, 2023 15:19

add max_retries param for to_parquet

e8bbd05

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

fixes

53181dd

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev force-pushed the defer-dtypes branch from 8f51eea to 53181dd Compare February 28, 2023 14:59

anmyachev added 2 commits February 28, 2023 18:20

revert debug changes

ab4d54e

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

add comments

e6b3646

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev commented Feb 28, 2023

View reviewed changes

github-advanced-security bot found potential problems Feb 28, 2023

View reviewed changes

modin/pandas/test/test_io.py Fixed Show fixed Hide fixed

anmyachev added 3 commits March 1, 2023 00:29

just check

d25e65d

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

Merge branch 'master' of https://github.com/modin-project/modin into …

240ad92

…defer-dtypes

allow '_dtypes' be callable

b2ee1f9

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev changed the title ~~PERF-#0000: make dtypes computation lazy for read_csv function~~ PERF-#5740: make dtypes computation lazy for read_csv function Mar 2, 2023

add docstrings

ed9fb05

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev marked this pull request as ready for review March 2, 2023 14:32

anmyachev requested review from a team as code owners March 2, 2023 14:32

anmyachev added the Ready for review label Mar 2, 2023

YarShev reviewed Mar 14, 2023

View reviewed changes

anmyachev and others added 3 commits March 14, 2023 20:03

Update modin/experimental/pandas/test/test_io_exp.py

54bba5a

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>

update comments in other places

6907447

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

fixes

fcca607

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>