Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#5740: allow read_csv, read_fwf, read_table, read_custom_text functions be executed fully asynchronous; introduce ModinDtypes #5713

Merged
merged 57 commits into from
Apr 6, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Feb 27, 2023

What do these changes do?

  • The PR introduce new type that can be stored in dataframe._dtypes: callable() -> pandas.Series. Together with PR#5677, it makes the execution of read_csv function completely asynchronous. Unlike an index, a dtypes can be restored from internal partitions, however, if zero pickling does not work, this will lead to additional copy operations.
  • As with the index cache in PR#5677, this PR offers additional functions to make working with the dtypes cache easier: has_dtypes_cache, copy_dtypes_cache.
  • For backends that do not support this new behavior, it is enough to trigger the dtypes calculation in the constructor, as is done for hdk.
  • First side effect of this pull request is that ._dtypes field is now always copied if possible. Same reasoning as for the index cache.
  • Second side effect is if reading occurs in a context that deletes the files from which we are reading, then in some situations, given the asynchronous behavior, this will lead to the fact that by the time reading actually starts, the file may already be deleted. This, for example, affects our tests, which had to be rewritten.
  • Also, one of the debatable changes in the PR is the removal of the section of code that performs astype function in the case of categorical data types. Turns out we don't have tests for that. However, the reproducer that was used for this change no longer works, and the error is not reproduced. Based on this, I decided to delete this section. Obviously, this is not a sufficient condition for deleting this section, only a necessary one. However, if we really need this code, then it is not clear why it is called only after the read functions, and not after all. As far as I understand, we are not sure that for one column we will have the same type among all partitions.
  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves PERF: make dtypes computation lazy for read_csv function #5740
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This reverts commit 8492f1b.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
RayWrapper.materialize(
[partition.list_of_blocks[0] for partition in result.flatten()]
)
qc._modin_frame._partition_mgr_cls.get_objects_from_partitions(result.flatten())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a refactoring change, but it's also safer to use since the function triggers the materialization via force_materialization.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev changed the title PERF-#0000: make dtypes computation lazy for read_csv function PERF-#5740: make dtypes computation lazy for read_csv function Mar 2, 2023
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev marked this pull request as ready for review March 2, 2023 14:32
@anmyachev anmyachev requested review from a team as code owners March 2, 2023 14:32
"""
dtypes_cache = self._dtypes
if dtypes_cache is not None and not callable(dtypes_cache):
dtypes_cache = dtypes_cache.copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to call copy now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mutable object, we can accidentally change for multiple dataframes when we planned to only for one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have we not come across the issue till this moment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've already come across this. @dchigarev Have you already fixed something like this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example this one. Although it was only about copying, the problem is the same.

modin/core/io/file_dispatcher.py Show resolved Hide resolved
modin/experimental/pandas/test/test_io_exp.py Outdated Show resolved Hide resolved
modin/experimental/pandas/test/test_io_exp.py Outdated Show resolved Hide resolved
modin/pandas/test/test_io.py Outdated Show resolved Hide resolved
anmyachev and others added 3 commits March 14, 2023 20:03
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@dchigarev
Copy link
Collaborator

I think we should introduce the same class wrapper around a callable as was suggested in #5677 (comment).

Operations with dtypes cache have a lot of potential to be optimized (like a delayed cast, partially known dtypes, lazy type-proxies, etc), so I think it's crucial to add a simple class-wrapper at the beginning that could then easily be extended for our future needs.

anmyachev and others added 2 commits April 5, 2023 14:02
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
@anmyachev
Copy link
Collaborator Author

Second side effect is if reading occurs in a context that deletes the files from which we are reading, then in some situations, given the asynchronous behavior, this will lead to the fact that by the time reading actually starts, the file may already be deleted. This, for example, affects our tests, which had to be rewritten.

Should we emphasize this in docs somewhere?

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

@dchigarev
Copy link
Collaborator

dchigarev commented Apr 5, 2023

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build:
https://github.com/modin-project/modin/blame/master/docs/flow/modin/config.rst#L26-L28

modin/docs/conf.py

Lines 46 to 50 in 4633639

configs_file_path = os.path.abspath(
os.path.join(os.path.dirname(__file__), "flow/modin/configs_help.csv")
)
# Export configs help to create configs table in the docs/flow/modin/config.rst
export_config_help(configs_file_path)

So no additional steps are required to add the variable to the docs.

dchigarev
dchigarev previously approved these changes Apr 5, 2023
Copy link
Collaborator

@dchigarev dchigarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

The only thing that bothers me is the way we materialize in a non-async mode.

Comment on lines 159 to 166
if not ExperimentalAsyncReadMode.get() and hasattr(query_compiler, "dtypes"):
# at the moment it is not possible to use `wait_partitions` function;
# in a situation where the reading function is called in a row with the
# same parameters, `wait_partitions` considers that we have waited for
# the end of remote calculations, however, when trying to materialize the
# received data, it is clear that the calculations have not yet ended.
# for example, `test_io_exp.py::test_read_evaluated_dict` is failed because of that
_ = query_compiler.dtypes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, can you please raise an issue regarding this so we can investigate this behavior further? Maybe the first call somehow caches the result and then returns it when we being called for the second time...

@YarShev
Copy link
Collaborator

YarShev commented Apr 5, 2023

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build: https://github.com/modin-project/modin/blame/master/docs/flow/modin/config.rst#L26-L28

modin/docs/conf.py

Lines 46 to 50 in 4633639

configs_file_path = os.path.abspath(
os.path.join(os.path.dirname(__file__), "flow/modin/configs_help.csv")
)
# Export configs help to create configs table in the docs/flow/modin/config.rst
export_config_help(configs_file_path)

So no additional steps are required to add the variable to the docs.

I am still thinking of a page where we could put some notes regarding async read. Maybe here https://modin.readthedocs.io/en/stable/supported_apis/io_supported.html?

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev
Copy link
Collaborator Author

I think the description of ASYNC_READ_MODE variable is enough.
Do you know the place in the documentation where the environment variables are described?

As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build: https://github.com/modin-project/modin/blame/master/docs/flow/modin/config.rst#L26-L28

modin/docs/conf.py

Lines 46 to 50 in 4633639

configs_file_path = os.path.abspath(
os.path.join(os.path.dirname(__file__), "flow/modin/configs_help.csv")
)
# Export configs help to create configs table in the docs/flow/modin/config.rst
export_config_help(configs_file_path)

So no additional steps are required to add the variable to the docs.

I am still thinking of a page where we could put some notes regarding async read. Maybe here https://modin.readthedocs.io/en/stable/supported_apis/io_supported.html?

Added

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@YarShev
Copy link
Collaborator

YarShev commented Apr 5, 2023

@dchigarev, any comments?

@anmyachev
Copy link
Collaborator Author

@YarShev @dchigarev can we merge it?

@YarShev YarShev merged commit 717acf8 into modin-project:master Apr 6, 2023
@anmyachev anmyachev deleted the defer-dtypes branch April 6, 2023 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: make dtypes computation lazy for read_csv function
3 participants