-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#5740: allow read_csv
, read_fwf
, read_table
, read_custom_text
functions be executed fully asynchronous; introduce ModinDtypes
#5713
Conversation
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
59d4af2
to
7727890
Compare
3888c41
to
a9b85fd
Compare
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
8f51eea
to
53181dd
Compare
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
RayWrapper.materialize( | ||
[partition.list_of_blocks[0] for partition in result.flatten()] | ||
) | ||
qc._modin_frame._partition_mgr_cls.get_objects_from_partitions(result.flatten()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly a refactoring change, but it's also safer to use since the function triggers the materialization via force_materialization
.
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
read_csv
function
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
""" | ||
dtypes_cache = self._dtypes | ||
if dtypes_cache is not None and not callable(dtypes_cache): | ||
dtypes_cache = dtypes_cache.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to call copy
now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a mutable object, we can accidentally change for multiple dataframes when we planned to only for one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have we not come across the issue till this moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've already come across this. @dchigarev Have you already fixed something like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example this one. Although it was only about copying, the problem is the same.
modin/experimental/core/execution/native/implementations/hdk_on_native/dataframe/dataframe.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
I think we should introduce the same class wrapper around a callable as was suggested in #5677 (comment). Operations with dtypes cache have a lot of potential to be optimized (like a delayed cast, partially known dtypes, lazy type-proxies, etc), so I think it's crucial to add a simple class-wrapper at the beginning that could then easily be extended for our future needs. |
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
I think the description of |
As far as I understand they're automatically grabbed from a CSV file that is generated during the documentation build: Lines 46 to 50 in 4633639
So no additional steps are required to add the variable to the docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The only thing that bothers me is the way we materialize in a non-async mode.
modin/core/io/file_dispatcher.py
Outdated
if not ExperimentalAsyncReadMode.get() and hasattr(query_compiler, "dtypes"): | ||
# at the moment it is not possible to use `wait_partitions` function; | ||
# in a situation where the reading function is called in a row with the | ||
# same parameters, `wait_partitions` considers that we have waited for | ||
# the end of remote calculations, however, when trying to materialize the | ||
# received data, it is clear that the calculations have not yet ended. | ||
# for example, `test_io_exp.py::test_read_evaluated_dict` is failed because of that | ||
_ = query_compiler.dtypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, can you please raise an issue regarding this so we can investigate this behavior further? Maybe the first call somehow caches the result and then returns it when we being called for the second time...
I am still thinking of a page where we could put some notes regarding async read. Maybe here https://modin.readthedocs.io/en/stable/supported_apis/io_supported.html? |
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Added |
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@dchigarev, any comments? |
@YarShev @dchigarev can we merge it? |
What do these changes do?
dataframe._dtypes
:callable() -> pandas.Series
. Together with PR#5677, it makes the execution ofread_csv
function completely asynchronous. Unlike an index, adtypes
can be restored from internal partitions, however, if zero pickling does not work, this will lead to additional copy operations.has_dtypes_cache
,copy_dtypes_cache
.dtypes
calculation in the constructor, as is done for hdk.._dtypes
field is now always copied if possible. Same reasoning as for the index cache.astype
function in the case of categorical data types. Turns out we don't have tests for that. However, the reproducer that was used for this change no longer works, and the error is not reproduced. Based on this, I decided to delete this section. Obviously, this is not a sufficient condition for deleting this section, only a necessary one. However, if we really need this code, then it is not clear why it is called only after the read functions, and not after all. As far as I understand, we are not sure that for one column we will have the same type among all partitions.flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
read_csv
function #5740docs/development/architecture.rst
is up-to-date