PERF: Consider whether to avoid partition.length() in the parquet dispatcher. #4743

mvashishtha · 2022-08-01T15:34:24Z

Per @YarShev here, we should probably not call partition.length() to get partition sizes in the parquet dispatcher:

Even if we have already materialized the index objects in build_index, calling ray.get for the already-computed size may be expensive (we should check this).
If we haven't materialized the index in build_index, the length() call may block unnecessarily (though maybe something else will block anyway?).
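
To make the concern concrete, here is a minimal sketch using only the standard library. It is not Modin's code: `LazyPartition`, `read_row_group`, and `row_group_sizes` are hypothetical names, and a thread-pool future stands in for a remote object. It assumes the row counts are already known before the read tasks finish (for example from file metadata), so passing them in avoids waiting on the task.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


class LazyPartition:
    """Hypothetical stand-in for a remote partition; not Modin's class."""

    def __init__(self, future: Future, length=None):
        self._future = future
        self._length = length  # optionally known up front

    def length(self) -> int:
        if self._length is None:
            # Without a known length we must wait for the read task to finish.
            self._length = len(self._future.result())
        return self._length


def read_row_group(num_rows: int) -> list:
    time.sleep(0.2)  # simulate slow I/O
    return list(range(num_rows))


# Assumption: these sizes are available before any data is read.
row_group_sizes = [3, 5, 2]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(read_row_group, n) for n in row_group_sizes]

    # Lengths passed in: length() returns immediately.
    start = time.perf_counter()
    known = [LazyPartition(f, n).length() for f, n in zip(futures, row_group_sizes)]
    known_elapsed = time.perf_counter() - start

    # Lengths not passed in: length() blocks until each task completes.
    start = time.perf_counter()
    blocking = [LazyPartition(f).length() for f in futures]
    blocking_elapsed = time.perf_counter() - start

print(known, blocking)
print(f"known: {known_elapsed:.4f}s, blocking: {blocking_elapsed:.4f}s")
```

Both paths return the same lengths; only the second has to wait for the simulated reads. Whether the same gap shows up with real Ray object references is exactly what would need measuring.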


anmyachev · 2022-08-02T08:24:07Z

I found another place that blocks async execution (computing dtypes when they aren't already cached):

modin/modin/core/io/file_dispatcher.py

Line 167 in adb16a1

isinstance(t, kernel_lib.CategoricalDtype) for t in query_compiler.dtypes

jbrockmendel · 2022-08-08T19:17:24Z

could we get the dtypes from the parquet file metadata and avoid the need to call compute_dtypes later?

mvashishtha · 2022-08-09T19:18:01Z

> could we get the dtypes from the parquet file metadata and avoid the need to call compute_dtypes later?

@jbrockmendel AFAIK parquet files have their own types that may not necessarily correspond to the types pandas assigns in read_parquet. @pyrito, you were thinking about this too, did you ever find a case where two datasets with the same parquet types get different pandas types after read_parquet?

pyrito · 2022-08-09T19:54:01Z

@mvashishtha @jbrockmendel I was thinking about this. I haven't confirmed this, but I was wondering if datetime objects would give us some trouble here, especially when we get into timezones and such. These have previously been a pain when I've worked with dask.

jbrockmendel · 2022-08-09T22:44:46Z

Can you give an example? I'm guessing you're referring to pandas.DatetimeTZDtype?

pyrito · 2022-08-11T14:08:54Z

@jbrockmendel I don't have a minimal example I can show off the bat, but I was wondering if pandas.DatetimeTZDtype could cause some trouble here. I've had some problems before, but maybe the type mappings between Arrow and pandas are better now.


mvashishtha added the Performance 🚀 (Performance related issues and pull requests) label Aug 1, 2022

mvashishtha mentioned this issue Aug 1, 2022

PERF-#4305: Parallelize read_parquet over row groups #4700

Merged

8 tasks

mvashishtha mentioned this issue Aug 5, 2022

PERF: pass lengths to build_partition in read_parquet #4785

Closed

pyrito added the P2 (Minor bugs or low-priority feature requests) label Aug 31, 2022

anmyachev mentioned this issue Sep 13, 2022

PERF-#4743: avoid partition.length() in the parquet dispatcher #4960

Closed

8 tasks

anmyachev added a commit to anmyachev/modin that referenced this issue Sep 22, 2022

PERF-modin-project#4743: avoid partition.length() in the parquet disp…

122f4d9

…atcher Signed-off-by: Myachev <anatoly.myachev@intel.com>
