PERF: Consider whether to avoid partition.length() in the parquet dispatcher. #4743
Comments
I found another place that blocks async execution (computing dtypes when they are not cached): modin/modin/core/io/file_dispatcher.py, line 167 in adb16a1
Could we get the dtypes from the parquet file metadata and avoid the need to call compute_dtypes later?
@jbrockmendel AFAIK parquet files have their own types that may not necessarily correspond to the types pandas assigns in
@mvashishtha @jbrockmendel I was thinking about this. I haven't confirmed this, but I was wondering if
Can you give an example? I'm guessing you're referring to pandas.DatetimeTZDtype?
@jbrockmendel I don't have a minimal example I could show off the bat, but I was wondering if pandas.DatetimeTZDtype could cause some trouble here. I've had some problems before, but maybe the type mappings between Arrow and pandas are better now.
…atcher Signed-off-by: Myachev <anatoly.myachev@intel.com>
Per @YarShev here, we should probably not call partition.length() to get partition sizes in the parquet dispatcher:
- In build_index, ray.get for the already computed size may be expensive (we should check this).
- In build_index, the length() call may be unnecessarily blocking (maybe something else will block anyway, though?).