Improve scaling of to_pandas #2814

vnlitvinov · 2021-03-04T11:37:30Z (edited by anmyachev)

There are two reasons to_pandas scales poorly:

We get partitions serially, much like in Improve scaling of from_pandas #2813. This is again a problem for Dask, and it could also be a problem for Ray in a multi-machine setup, where data would be pulled from other nodes one partition at a time instead of all at once. Note: this part is to be solved in Improve scaling of to_pandas; getting all objects from partitions at once #5268.
We reconstruct the dataframe by repeatedly calling pandas.concat(), which in turn does a lot of memory copying because of how pandas manages its internal block structure.
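The first point can be illustrated without Dask or Ray. The sketch below is only a stand-in: `fetch_partition` is a hypothetical function that simulates a remote fetch with a short sleep, and a thread pool plays the role of "ask for everything at once". It is not how Modin itself retrieves partitions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_partition(part_id):
    """Hypothetical remote fetch: sleep to mimic network latency."""
    time.sleep(0.01)
    return pd.DataFrame({"part": [part_id, part_id], "row": [0, 1]})


def get_serially(part_ids):
    # Each fetch waits for the previous one, so latencies add up.
    return [fetch_partition(p) for p in part_ids]


def get_all_at_once(part_ids):
    # All fetches are in flight together; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=len(part_ids)) as pool:
        return list(pool.map(fetch_partition, part_ids))


part_ids = list(range(8))
serial = get_serially(part_ids)
batched = get_all_at_once(part_ids)
```

Both functions return the same partitions in the same order. Only the waiting pattern differs, which is what matters when each fetch crosses the network.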

I wasn't able to find out a good sequence of incantations for Pandas to not copy blocks around during concatenation, so I believe that to improve performance we have to manually build the underlying pandas block structure from a set of pre-allocated blocks. We're solving a much simpler task than regular pandas.concat(): we know that we split the dataframe perfectly when distributing it, so there should not be any conflicts in column names or indices.
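A minimal sketch of the pre-allocation idea, under simplifying assumptions: every partition holds float64 data, and the partitions form a 2D grid that tiles the original frame exactly. `assemble_preallocated` is a hypothetical helper written for illustration. It goes through the public DataFrame constructor rather than building pandas blocks directly, so it only approximates the approach described above.

```python
import numpy as np
import pandas as pd


def assemble_preallocated(grid):
    """Copy each partition once into a single pre-allocated buffer.

    Assumes a rectangular grid of float64 partitions that tile the frame;
    mixed dtypes would need one buffer per dtype.
    """
    row_lens = [row[0].shape[0] for row in grid]
    col_lens = [part.shape[1] for part in grid[0]]
    out = np.empty((sum(row_lens), sum(col_lens)), dtype="float64")

    r0 = 0
    for row, nrows in zip(grid, row_lens):
        c0 = 0
        for part, ncols in zip(row, col_lens):
            # One copy per partition, straight into its final position.
            out[r0:r0 + nrows, c0:c0 + ncols] = part.to_numpy()
            c0 += ncols
        r0 += nrows

    # Labels come from the first column of partitions (index)
    # and the first row of partitions (columns).
    index = np.concatenate([row[0].index.to_numpy() for row in grid])
    columns = np.concatenate([part.columns.to_numpy() for part in grid[0]])
    return pd.DataFrame(out, index=index, columns=columns)


full = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                    columns=list("abcd"))
grid = [
    [full.iloc[0:3, 0:2], full.iloc[0:3, 2:4]],
    [full.iloc[3:6, 0:2], full.iloc[3:6, 2:4]],
]
result = assemble_preallocated(grid)
```

Because the tiling is known in advance, no label alignment is needed, and each value is copied exactly once instead of once per intermediate concat.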


jbrockmendel · 2022-07-08T17:46:11Z

> I wasn't able to find out a good sequence of incantations for Pandas to not copy blocks around during concatenation

Can you give an example of what you're trying to do? This may be fixed in newer pandas:

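import numpy as np
import pandas as pd

arr = np.random.randn(4, 2)
arr2 = np.random.randn(4, 3)
df = pd.DataFrame(arr)
df2 = pd.DataFrame(arr2)

# Concatenate column-wise, then count the blocks in the internal manager.
res = pd.concat([df, df2], axis=1)
print(res._mgr.nblocks)  # reported output: 2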
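One way to check whether a given pandas version copies data in a concatenation like this is to compare buffers with np.shares_memory. This is only a sketch: the answer depends on the pandas version and its copy-on-write settings, so no expected output is shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2))
df2 = pd.DataFrame(np.random.randn(4, 3))
res = pd.concat([df, df2], axis=1)

# True means the first result column still points at df's buffer (no copy).
shares = bool(
    np.shares_memory(df.iloc[:, 0].to_numpy(), res.iloc[:, 0].to_numpy())
)
print(pd.__version__, "shares memory:", shares)
```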


vnlitvinov added the Performance 🚀 (Performance related issues and pull requests) and pandas 🤔 (Weird Behaviors of Pandas) labels Mar 4, 2021

YarShev mentioned this issue Jul 29, 2021

Converting Modin dataframe to Pandas in a more time efficient way #3293

Closed

mvashishtha mentioned this issue Jul 21, 2022

PERF-#4494: Get partition widths/lengths in parallel instead of serially #4683

Draft

8 tasks

anmyachev mentioned this issue Aug 4, 2022

PERF-#5268: Call get on all partitions at once in to_pandas #4776

Merged

8 tasks

mvashishtha mentioned this issue Aug 9, 2022

PERF: __getitem__ #4779

Closed

pyrito added the P2 (Minor bugs or low-priority feature requests) label Aug 23, 2022

anmyachev added a commit to anmyachev/modin that referenced this issue Nov 24, 2022

PERF-modin-project#2814: Call 'get' on all partitions at once in 'to_…

bf4c0aa

…pandas' Signed-off-by: Myachev <anatoly.myachev@intel.com>

anmyachev mentioned this issue Nov 25, 2022

Improve scaling of to_pandas; getting all objects from partitions at once #5268

Closed
