Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve scaling of to_pandas #2814

Open
vnlitvinov opened this issue Mar 4, 2021 · 1 comment
Open

Improve scaling of to_pandas #2814

vnlitvinov opened this issue Mar 4, 2021 · 1 comment
Labels
P2 Minor bugs or low-priority feature requests pandas 🤔 Weird Behaviors of Pandas Performance 🚀 Performance related issues and pull requests.

Comments

@vnlitvinov
Copy link
Collaborator

vnlitvinov commented Mar 4, 2021

There are a few issues with to_pandas poor scaling:

  1. we get partitions serially, much like in Improve scaling of from_pandas #2813, which is again a problem for Dask (but could also be a problem for Ray in multi-machine setup, where it would pull in data from other nodes serially instead of pulling them all at once). Note: this is the part to be solved in Improve scaling of to_pandas; getting all objects from partitions at once #5268
  2. we reconstruct the dataframe by repeatedly calling pandas.concat() which in turn does a lot of memory copying due to how its internal structure is managed.

I wasn't able to find out a good sequence of incantations for Pandas to not copy blocks around during concatenation, so I believe that to improve performance we have to manually construct underlying Pandas block structure into a set of pre-allocated blocks (as we're solving a much simpler task than regular pandas.concat() - we know that we've split the dataframe perfectly when distributing, so there should not be any conflicts in column names or indices).

@vnlitvinov vnlitvinov added Performance 🚀 Performance related issues and pull requests. pandas 🤔 Weird Behaviors of Pandas labels Mar 4, 2021
@jbrockmendel
Copy link
Collaborator

I wasn't able to find out a good sequence of incantations for Pandas to not copy blocks around during concatenation

Can you give an example of what you're trying to do? This may be fixed in newer pandas:

arr = np.random.randn(4, 2)
arr2 = np.random.randn(4, 3)
df = pd.DataFrame(arr)
df2 = pd.DataFrame(arr2)

res = pd.concat([df, df2], axis=1)
>>> res._mgr.nblocks
2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 Minor bugs or low-priority feature requests pandas 🤔 Weird Behaviors of Pandas Performance 🚀 Performance related issues and pull requests.
Projects
None yet
Development

No branches or pull requests

3 participants