
Question: Guaranteed zero-copy round-trip from numpy? #27211

Closed
amueller opened this issue Jul 3, 2019 · 7 comments

Comments


amueller commented Jul 3, 2019

This is for informing a scikit-learn design decision; I briefly talked with @jorisvandenbossche about this a while ago.

The question is whether we can rely on zero-copy wrapping and unwrapping of numpy arrays into pandas DataFrames, i.e. is it future-proof to assume that something like

X = np.array(...)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)

doesn't result in a copy of the data and X_again shares the memory of X?
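One way to check this empirically (a sketch; `round_trip_is_zero_copy` is a hypothetical helper, not a pandas API, and the result depends on the pandas version and its internals):

```python
import numpy as np
import pandas as pd

def round_trip_is_zero_copy(X: np.ndarray) -> bool:
    """Check whether numpy -> DataFrame -> numpy preserves the buffer."""
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)
    # shares_memory is exact (unlike may_share_memory, which can give
    # false positives) and is the right tool for this question.
    return bool(np.shares_memory(X, X_again))

X = np.arange(12.0).reshape(4, 3)
print(round_trip_is_zero_copy(X))
```

Whether this prints `True` depends on pandas storing a single-dtype 2D block as a view; with copy-on-write semantics the constructor may copy instead.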

Context: We want to attach some meta-data to our numpy arrays, in particular I'm interested in column names. Pandas is an obvious candidate for doing that, but core sklearn works on numpy arrays.
So if we want to use pandas, we need to make sure that there's no overhead in wrapping and unwrapping.
And this is a design decision that's very hard to undo, so I want to make sure that it's reasonably future-proof.

@jorisvandenbossche had mentioned that there were thoughts about making pandas a column store, which sounds like it would break the zero copy requirement.

Thanks!

@amueller amueller changed the title Question: Guarantee zero-copy round-trip from numpy Question: Guaranteed zero-copy round-trip from numpy? Jul 3, 2019
@TomAugspurger
Contributor

Today, this is True

In [23]: X = np.random.randint(0, 10, size=(10, 2))

In [24]: pd.DataFrame(X)._data.blocks[0].values.base is X
Out[24]: True

but for better or worse it's possible that a future refactor will change that. We have a long-standing desire to simplify pandas internals, part of which may require storing a DataFrame as a collection of 1D arrays. Those 1D arrays would be views on X, but I don't think there'd be a zero-copy way to go from the slices back to a new ndarray with np.asarray(DataFrame).
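The asymmetry can be illustrated with plain numpy (a sketch of a column-wise layout, not how pandas actually stores blocks): slicing columns out of a 2D array gives zero-copy views, but stitching 1D columns back into a fresh 2D ndarray necessarily allocates.

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)

# A toy column store: each column is a 1D strided view on X (zero-copy).
columns = [X[:, i] for i in range(X.shape[1])]
assert all(np.shares_memory(col, X) for col in columns)

# Reassembling the columns into a new 2D ndarray allocates new memory,
# so the "unwrap" direction cannot be zero-copy in a column-wise layout.
X_again = np.column_stack(columns)
assert not np.shares_memory(X_again, X)
assert (X_again == X).all()
```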

@amueller
Author

amueller commented Jul 3, 2019

I know it's true right now ;)

The question might be how likely the change is and on what timeframe. But it sounds like it is a bad idea for us to bank on this staying the same, right?
It totally makes sense for pandas to do that but is a bit unfortunate for us.

xarray would be another option but it seems a bit weird to produce xarrays if the user inputs pandas dataframes. We could also output pandas dataframes until pandas changes the internal structure and then change to xarray? But that all introduces a bunch of uncertainties.

@amueller
Author

amueller commented Jul 3, 2019

This makes me a bit sad, because it means my dream of a "pandas in, pandas out" scikit-learn seems unrealistic unless we accept numerous avoidable data copies.

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 3, 2019

Can't add much to what Tom already said, but: if the use case is to add metadata to numpy arrays going in, it might indeed not be very future-proof to use pandas DataFrames for that if you want to avoid copying the data (possibly twice: once going in, and again when converting back to numpy).
But if your data is already pandas going in, I think scikit-learn should try to avoid converting it into an array until it is fed into an algorithm that requires a 2D array. For example, most preprocessors could work column-wise. That of course requires more changes in scikit-learn (and more dependence on pandas itself for doing the work, which is a tricky point).
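A rough sketch of the column-wise idea (the scaler below is hypothetical, not scikit-learn's API): a preprocessor that transforms one 1D column at a time, so the DataFrame never has to be converted to a single 2D ndarray.

```python
import numpy as np
import pandas as pd

def scale_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical column-wise standard scaler (zero mean, unit std)."""
    out = {}
    for name in df.columns:
        col = df[name].to_numpy()  # 1D array, a view where possible
        out[name] = (col - col.mean()) / col.std()
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
scaled = scale_columns(df)
```

Of course, anything genuinely multivariate (PCA, nearest neighbors, ...) still needs the full 2D array, which is exactly the tension discussed below.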

The question might be how likely the change is and on what timeframe

That's always difficult to answer in open source ;)
If you asked me, and if we had proper funding, I would see this as one of the first things to work on. But realistically speaking, it will take a few years, or maybe never happen (there is no clear decision on the path forward or consensus of the full team yet).

@amueller
Author

amueller commented Jul 3, 2019

Hope y'all are applying to Chan Zuckerberg?

Ok but it sounds like this might not be a good solution for us.
@jorisvandenbossche I agree that it might be nice, but for many estimators not converting at all would mean a completely separate code path, right?
We could try, but any multivariate stuff would still require conversion.

There could be a "feature names until you do something multivariate" mode, but that's also a bit weird. We should probably discuss this in our SLEP, and not here.

But I think my original question is answered in that doing wrapping and unwrapping with zero-copy is not realistic long-term.

@amueller amueller closed this as completed Jul 3, 2019
@jorisvandenbossche
Member

We once had a discussion about having two different data structures to meet such needs if we move toward a column-wise store (like a DataFrame and a DataMatrix; pandas had a DataMatrix long ago). A DataMatrix would be limited to a single dtype and stored as a 2D array (and maybe also a fixed number of columns?). It's only an idea that was floated once, never really worked out, and given the additional complexity it's potentially not a good idea for a project with limited resources. But it might be interesting to think about if there are specific needs.

@amueller
Author

amueller commented Jul 3, 2019

Well, there's DataArray in xarray that we could use. Or we could add our own, because that'll be fun, right? ;)
