
Question: Guaranteed zero-copy round-trip from numpy? #27211

Closed
amueller opened this issue Jul 3, 2019 · 7 comments

Comments


amueller commented Jul 3, 2019

This is for informing a scikit-learn design decision; I briefly talked with @jorisvandenbossche about this a while ago.

The question is whether we can rely on zero-copy wrapping and unwrapping of numpy arrays into pandas DataFrames, i.e. is it future-proof to assume that something like

X = np.array(...)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)

doesn't result in a copy of the data and X_again shares the memory of X?
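One way to check this empirically (a sketch; `round_trip_is_zero_copy` is a hypothetical helper, not a pandas API, and the result depends on the pandas version and its internals):

```python
import numpy as np
import pandas as pd

def round_trip_is_zero_copy(X: np.ndarray) -> bool:
    """Check whether numpy -> DataFrame -> numpy preserves the buffer."""
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)
    # shares_memory is exact (unlike may_share_memory, which can give
    # false positives) and is the right tool for this question.
    return bool(np.shares_memory(X, X_again))

X = np.arange(12.0).reshape(4, 3)
print(round_trip_is_zero_copy(X))
```

Whether this prints `True` depends on pandas storing a single-dtype 2D block as a view; with copy-on-write semantics the constructor may copy instead.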

Context: We want to attach some meta-data to our numpy arrays, in particular I'm interested in column names. Pandas is an obvious candidate for doing that, but core sklearn works on numpy arrays.
So if we want to use pandas, we need to make sure that there's no overhead in wrapping and unwrapping.
And this is a design decision that's very hard to undo, so I want to make sure that it's reasonably future-proof.

@jorisvandenbossche had mentioned that there were thoughts about making pandas a column store, which sounds like it would break the zero copy requirement.

Thanks!

@amueller amueller changed the title Question: Guarantee zero-copy round-trip from numpy Question: Guaranteed zero-copy round-trip from numpy? Jul 3, 2019
@TomAugspurger
Contributor

Today, this is True

In [23]: X = np.random.randint(0, 10, size=(10, 2))

In [24]: pd.DataFrame(X)._data.blocks[0].values.base is X
Out[24]: True

but for better or worse it's possible that a future refactor will change that. We have a long-standing desire to simplify pandas internals, part of which may require storing a DataFrame as a collection of 1D arrays. Those 1D arrays would be views on X, but I don't think there'd be a zero-copy way to go from the slices back to a new ndarray with np.asarray(DataFrame).
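The asymmetry can be illustrated with plain numpy (a sketch of a column-wise layout, not how pandas actually stores blocks): slicing columns out of a 2D array gives zero-copy views, but stitching 1D columns back into a fresh 2D ndarray necessarily allocates.

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)

# A toy column store: each column is a 1D strided view on X (zero-copy).
columns = [X[:, i] for i in range(X.shape[1])]
assert all(np.shares_memory(col, X) for col in columns)

# Reassembling the columns into a new 2D ndarray allocates new memory,
# so the "unwrap" direction cannot be zero-copy in a column-wise layout.
X_again = np.column_stack(columns)
assert not np.shares_memory(X_again, X)
assert (X_again == X).all()
```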

@amueller
Author

amueller commented Jul 3, 2019

I know it's true right now ;)

The question might be how likely the change is and on what timeframe. But it sounds like it is a bad idea for us to bank on this staying the same, right?
It totally makes sense for pandas to do that but is a bit unfortunate for us.

xarray would be another option but it seems a bit weird to produce xarrays if the user inputs pandas dataframes. We could also output pandas dataframes until pandas changes the internal structure and then change to xarray? But that all introduces a bunch of uncertainties.

@amueller
Author

amueller commented Jul 3, 2019

This makes me a bit sad, because it means my dream of a "pandas in, pandas out" scikit-learn seems unrealistic unless we accept numerous avoidable data copies.

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 3, 2019

Can't add much to what Tom already said, but: if the use case is to add metadata to numpy arrays going in, it might indeed not be very future-proof to use pandas DataFrames for that if you want to avoid copying the data (possibly twice: once going in, and again when converting back to numpy).
But if your data is already pandas going in, I think scikit-learn should try to avoid converting it into an array until it is fed into an algorithm that requires a 2D array. For example, most preprocessors could work column-wise. That of course requires more changes in scikit-learn (and more dependence on pandas itself for doing the work, which is a tricky point).
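A rough sketch of the column-wise idea (the scaler below is hypothetical, not scikit-learn's API): a preprocessor that transforms one 1D column at a time, so the DataFrame never has to be converted to a single 2D ndarray.

```python
import numpy as np
import pandas as pd

def scale_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical column-wise standard scaler (zero mean, unit std)."""
    out = {}
    for name in df.columns:
        col = df[name].to_numpy()  # 1D array, a view where possible
        out[name] = (col - col.mean()) / col.std()
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
scaled = scale_columns(df)
```

Of course, anything genuinely multivariate (PCA, nearest neighbors, ...) still needs the full 2D array, which is exactly the tension discussed below.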

The question might be how likely the change is and on what timeframe

That's always difficult to answer in open source ;)
If you asked me, and if we had proper funding, I would see this as one of the first things to work on. But realistically speaking, it will take a few years, or maybe never happen (there is no clear decision on the path forward or consensus of the full team yet).

@amueller
Author

amueller commented Jul 3, 2019

Hope y'all are applying to Chan Zuckerberg?

Ok but it sounds like this might not be a good solution for us.
@jorisvandenbossche I agree that it might be nice, but for many estimators not converting at all would mean a completely separate code path, right?
We could try, but any multivariate stuff would still require conversion.

There could be a "feature names until you do something multivariate" mode, but that's also a bit weird. We should probably discuss this in our SLEP, and not here.

But I think my original question is answered in that doing wrapping and unwrapping with zero-copy is not realistic long-term.

@amueller amueller closed this as completed Jul 3, 2019
@jorisvandenbossche
Member

We once had a discussion about having two different data structures to meet such needs if we move toward a column-wise store (like a DataFrame and a DataMatrix; pandas had a DataMatrix long ago). A DataMatrix would be limited to a single dtype and stored as a 2D array (and maybe also a fixed number of columns?). It's only an idea that was floated once, never really worked out, and given the additional complexity it's potentially not a good idea for a project with limited resources. But it might be interesting to think about if there are specific needs.

@amueller
Author

amueller commented Jul 3, 2019

Well, there's DataArray in xarray that we could use. Or we could add our own, because that'll be fun, right? ;)
