Should Awkward Arrays be usable as Pandas columns? #350
Comments
I think we may have an interesting use case with sktime, but not sure if that justifies the extra maintenance burden. We want to represent a variety of time series formats, including univariate, multivariate, panel data, unequal length and unequally sampled time series data. We currently (ab)use pandas by storing entire time series as a
For time series analysis, 1) seems important. 2) would be nice to have, but not essential.
This has gone slightly off-topic; please let me know if there's a better place to discuss this! For more info, see our condensed data container discussion here.
The above argument by @mloning is related to an earlier enquiry, #289. Unfortunately, I haven't yet had the time to make a deep dive into the options that @jpivarski laid out in #289 to represent time indices. However, even without time indices, this gist should illustrate what sktime is hoping to achieve by using Awkward Arrays as pandas columns (for now assuming time is simply represented by position in the array). As @mloning mentioned, if this is the only use case, it might not justify the extra maintenance burden on you. In this case, maybe there is an option to factor the Awkward-Array-as-ExtensionArray into a separate package?
I slightly disagree with @prockenschaub and @mloning here, since I think that time series* are a pretty important use case that, to my knowledge, none of the existing data container solutions handles particularly well. While it would indeed put the maintenance burden on you (and not on us 😃), I'd see it as a potential solution to a long-standing annoyance - the eternal search for a great family of data containers for time series* - and therefore with potential to become a "pillar of data science"... *univariate, multivariate, panel data, unequal length and unequally sampled time series data, as @mloning says.
So, where do I vote?
(This is the vote. It's informal.) I'm reading what you've written above and also logged into https://gitter.im/Scikit-HEP/awkward-array so we can chat in real time.
So far, I see three things that you need: (1) time-valued data, (2) data-valued index, and (3) complex data structures.
(1) Pandas has always had good handling of time-valued data (from my perspective as someone who doesn't use time-valued data much).
(2) The data-valued index is, I think, the thing that sets Pandas (1d) and xarray (nd) apart from NumPy (nd). This isn't a failure of NumPy, either: it's a lower level component that handles the data in the arrays, whereas Pandas and xarray are higher level components that manage what the data means through indexing. Awkward Array has been targeting that lower-level slot, too: it's designed to handle x-y data as two arrays, rather than a unified object like a Pandas Series. I've left a placeholder in the implementation called an array's
(3) Awkward Array handles complex data structures in a unique way, which can't be done in a relational-like structure such as Pandas without multiple tables (DataFrames).
So the real issue here is that you want all three: you can get (1) and (2) from Pandas/xarray and (3) from Awkward, but not all together. Coming back to the original point of this thread, I'm not 100% sure that putting Awkward Arrays in Pandas DataFrames and Series will make that happen, since the data will be physically contained in those containers, but without operations that know how to use it, it's not much use.
Moving conversation over to Gitter now...
A key function in that example is |
* jupyter-books 0.7.3 no longer supports 'headers'.
* Update GitHub README to reflect the focus on tutorials.
* Tweak sizes and port to setup.py.
* Drop test_0090 in light of #350 and the fact that it's now broken.
Awkward arrays as Pandas columns will be deprecated. The next release will present a deprecation warning when you try to use an Awkward array in Pandas (as a Series or a DataFrame column) and it will be removed in 0.3.0. The ak.pandas.df and ak.pandas.dfs functions will be combined and renamed as ak.to_pandas for consistency. The new function name already exists and the old ones will be removed in 0.3.0.
The next release I deploy will be 0.3.0 and will not have the Awkward-as-Pandas-column feature.
The specific issue of
That's the essential motivation for ExtensionArrays: a way for pandas and these black boxes of arrays to interact through a well-defined interface. For example, cyberpandas provides vectorized implementations of ipaddress operations to pandas. pandas doesn't need to know about the memory layout of cyberpandas (a 2D int64 ndarray) or any IP operations for this to work. Now, the interface is relatively young. Some things work and some things (as you've discovered) don't. But it is improving with each release.
I personally wouldn't recommend making general-purpose objects like AwkwardArray try to implement pandas' Extension Array interface. As you note, there are some public methods that might clash with implementations in AwkwardArray. And I've never had good experiences making base classes dynamic. I'd instead recommend a dedicated object that implements the interface. This raises some issues around putting
As a general point though, the extension array interface is still evolving. If you run into issues please do speak up, either here or on the pandas issue tracker!
This issue described the problems involved in making Awkward arrays subclasses of the Pandas ExtensionArray, particularly as dynamic subclasses, and justified the decision to drop this original design requirement. The difficulties encountered and gaps in usefulness once implemented are surmountable technical problems, but for the stability of the Awkward Array library, I had to remove the dynamic subclassing. In the future, it would be great if we could make Awkward arrays into Pandas columns through a loose coupling like
In particular, I'm interested in getting this to work on cuDF, which is introducing ListDtype and StructDtype into its data model and is backed by Apache Arrow. (See #359.) If these column types are not black boxes but something that is understood by the dataframe class, then it could make sense to introduce non-NumPy, non-Pandas functions like ak.cartesian to the dataframe, implemented by Awkward. Showing that this is a usable interface, sensible for analysis, on a specialized dataframe like cuDF (which is only implemented for GPUs) would make a good argument for bringing it to Pandas and Dask DataFrame, showing that the changes required to make that work are justified. Or maybe by implementing it in cuDF, we might find that the first draft of an interface is wrong and needs to be tweaked.
By the way, the above is purely my own aspiration, not a formal plan. I've been talking with the cuDF developers on their Slack, but only in the sense of floating this idea.
This was one of the design goals described in the original motivations document, but it has required some non-intuitive sorcery to implement and it's not clear to me that it's a valuable feature. To be clear, we're talking about
and not
The explicit conversion into a MultiIndex DataFrame with ak.pandas.df has no issues: the implementation is straightforward and I know how I would use it—there are plenty of Pandas functions for dealing with MultiIndex. For example,
But for the Awkward-in-Pandas, the only things I know of that can be used directly are ufuncs:
but not all ufuncs, for some Pandas reason:
Presumably, we could narrow in on that reason and get it to work, but there are a lot of Pandas functions to test. The fundamental problem is that Awkward objects are "black boxes" to Pandas. Sure, we can put them in a DataFrame, but what's Pandas going to do with them once they're there?
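The contrast between the two approaches can be sketched with Pandas alone. This is not the original example from this issue and does not use Awkward Array at all: the ragged data is a plain Python list, and the index names "entry" and "subentry" are chosen here for illustration. It only shows why a flattened MultiIndex table is usable by ordinary Pandas operations while an object-dtype column of nested values is opaque to them.

```python
import pandas as pd

# Ragged data: three entries holding 2, 0, and 3 values respectively.
ragged = [[1, 2], [], [3, 4, 5]]

# (a) Explicit flattening: one row per inner value, indexed by
#     (entry, subentry). The index names are illustrative assumptions.
rows = [(i, j, x) for i, inner in enumerate(ragged) for j, x in enumerate(inner)]
flat = pd.DataFrame(rows, columns=["entry", "subentry", "value"])
flat = flat.set_index(["entry", "subentry"])

# Ordinary Pandas tools apply directly, e.g. a per-entry sum over the
# first MultiIndex level. The empty entry (1) has no rows, so it is absent.
per_entry_sum = flat["value"].groupby(level="entry").sum()

# (b) Opaque storage: one Python list per cell. Pandas only sees
#     dtype=object and has no knowledge of the structure inside each cell.
opaque = pd.Series(ragged)

print(per_entry_sum)
print(opaque.dtype)
```

In (a) Pandas understands every value, at the cost of losing empty entries; in (b) the structure is preserved but any per-element work has to go through Python-level functions such as `Series.apply`.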
There are other downsides to making Awkward Arrays subclasses of pandas.core.arrays.base.ExtensionArray (so that they can be columns). For one thing, it implies that we have to import pandas at startup, which can cost up to a second on slow machines or might try to import a broken installation of Pandas even if the user isn't planning on using Pandas. (If Pandas is not installed, we can change the class hierarchy, but that means ak.Array behaves differently, depending on whether you've installed Pandas, even if you're not using it.)
To avoid the above, the current implementation only makes ak.Array inherit from pandas.core.arrays.base.ExtensionArray if you try to use it in Pandas, which can be detected by a call to dtype. But for consistency, that's even worse, since the inheritance of ak.Array now changes at runtime, depending on whether you've ever tried to use an Awkward Array in a DataFrame. This came up in a difference in behavior (reported on Slack) that I couldn't reproduce at first because my test didn't invoke Pandas.
Namely, pandas.core.arrays.base.ExtensionArray defines some methods, and these methods exist or don't exist on ak.Array, depending on that history, unless they're overshadowed by my own implementations. At the very least, I should overshadow all the non-underscored ones so that their existence is not history-dependent, but it fills up the ak.Array namespace with names I don't necessarily want.
* to_numpy: This would be fine; it would call ak.to_numpy, though the other methods don't have an underscore, such as tolist (for consistency with NumPy).
* dtype: Already tricky, since Pandas requires a new one, AwkwardDType, and Dask requires np.dtype("O").
* shape: Pandas needs this to be one-dimensional, which is misleading for an Awkward Array. Preferably, Awkward Arrays would have no shape at all; the combined dtype and shape can only be fully captured by ak.type.
* ndim: Much like shape, it's misleading for this to always be 1.
* nbytes: This is fine, and other libraries expect such a property, too.
* astype: This was the surprise that triggered this issue: I didn't think Awkward Arrays had an astype, since it's not clear what it should mean. For changing numeric types, there's an open PR, "Operation to change the number type" (#346), but it's a new function since it doesn't change the whole type of the array; it descends to the leaves where the numbers are.
* isna: This can go to ak.is_none, though "na" is not how we refer to missing data.
* argsort: This can go to ak.argsort.
* fillna: This can go to ak.fill_none, but see the note on isna above.
* dropna: We don't have an ak.drop_none, but such a thing wouldn't be too hard to write.
* shift: This one only makes sense for rectangular tables. (See the definition.)
* unique: We don't have an ak.unique and there could be some subtleties there. We don't have a definition for record equality, for example, and string equality is already handled through a behavioral extension.
* searchsorted: Only makes sense if the data are actually sorted. Should there be an axis=1 version of this for variable-length lists? Usually, physics events are unsorted but the particles (axis=1) are sorted by pT.
* factorize: This is a non-intuitive name, but it could be good to have an Awkward function that turns arrays into an IndexedArray of unique values. But for complex objects like records, this brings up the same issues as unique (above).
* repeat: We don't have an ak.repeat, but that might be useful in some contexts. I usually find np.repeat and np.tile to be a pair that have to be used together, usually to make a Cartesian product (and we already have ak.cartesian).
* take: This seems unnecessary to me, since we already have __getitem__ with integer arrays.
* copy: I don't know if we have a high-level "copy" function, but we have the low-level ones to link it up.
* view: This wouldn't make much sense for an Awkward Array. It's not a simple buffer.
* ravel: Maybe the equivalent of this is ak.flatten? Flattening variable-length arrays, particularly ones that include records, is a different kind of thing from flattening rectilinear data.
Given these mismatches, I'm strongly considering removing the Awkward-in-Pandas feature before Awkward1 actually becomes 1.0. The explicit conversion functions, ak.pandas.df and ak.pandas.dfs, would be kept.
But I might be wrong—there might be some fantastic use-case for Awkward-in-Pandas that I don't know about. This question is an informal vote on the feature. You might have been sent here by an error message, where the feature is provisionally removed with a way to opt-in. If you find it useful to include Awkward Arrays inside of Pandas DataFrames (distinct from the ak.pandas.df conversion), then say so here, describing the use-case. You can opt-in now by calling ak.pandas.register(), but if I don't hear from people saying that they really use it, the feature will be removed and you won't be able to use it past 1.0.
So let me know!
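For readers who want to see why runtime inheritance makes behavior history-dependent, here is a minimal pure-Python sketch. It is not the Awkward Array implementation: `ExtensionBase`, `Array`, `_NoPandas`, and `_register` are hypothetical stand-ins for pandas.core.arrays.base.ExtensionArray, ak.Array, and the registration step, and swapping `__bases__` is just one way such a mechanism could be built.

```python
class ExtensionBase:
    """Hypothetical stand-in for pandas.core.arrays.base.ExtensionArray."""

    def astype(self, dtype):
        return f"astype({dtype}) inherited from ExtensionBase"


class _NoPandas:
    """Placeholder base used until the array is first touched by 'Pandas'."""


class Array(_NoPandas):
    """Hypothetical stand-in for ak.Array."""

    def __init__(self, data):
        self.data = data

    @property
    def dtype(self):
        # Accessing dtype is treated as the sign that Pandas is inspecting
        # the array, so the base class is swapped in at this moment.
        _register()
        return "AwkwardDType"


def _register():
    # Swap the base class once; every existing and future instance changes.
    if ExtensionBase not in Array.__mro__:
        Array.__bases__ = (ExtensionBase,)


a = Array([[1, 2], [], [3]])
before = hasattr(a, "astype")  # no inherited method yet
_ = a.dtype                    # first "Pandas" contact triggers the swap
after = hasattr(a, "astype")   # the same object now has the method

print(before, after)
```

The same object answers `hasattr(a, "astype")` differently depending on whether `dtype` was ever accessed anywhere in the process, which is the kind of difference that is hard to reproduce in a test that never invokes Pandas.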