-
-
Notifications
You must be signed in to change notification settings - Fork 366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pandas ExtensionArray for storing homogeneous ragged arrays #687
Conversation
datashader/datatypes.py
Outdated
# This is a workaround (hack?) to keep pandas.lib.infer_dtype from | ||
# "raising cannot infer type" ValueError error when calling: | ||
# >>> pd.Series([[0, 1], [1, 2, 3]], dtype='ragged') | ||
self._values = self._flat_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hack to work around ValueError: cannot infer type for <class 'NoneType'>
in pandas._libs.lib.infer_dtype
Hi @TomAugspurger, I was wondering if you'd have a little time to look this over. Any feedback/thoughts you have on the extension array approach would be appreciated, but there are two specific things I wanted to ask you about that I think are related to pandas 0.24rc1.
However, I ran into an error when trying to do the same with a pandas
Looking at the source for I surmised that I could work around the error by setting a Thanks! And thanks for all of your efforts on the |
Opened pandas-dev/pandas#24751 for the categorical issue. That was a byproduct of a refactor (but the old way is being deprecated so datashader will want to update).
Right. pandas should rather unbox the Series / Index to the array before attempting to infer the dtype. I'll have to look bit more closely at the infer_dtype failure. In theory, this should reproduce the failure In [13]: from pandas.tests.extension.arrow.bool import *
In [14]: pd.array([True, False], dtype='arrow_bool')
Out[14]:
ArrowBoolArray(<pyarrow.lib.ChunkedArray object at 0x114371060>
[
[
true,
false
]
])
In [15]: pd.Series([True, False], dtype='arrow_bool')
Out[15]:
0 True
1 False
dtype: arrow_bool It's possible that the raggedness of the input is sending things down a different code path (but it shouldn't, when a dtype is specified). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you see this docs section on testing: http://pandas-docs.github.io/pandas-docs-travis/extending.html#testing-extension-arrays
Most of the interface functionality will be tested for you, if you inherit from these base test classes, and provide a few fixtures.
I'll take a closer look at the implementation tomorrow. Please let me know if you ran into any difficulties implementing the interface. Your feedback would be very valuable.
Thanks for pointing out the testing docs, I had missed those. I’ll implement these this evening. Thanks a lot for your help! |
Something in the fixes for these tests removed the need to for the ._values hack!
@TomAugspurger Ok, I added the test suite provided by pandas and that helped a lot. I'm not exactly sure what made the difference, but once I got the pandas test suite passing I didn't need the One change that might have been the culprit was that the I would definitely appreciate any other comments you have on the extension classes, but I think I've made it through what I was stuck on. Thanks again! |
Just to be clear, there are a whole bunch of tests you can inherit, not just BaseDtypeTests. See e.g. https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/decimal/test_decimal.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. I'm curious to see what the other tests turn up. Some may fail through no fault of your own, as the base tests will often create an object
-dtype ndarray for the "expected" result, and NumPy / pandas will likely fail to handle the shape correctly.
The EA docs recommend overriding a few methods for performance:
* fillna
* dropna
* unique
* factorize / _values_for_factorize
* argsort / _values_for_argsort
* searchsorted
- PERF: non-numeric fillna pandas-dev/pandas#20300 has what should be a fasting filllna.
- Not sure about unique / factorize in this case
- Not sure about sorting either. How do you define it? Lexicographic?
It'll be good to add this to the ecosystem page at http://pandas.pydata.org/pandas-docs/version/0.24/ecosystem.html#extension-data-types once this is ready. |
FYI, I'll probably ask you to be my guinea pig on dask/dask#4379 (adding ExtensionArray support to dask.dataframe) once that's ready :) |
@TomAugspurger Ok, cool. Thanks for pointing out the additional test cases that are available. I'll try them out and report back. I'll also look over the performance optimization overrides and see what makes sense there. I'd also like to work out a way to avoid converting everything to tuples for factorization/sorting. Regarding testing dask/dask#4379, definitely! I was actually just starting to look at what it would take to get this working in Dask, and it didn't seem straightforward 🙂 I'd be happy to try this out when you say the word. |
Yes, I'll try to think of something here has well. Does Datashader have a hard dependency on numba? That may be helpful here. |
Yes, datashader has a hard dependency on numba. One approach I was considering was to have |
I can't think of any meaningful way to sort a ragged array column; seems like they will generally be sorted based on other columns (e.g. by a name column). So whatever is most convenient to support for sorting seems fine to me; it doesn't seem like something that would get used often. |
from start_indices, flat_array, and mask arrays
Looks good to me! A few questions:
|
Not at the moment, but I don't think it would be too hard to add an alternative
The code will need to be updated at least a little bit since this flattened representation won't have the
Here are two approaches I considered that don't include a separate mask.
So I was concerned about hitting the worst case in scenario (1), and scenario (2) felt like a hack that would generally be less space efficient than adding a 1-byte mask array.
So far it would be very easy. We just need to avoid introducing dependencies back on the rest of the Datashader project. |
What are the pros and cons of including the nans in it? For the mask, would it not be sufficient to have a zero-length array (start index same as the following start index) to indicate a missing element? Or are you concerned that it's important to be able to distinguish between "missing ragged item" and "empty ragged item"? |
Yeah, I was assuming the need to distinguish between empty and missing. Though I suppose Datashader doesn't really need this distinction for lines/polygons. Without this distinction, you're right that |
# Conflicts: # datashader/glyphs.py # datashader/tests/test_pandas.py
RaggedArray line rendering rendering added! import pandas as pd
import numpy as np
from datashader import Canvas
import datashader.transfer_functions as tf
import datashader as ds
df_ragged = pd.DataFrame({
'A1': pd.array([
[1, 1.5], [2, 2.5, 3], [1.5, 2, 3, 4], [3.2, 4, 5]],
dtype='Ragged[float32]'),
'B1': pd.array([
[10, 12], [11, 14, 13], [10, 7, 9, 10], [7, 8, 12]],
dtype='Ragged[float32]'),
'group': pd.Categorical([0, 1, 2, 1])
})
df Set agg = cvs.line(df_ragged, x='A1', y='B1', axis=1)
tf.spread(tf.shade(agg)) Now use agg = cvs.line(df_ragged, x='A1', y='B1', agg=ds.count_cat('group'), axis=1)
tf.spread(tf.shade(agg)) In terms of documentation, the two examples above have been added to the @jbednar Pending CI tests, this is ready for review. |
To address AppVeyor failures
No reason to skip every combination, and this was causing pytest-xdist to throw an internal error when running tests in parallel
Added optimized Dask auto-range calculation logic consistent with #717 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Presumably needs rebasing and merge resolution before merging, and I have some minor comments, but I'm happy to merge it after that.
datashader/datatypes.py
Outdated
Newly introduced missing values are filled with | ||
``self.dtype.na_value``. | ||
|
||
.. versionadded:: 0.24.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.24.0 is a pandas version, not datashader, right? Presumably not supposed to be listed here or elsewhere below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in 1538909
datashader/datatypes.py
Outdated
Returns | ||
------- | ||
uniques : ExtensionArray | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not to copy docstrings unmodified from the parent class, whether trivial (as here) or complicated (below). Basically, if the parent class defines the semantics, I want the reader to refer to the parent class, not to this possibly outdated copy of the docstring; that way people know to go find it in the parent, rather than thinking this actually covers everything. Conversely, if there is a docstring here, I think it should be customized to just be about RaggedArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 1538909
datashader/datatypes.py
Outdated
Returns | ||
------- | ||
filled : ExtensionArray with NA/NaN filled | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring seems just copied from the parent class, but if there are differences in behavior from ExtensionArray, please describe those here and refer to the parent class for anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 1538909
Co-Authored-By: jonmmease <jon.mease@gmail.com>
datashader/datatypes.py
Outdated
@@ -123,17 +123,26 @@ class RaggedDtype(ExtensionDtype): | |||
|
|||
@property | |||
def name(self): | |||
""" | |||
See docstring for ExtensionDtype.name | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With most docs/API tools, docstrings will simply be inherited as-is if you don't specify one here, so please remove these altogether unless they need to say something explicitly about how this method relates to that of the parent class. You can mention the parent class explicitly in the class docstring, once, with something like "Methods not otherwise documented here are inherited from ExtensionDtype; please see the corresponding method on that class for the docstring".
Looks good; thanks! Happy to merge once the merge conflict is addressed and the docstring stubs are removed as indicated above. |
Thanks for the review, should be done now |
Overview
This PR introduces a pandas
ExtensionArray
for storing a column of homogeneous ragged 1D arrays. The Datashader motivation for ragged arrays is to make it possible to store variable-length lines (fixing problems like #464) and eventually polygons (#181) as elements of a column in a DataFrame. Using one such shape per row makes it simpler to store associated columns of data for use with selections and filtering, hovering, etc.This PR currently contains only the extension array and associated testing.
Implementation
RaggedArray
is a subclass ofpandas.api.extension.ExtensionArray
with aRaggedDtype
that is a subclass ofpandas.api.extension.ExtensionDtype
.RaggedDtype
takes advantage of the@register_extension_dtype
decorator introduced in pandas 0.24rc1 to register itself with pandas as a datatype named'ragged'
.NOTE: This branch currently requires pandas 0.24rc1
A ragged array of length
n
is represented by three numpy arrays:mask
: A boolean array of lengthn
where values ofTrue
represent missing/NA valuesflat_array
: An array with the same datatype as the ragged array element and with a length equal to the sum of the length of all of the ragged array elements.start_indices
: An unsigned integer array of lengthn
of indices intoflat_array
corresponding to the start of the ragged array element. For space efficiency, the precision of the unsigned integer is chosen to be the smallest available that is capable of indexing the last element inflat_array
.Example Usage