Add pandas ExtensionArray for storing homogeneous ragged arrays #687

Merged
merged 48 commits into from
Mar 1, 2019

Conversation

jonmmease
Collaborator

@jonmmease jonmmease commented Jan 13, 2019

Overview

This PR introduces a pandas ExtensionArray for storing a column of homogeneous ragged 1D arrays. The Datashader motivation for ragged arrays is to make it possible to store variable-length lines (fixing problems like #464) and eventually polygons (#181) as elements of a column in a DataFrame. Using one such shape per row makes it simpler to store associated columns of data for use with selections and filtering, hovering, etc.

This PR currently contains only the extension array and associated testing.

Implementation

RaggedArray is a subclass of pandas.api.extension.ExtensionArray with a RaggedDtype that is a subclass of pandas.api.extension.ExtensionDtype. RaggedDtype takes advantage of the @register_extension_dtype decorator introduced in pandas 0.24rc1 to register itself with pandas as a datatype named 'ragged'.

NOTE: This branch currently requires pandas 0.24rc1

A ragged array of length n is represented by three numpy arrays:

  • mask: A boolean array of length n where values of True represent missing/NA values
  • flat_array: An array with the same datatype as the ragged array element and with a length equal to the sum of the length of all of the ragged array elements.
  • start_indices: An unsigned integer array of length n of indices into flat_array corresponding to the start of the ragged array element. For space efficiency, the precision of the unsigned integer is chosen to be the smallest available that is capable of indexing the last element in flat_array.
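The three-array layout described above can be sketched with plain Python lists standing in for the numpy arrays. This is only an illustration of the idea, not the PR's implementation; the helper names `build_ragged`, `get_element`, and `smallest_uint_bits` are hypothetical.

```python
def build_ragged(rows):
    """Flatten a list of lists (or None) into (flat_array, start_indices, mask)."""
    flat_array, start_indices, mask = [], [], []
    for row in rows:
        start_indices.append(len(flat_array))  # where this element begins
        mask.append(row is None)               # True marks a missing element
        if row is not None:
            flat_array.extend(float(v) for v in row)
    return flat_array, start_indices, mask


def get_element(flat_array, start_indices, mask, i):
    """Recover element i; its end is the next start index (or the flat length)."""
    if mask[i]:
        return None
    stop = start_indices[i + 1] if i + 1 < len(start_indices) else len(flat_array)
    return flat_array[start_indices[i]:stop]


def smallest_uint_bits(max_index):
    """Smallest unsigned integer width (in bits) able to hold max_index."""
    for bits in (8, 16, 32, 64):
        if max_index < 2 ** bits:
            return bits
    raise OverflowError("index does not fit in 64 bits")


flat, starts, mask = build_ragged([[1, 2], [], [10, 20], None, [11, 22, 33, 44]])
```

Note that an empty element and a missing element both occupy zero slots in the flat array; only the mask tells them apart.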

Example Usage

In[1]: from datashader.datatypes import RaggedArray
In[2]: ra = RaggedArray([[1, 2], [], [10, 20], None, [11, 22, 33, 44]])
In[3]: ra
Out[3]: 
<RaggedArray>
[            array([1., 2.]),    array([], dtype=float64),
           array([10., 20.]),                        None,
 array([11., 22., 33., 44.])]
Length: 5, dtype: <class 'datashader.datatypes.RaggedDtype'>

In[4]: ra.flat_array
Out[4]: array([ 1.,  2., 10., 20., 11., 22., 33., 44.])

In[5]: ra.start_indices
Out[5]: array([0, 2, 2, 4, 4], dtype=uint8)

In[6]: ra.mask
Out[6]: array([False, False, False,  True, False])

In[7]: pd.array([[1, 2], [], [10, 20], None, [11, 22, 33, 44]], dtype='ragged')
Out[7]: 
<RaggedArray>
[            array([1., 2.]),    array([], dtype=float64),
           array([10., 20.]),                        None,
 array([11., 22., 33., 44.])]
Length: 5, dtype: <class 'datashader.datatypes.RaggedDtype'>

In[8]: rs = pd.Series([[1, 2], [], [10, 20], None, [11, 22, 33, 44]], dtype='ragged')
In[9]: rs
Out[9]: 
0              [1. 2.]
1                   []
2            [10. 20.]
3                 None
4    [11. 22. 33. 44.]
dtype: ragged

In[10]: ragged_subset = rs.loc[[0, 1, 4]]
In[11]: ragged_subset
Out[11]: 
0              [1. 2.]
1                   []
4    [11. 22. 33. 44.]
dtype: ragged

In[12]: ragged_subset.array.mask
Out[12]: array([False, False, False])

In[13]: ragged_subset.array.flat_array
Out[13]: array([ 1.,  2., 11., 22., 33., 44.])

In[14]: ragged_subset.array.start_indices
Out[14]: array([0, 2, 2], dtype=uint8)
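The subset above ends up with a compact flat array of its own rather than a view onto the original. A minimal sketch of how such a selection could work, again with plain lists in place of numpy arrays; `take_ragged` is a hypothetical helper, not a function from this PR.

```python
def take_ragged(flat_array, start_indices, mask, positions):
    """Select rows by position, copying only the values those rows own."""
    new_flat, new_starts, new_mask = [], [], []
    for i in positions:
        start = start_indices[i]
        stop = start_indices[i + 1] if i + 1 < len(start_indices) else len(flat_array)
        new_starts.append(len(new_flat))   # start index within the new flat array
        new_mask.append(mask[i])
        if not mask[i]:
            new_flat.extend(flat_array[start:stop])
    return new_flat, new_starts, new_mask


flat = [1.0, 2.0, 10.0, 20.0, 11.0, 22.0, 33.0, 44.0]
starts = [0, 2, 2, 4, 4]
mask = [False, False, False, True, False]
sub_flat, sub_starts, sub_mask = take_ragged(flat, starts, mask, [0, 1, 4])
```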

# This is a workaround (hack?) to keep pandas._libs.lib.infer_dtype from
# raising a "cannot infer type" ValueError when calling:
# >>> pd.Series([[0, 1], [1, 2, 3]], dtype='ragged')
self._values = self._flat_array
Collaborator Author

Hack to work around ValueError: cannot infer type for <class 'NoneType'> in pandas._libs.lib.infer_dtype

@jonmmease
Collaborator Author

Hi @TomAugspurger, I was wondering if you'd have a little time to look this over. Any feedback/thoughts you have on the extension array approach would be appreciated, but there are two specific things I wanted to ask you about that I think are related to pandas 0.24rc1.

  1. It looks like Datashader had been using categorical_series.cat.categorical.ordered to check for an ordered categorical and it seems the categorical property of the CategoricalAccessor has been removed in 0.24rc1. Changing this to categorical_series.cat.ordered works fine and is more direct, but I just wanted to double check whether this change in pandas was intentional.

  2. I'm using the new register_extension_dtype decorator to register the RaggedDtype with pandas as 'ragged'. And this makes it possible to construct the ragged array using the new pd.array constructor. e.g.:

pd.array([[1, 2], [], [10, 20], None, [11, 22, 33, 44]], dtype='ragged')

<RaggedArray>
[            array([1., 2.]),    array([], dtype=float64),
           array([10., 20.]),                        None,
 array([11., 22., 33., 44.])]
Length: 5, dtype: <class 'datashader.datatypes.RaggedDtype'>

However, I ran into an error when trying to do the same with a pandas Series:

pd.Series([[1, 2], [], [10, 20], None, [11, 22, 33, 44]], dtype='ragged')

Traceback (most recent call last):
  File "pandas/_libs/lib.pyx", line 1201, in pandas._libs.lib.infer_dtype
TypeError: Cannot convert RaggedArray to numpy.ndarray
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/anaconda3/envs/datashader_dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-3fd52f24dd69>", line 1, in <module>
    pd.Series([[1, 2], [], [10, 20], None, [11, 22, 33, 44]], dtype='ragged')
  File "/anaconda3/envs/datashader_dev/lib/python3.6/site-packages/pandas/core/series.py", line 262, in __init__
    raise_cast_failure=True)
  File "/anaconda3/envs/datashader_dev/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 674, in sanitize_array
    inferred = lib.infer_dtype(subarr, skipna=False)
  File "pandas/_libs/lib.pyx", line 1208, in pandas._libs.lib.infer_dtype
ValueError: cannot infer type for <class 'NoneType'>

Looking at the source for lib.infer_dtype

https://github.com/pandas-dev/pandas/blob/48c3ce5b5b89fa53b3d4ac9c23d4c22da5e86493/pandas/_libs/lib.pyx#L1201

I surmised that I could work around the error by setting a ._values property in the RaggedArray constructor to a numpy array. But this doesn't seem like the right thing to do. To reproduce the error, comment out this line at the bottom of the RaggedArray constructor.

Thanks! And thanks for all of your efforts on the ExtensionArray system. This is awesome work and really opens up a whole new world for pandas!

@TomAugspurger

Opened pandas-dev/pandas#24751 for the categorical issue. That was a byproduct of a refactor (but the old way is being deprecated so datashader will want to update).

But this doesn't seem like the right thing to do.

Right. pandas should rather unbox the Series / Index to the array before attempting to infer the dtype. I'll have to look a bit more closely at the infer_dtype failure. In theory, this should reproduce the failure:

In [13]: from pandas.tests.extension.arrow.bool import *

In [14]: pd.array([True, False], dtype='arrow_bool')
Out[14]:
ArrowBoolArray(<pyarrow.lib.ChunkedArray object at 0x114371060>
[
  [
    true,
    false
  ]
])

In [15]: pd.Series([True, False], dtype='arrow_bool')
Out[15]:
0     True
1    False
dtype: arrow_bool

It's possible that the raggedness of the input is sending things down a different code path (but it shouldn't, when a dtype is specified).

@TomAugspurger TomAugspurger left a comment

Did you see this docs section on testing: http://pandas-docs.github.io/pandas-docs-travis/extending.html#testing-extension-arrays

Most of the interface functionality will be tested for you, if you inherit from these base test classes, and provide a few fixtures.

I'll take a closer look at the implementation tomorrow. Please let me know if you ran into any difficulties implementing the interface. Your feedback would be very valuable.

@jonmmease
Collaborator Author

Thanks for pointing out the testing docs, I had missed those. I’ll implement these this evening. Thanks a lot for your help!

@jonmmease
Collaborator Author

@TomAugspurger Ok, I added the test suite provided by pandas and that helped a lot. I'm not exactly sure what made the difference, but once I got the pandas test suite passing I didn't need the ._values hack to get my original tests passing.

One change that might have been the culprit was that the dtype property on my ExtensionArray subclass was returning my ExtensionDtype class, rather than an instance of the class. And the test suite helped me find that 🎉
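The class-versus-instance mix-up mentioned above can be illustrated without pandas. In this sketch `BaseDtype` is a hypothetical stand-in for `ExtensionDtype`, and the other names are made up for the example. Whether this was the actual cause of the `infer_dtype` failure is not confirmed in the thread, but it would explain why the earlier outputs print `dtype: <class 'datashader.datatypes.RaggedDtype'>`.

```python
class BaseDtype:
    """Hypothetical stand-in for pandas' ExtensionDtype."""
    name = "base"


class RaggedDtypeSketch(BaseDtype):
    name = "ragged"


class BuggyArray:
    @property
    def dtype(self):
        return RaggedDtypeSketch       # bug: returns the class itself


class FixedArray:
    @property
    def dtype(self):
        return RaggedDtypeSketch()     # returns an instance of the dtype


def is_dtype_instance(obj):
    """The kind of check library code may apply to an array's dtype."""
    return isinstance(obj, BaseDtype)


buggy_ok = is_dtype_instance(BuggyArray().dtype)
fixed_ok = is_dtype_instance(FixedArray().dtype)
```

Here `buggy_ok` is `False` and `fixed_ok` is `True`, because a class is not an instance of its own base class.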

I would definitely appreciate any other comments you have on the extension classes, but I think I've made it through what I was stuck on. Thanks again!

@TomAugspurger

Just to be clear, there are a whole bunch of tests you can inherit, not just BaseDtypeTests. See e.g. https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/decimal/test_decimal.py

@TomAugspurger TomAugspurger left a comment

Looks good overall. I'm curious to see what the other tests turn up. Some may fail through no fault of your own, as the base tests will often create an object-dtype ndarray for the "expected" result, and NumPy / pandas will likely fail to handle the shape correctly.

The EA docs recommend overriding a few methods for performance:

* fillna
* dropna
* unique
* factorize / _values_for_factorize
* argsort / _values_for_argsort
* searchsorted

@TomAugspurger

It'll be good to add this to the ecosystem page at http://pandas.pydata.org/pandas-docs/version/0.24/ecosystem.html#extension-data-types once this is ready.

@TomAugspurger

FYI, I'll probably ask you to be my guinea pig on dask/dask#4379 (adding ExtensionArray support to dask.dataframe) once that's ready :)

@jonmmease
Collaborator Author

@TomAugspurger Ok, cool. Thanks for pointing out the additional test cases that are available. I'll try them out and report back. I'll also look over the performance optimization overrides and see what makes sense there. I'd also like to work out a way to avoid converting everything to tuples for factorization/sorting.

Regarding testing dask/dask#4379, definitely! I was actually just starting to look at what it would take to get this working in Dask, and it didn't seem straightforward 🙂 I'd be happy to try this out when you say the word.

@TomAugspurger

TomAugspurger commented Jan 14, 2019

I'd also like to work out a way to avoid converting everything to tuples for factorization/sorting.

Yes, I'll try to think of something here as well. Does Datashader have a hard dependency on numba? That may be helpful here.

@jonmmease
Collaborator Author

Yes, datashader has a hard dependency on numba.

One approach I was considering was to have _values_for_factorize return a list of instances of a lightweight class that holds a reference to the flat_array, a start index, and a length. These classes would assume that flat_array is immutable during their lifetime and would provide __hash__, __lt__, etc.
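A rough sketch of what such a lightweight class might look like, assuming lexicographic ordering and element-wise comparison so that no tuple has to be built for equality or sorting. The class name `RaggedElementView` and its layout are hypothetical; nothing here is taken from the PR's code.

```python
from functools import total_ordering


@total_ordering
class RaggedElementView:
    """View onto one element of a shared flat array.

    Assumes flat_array is not mutated while the view is alive.
    """
    __slots__ = ("flat_array", "start", "length")

    def __init__(self, flat_array, start, length):
        self.flat_array = flat_array
        self.start = start
        self.length = length

    def __iter__(self):
        for i in range(self.start, self.start + self.length):
            yield self.flat_array[i]

    def __eq__(self, other):
        if not isinstance(other, RaggedElementView):
            return NotImplemented
        return self.length == other.length and all(
            a == b for a, b in zip(self, other))

    def __lt__(self, other):
        if not isinstance(other, RaggedElementView):
            return NotImplemented
        for a, b in zip(self, other):
            if a != b:
                return a < b               # first differing value decides
        return self.length < other.length  # a shorter prefix sorts first

    def __hash__(self):
        h = hash(self.length)
        for v in self:
            h = hash((h, v))               # fold each value into the hash
        return h


flat = [1.0, 2.0, 10.0, 20.0, 1.0, 2.0]
a = RaggedElementView(flat, 0, 2)
b = RaggedElementView(flat, 2, 2)
c = RaggedElementView(flat, 4, 2)
```

In this example `a` and `c` compare equal and hash identically even though they point at different positions in the flat array, which is what factorization would need.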

@jbednar
Member

jbednar commented Jan 14, 2019

Not sure about sorting either. How do you define it? Lexicographic?

I can't think of any meaningful way to sort a ragged array column; seems like they will generally be sorted based on other columns (e.g. by a name column). So whatever is most convenient to support for sorting seems fine to me; it doesn't seem like something that would get used often.

from start_indices, flat_array, and mask arrays
@jbednar
Member

jbednar commented Jan 14, 2019

Looks good to me! A few questions:

  • Can it accept the NaN-separated data currently used for variable-length data in Datashader as an input?
  • Is the flattened representation usable directly like the current NaN-separated data, or does that code need updating now?
  • Does the mask array really need to be a separate data structure, or could masking be done with a special index value to avoid the separate array and handling of it? I can't immediately see how to support missing values without the mask, but it just seems like there should be a way to avoid it.
  • How easy will it be to have this code graduate out of Datashader and into a standalone project later, once we are happy with it? Ragged array support seems useful for a much wider audience than just Datashader, e.g. for any GeoPandas user.

@jonmmease
Collaborator Author

Can it accept the NaN-separated data currently used for variable-length data in Datashader as an input?

Not at the moment, but I don't think it would be too hard to add an alternative RaggedArray.from_nan_separated static factory method, probably with something like a consecutive_nans property that would control whether back-to-back NaNs should be interpreted as None, as [], or as a single NaN.
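To make the idea concrete, here is one way such a conversion could behave. The function below is a hypothetical free-standing sketch (no such method exists in this PR), and the three option values `"empty"`, `"missing"`, and `"collapse"` are names invented for the illustration.

```python
import math


def from_nan_separated(values, consecutive_nans="empty"):
    """Split a NaN-separated sequence into a list of ragged elements.

    consecutive_nans controls what a NaN with nothing before it produces:
    "empty" -> [], "missing" -> None, "collapse" -> ignored (one separator).
    """
    if consecutive_nans not in ("empty", "missing", "collapse"):
        raise ValueError("unknown consecutive_nans option: %r" % consecutive_nans)
    rows, current = [], []
    for v in values:
        if math.isnan(v):
            if current:                       # NaN closes the current element
                rows.append(current)
                current = []
            elif consecutive_nans == "empty":
                rows.append([])
            elif consecutive_nans == "missing":
                rows.append(None)
            # "collapse": the extra separator is dropped
        else:
            current.append(v)
    if current:                               # trailing element without a NaN
        rows.append(current)
    return rows


nan = float("nan")
data = [1.0, 2.0, nan, nan, 10.0, 20.0, nan, 11.0]
```

The resulting list of rows could then be handed to the RaggedArray constructor shown in the example usage.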

Is the flattened representation usable directly like the current NaN-separated data, or does that code need updating now?

The code will need to be updated at least a little bit since this flattened representation won't have the nans in it. The general flow of scanning all the way down a pair of long arrays will be the same, we'll just need an alternative check for when to break the line.
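The "alternative check" could look roughly like the following: instead of breaking the line whenever a NaN is seen, the scan breaks at each boundary recorded in start_indices. This is a simplified sketch, not Datashader's rendering code. It assumes the x and y flat arrays share a single start_indices array, and `iter_segments` is a hypothetical name.

```python
def iter_segments(xs, ys, start_indices):
    """Yield (x0, y0, x1, y1) segments, breaking the line at element boundaries."""
    n = len(xs)
    for row, start in enumerate(start_indices):
        stop = start_indices[row + 1] if row + 1 < len(start_indices) else n
        # Pair up consecutive points inside one ragged element only.
        for i in range(start, stop - 1):
            yield xs[i], ys[i], xs[i + 1], ys[i + 1]


xs = [1.0, 1.5, 2.0, 2.5, 3.0]
ys = [10.0, 12.0, 11.0, 14.0, 13.0]
segments = list(iter_segments(xs, ys, [0, 2]))
```

No segment is produced between the last point of one row and the first point of the next.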

Does the mask array really need to be a separate data structure, or could masking be done with a special index value to avoid the separate array and handling of it?

Here are two approaches I considered that don't include a separate mask.

  1. Use a special sentinel value in the index array to represent missing. Since I'm using unsigned integers for the start_indices array, this sentinel value could be the max value for the unsigned integer type. This would work; the one downside is that looking up ragged element slices from the flattened array becomes more complicated. With this approach, it's no longer possible to determine the end index of a slice by looking at the next element of start_indices, since that element might be the sentinel value. Looking up a slice would involve iterating forward along the start_indices array until finding a non-sentinel value. It's probably not likely, but the worst case for this scenario would essentially turn the single linear scan of the flattened array into an N^2 scan.

  2. Rather than a sentinel value, represent missing values as the start_index of the next element times -1. This would get rid of the worst-case described above. But it would require changing the type of the start_indices array from an unsigned to a signed integer, doubling the storage requirement of the start_indices array.

So I was concerned about hitting the worst case in scenario (1), and scenario (2) felt like a hack that would generally be less space efficient than adding a 1-byte mask array.
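The two mask-free encodings can be compared with a small sketch of the slice lookup each one requires. Both functions are hypothetical illustrations using plain lists, with 255 standing in for the uint8 maximum as the sentinel.

```python
SENTINEL = 255  # max value of uint8, standing in for "missing"


def element_bounds_sentinel(start_indices, flat_len, i):
    """Approach 1: return (start, stop) for element i, or None if missing."""
    start = start_indices[i]
    if start == SENTINEL:
        return None
    # The next entry may itself be the sentinel, so scan forward past them.
    j = i + 1
    while j < len(start_indices) and start_indices[j] == SENTINEL:
        j += 1
    stop = start_indices[j] if j < len(start_indices) else flat_len
    return start, stop


def element_bounds_negated(start_indices, flat_len, i):
    """Approach 2: a missing element stores -(start index of the next element).

    Caveat of this encoding: a stored 0 cannot be negated, so a missing
    element whose successor starts at 0 would be indistinguishable from an
    empty one.
    """
    start = start_indices[i]
    if start < 0:
        return None
    stop = abs(start_indices[i + 1]) if i + 1 < len(start_indices) else flat_len
    return start, stop


# Encodings of [[1, 2], [], [10, 20], None, [11, 22, 33, 44]] (flat length 8).
sentinel_starts = [0, 2, 2, SENTINEL, 4]
negated_starts = [0, 2, 2, -4, 4]
```

The forward scan in the first function is where the worst case comes from: a long run of missing elements makes each lookup before the run walk the whole run.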

How easy will it be to have this code graduate out of Datashader and into a standalone project later, once we are happy with it? Ragged array support seems useful for a much wider audience than just Datashader, e.g. for any GeoPandas user.

So far it would be very easy. We just need to avoid introducing dependencies back on the rest of the Datashader project.

@jbednar
Member

jbednar commented Jan 14, 2019

The code will need to be updated at least a little bit since this flattened representation won't have the nans in it.

What are the pros and cons of including the nans in it?

For the mask, would it not be sufficient to have a zero-length array (start index same as the following start index) to indicate a missing element? Or are you concerned that it's important to be able to distinguish between "missing ragged item" and "empty ragged item"?

@jonmmease
Collaborator Author

What are the pros and cons of including the nans in it?

  • Pro: Datashader could use it without modification 🙂
  • Con: I think the only con is the storage of one extra scalar per ragged element (even with the NaNs there, I think we would still need something like start_indices to efficiently locate the start of a particular element). There's also the question of what to do with integers, although perhaps Datashader doesn't need these.

For the mask, would it not be sufficient to have a zero-length array to indicate a missing element? Or are you concerned that it's important to be able to distinguish between "missing ragged item" and "empty ragged item"?

Yeah, I was assuming the need to distinguish between empty and missing. Though I suppose Datashader doesn't really need this distinction for lines/polygons. Without this distinction, you're right that start_indices could encode missing/empty without a mask.

@jonmmease
Collaborator Author

RaggedArray line rendering added!

import pandas as pd
import numpy as np
from datashader import Canvas
import datashader.transfer_functions as tf
import datashader as ds

df_ragged = pd.DataFrame({
   'A1': pd.array([
       [1, 1.5], [2, 2.5, 3], [1.5, 2, 3, 4], [3.2, 4, 5]],
       dtype='Ragged[float32]'),
   'B1': pd.array([
       [10, 12], [11, 14, 13], [10, 7, 9, 10], [7, 8, 12]],
       dtype='Ragged[float32]'),
   'group': pd.Categorical([0, 1, 2, 1])
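})
cvs = Canvas()  # assumption: default canvas; `cvs` is used below but was not defined in the original snippet
df_ragged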

[Image: screenshot of the DataFrame display]

Set axis=1 to aggregate one variable length line per row

agg = cvs.line(df_ragged, x='A1', y='B1', axis=1)
tf.spread(tf.shade(agg))

[Image: rendered line plot]

Now use count_cat aggregation on the group column to color the variable length line segments

agg = cvs.line(df_ragged, x='A1', y='B1', agg=ds.count_cat('group'), axis=1)
tf.spread(tf.shade(agg))

[Image: rendered line plot colored by group]

In terms of documentation, the two examples above have been added to the Canvas.line docstring, but I have not made any updates to the documentation notebooks.

@jbednar Pending CI tests, this is ready for review.

cc @TomAugspurger

@jonmmease jonmmease changed the title [WIP] Add pandas ExtensionArray for storing homogeneous ragged arrays Add pandas ExtensionArray for storing homogeneous ragged arrays Feb 19, 2019
No reason to skip every combination, and this was causing pytest-xdist
to throw an internal error when running tests in parallel
@jonmmease
Collaborator Author

Added optimized Dask auto-range calculation logic consistent with #717

Member

@jbednar jbednar left a comment

Looks great! Presumably needs rebasing and merge resolution before merging, and I have some minor comments, but I'm happy to merge it after that.

Newly introduced missing values are filled with
``self.dtype.na_value``.

.. versionadded:: 0.24.0
Member

0.24.0 is a pandas version, not datashader, right? Presumably not supposed to be listed here or elsewhere below?

Collaborator Author

Removed in 1538909

Returns
-------
uniques : ExtensionArray
"""
Member

I would prefer not to copy docstrings unmodified from the parent class, whether trivial (as here) or complicated (below). Basically, if the parent class defines the semantics, I want the reader to refer to the parent class, not to this possibly outdated copy of the docstring; that way people know to go find it in the parent, rather than thinking this actually covers everything. Conversely, if there is a docstring here, I think it should be customized to just be about RaggedArray.

Collaborator Author

@jonmmease jonmmease Feb 28, 2019

Updated in 1538909

Returns
-------
filled : ExtensionArray with NA/NaN filled
"""
Member

This docstring seems just copied from the parent class, but if there are differences in behavior from ExtensionArray, please describe those here and refer to the parent class for anything else.

Collaborator Author

Updated in 1538909

@@ -123,17 +123,26 @@ class RaggedDtype(ExtensionDtype):

@property
def name(self):
"""
See docstring for ExtensionDtype.name
"""
Member

@jbednar jbednar Feb 28, 2019

With most docs/API tools, docstrings will simply be inherited as-is if you don't specify one here, so please remove these altogether unless they need to say something explicitly about how this method relates to that of the parent class. You can mention the parent class explicitly in the class docstring, once, with something like "Methods not otherwise documented here are inherited from ExtensionDtype; please see the corresponding method on that class for the docstring".
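As one concrete instance of this behavior, the standard library's `inspect.getdoc` falls back to the parent class's docstring when an overriding method has none. The classes below are hypothetical and only illustrate the lookup; other documentation tools may handle inheritance differently.

```python
import inspect


class Parent:
    def unique(self):
        """Compute the unique values."""
        return []


class Child(Parent):
    def unique(self):   # override without repeating the docstring
        return []


own_doc = Child.unique.__doc__                   # None: nothing written here
inherited_doc = inspect.getdoc(Child().unique)   # found on Parent.unique
```

So removing a copied docstring from the subclass does not leave the method undocumented for tools that resolve docstrings this way.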

@jbednar
Member

jbednar commented Feb 28, 2019

Looks good; thanks! Happy to merge once the merge conflict is addressed and the docstring stubs are removed as indicated above.

@jonmmease
Collaborator Author

Thanks for the review, should be done now

@jbednar jbednar merged commit 3171d88 into master Mar 1, 2019
@jonmmease jonmmease mentioned this pull request Aug 21, 2019
@jonmmease jonmmease mentioned this pull request Oct 23, 2019
@maximlt maximlt deleted the enh_ragged branch December 25, 2021 17:21
3 participants