Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move more logic into Axis type? #15

Open
mbauman opened this issue Mar 5, 2015 · 6 comments
Open

Move more logic into Axis type? #15

mbauman opened this issue Mar 5, 2015 · 6 comments

Comments

@mbauman
Copy link
Member

mbauman commented Mar 5, 2015

Now that the axis information is stored as proper Axis types, one possible way to both allow an unchecked constructor and fix #14 would be to move the sort check into the Axis constructor. We could even go further, and make Axis an abstract super type of all these traits, with the Axis constructor returning the appropriate type. This could include, for example, a SortedDimensionalAxis type.

This would, of course, make the abstract Axis constructor type-unstable. It also moves more of the axis trait complexity into the foreground.

@mbauman
Copy link
Member Author

mbauman commented Mar 5, 2015

Another way we could simplify things by making Axis objects more capable would be to make them an <: AbstractVector over their contents. The downside here is that Axis is also used for indexing identification, which don't need to be vectors. Scalars behave differently from Vectors and Intervals can't even support getindex. Perhaps this means we're punning too much on Axis.

@mbauman
Copy link
Member Author

mbauman commented Sep 15, 2016

I still like the idea of making the Axis type do the validation of the axis vectors — like ensuring that dimensional axes are sorted. But everything beyond the first sentence of this post reads like gibberish to me now.

@mbauman mbauman mentioned this issue Sep 15, 2016
14 tasks
@timholy
Copy link
Member

timholy commented Sep 15, 2016

Yes, checking upon construction seems like a great idea. Supporting unsorted axes seems like a nightmare. issorted is pretty cheap (especially for Ranges 😉).

But that raises a question: do we allow any monotonic sequence, or only increasing? I'm fine with purely increasing.

@mbauman
Copy link
Member Author

mbauman commented Sep 20, 2016

do we allow any monotonic sequence, or only increasing? I'm fine with purely increasing.

I had initially restricted this to purely increasing, but we decided relatively early on that monotonicity was sufficient. Part of what enables this is that we error upon scalar indexing by a duplicated axis value. currently don't allow indexing by scalar axis value for dimensional axes, so there's never a question of what to do for repeated axis values. Even if we decide to eventually allow scalar indexing, we could simply define it to return the first value, which seems quite sensible to me.

@timholy
Copy link
Member

timholy commented Sep 20, 2016

Thanks for the clarification, I hadn't thought of the repeated value issue. Can you point me to some background? Esp. that might clarify what "use" repeated indices have?

I should add a clarification of my own: my real question was regarding monotonically decreasing sequences. I don't see the need to support them personally, but the basic concept of fast searching applies to any monotonic sequence, so I thought I'd ask.

@mbauman
Copy link
Member Author

mbauman commented Sep 20, 2016

I'm not sure how much of my thinking here made it onto public venues… or still remains in my head for that matter (see my edit above). Here's a bit of a brain dump… which should probably be split across multiple issues.

This comes down to how I've categorized axis types into Dimensional or Categorical. It determines two things: which indexing-by-axis-value methods are supported and how lookup is performed.

Dimensional (sorted) Categorical (unique elements)
Scalar indexing only with !(T <: Real), error on dupes yes, no dupes by construction
Interval indexing yes not implemented
Vector indexing not implemented yes
Lookup methods searchsorted, searchsortedfirst findfirst, findin (should move to hashing)

So: I think a decreasing sequence could work just as well as an increasing one (assuming it implements the search interface properly), but I've not given that case much thought. At a minimum, It'll most likely break my searchsortednearest extension to the search API.


The restriction on Scalar indexing in Dimensional {T<:Real} AxisArrays is just to prevent ambiguity between location-based and axis-value-based indexing. It'd be interesting to experiment with making UnitRange-axis AxisArrays behave like an OffsetArray, but I think it'd wreck havoc with float axes and non-UnitRange integer axes. I'm concerned that the distinction between indexing a float axis by value (2.0) or integer location (1) is too subtle for general use.

It's worth noting that Panda's indexing scheme currently allows this in some regards: if an "axis" is float or integer, then by-axis-value lookup takes priority over location-based indexing in their default x[] API… but I've always found it to be surprising, and they're thinking about changing it. They're able to get away with this to some extent because of the way python objects expose the __array__ API separately from the x[] interface, and they additionally have dedicated x.loc[] and x.iloc[] methods for by-axis-value-lookup and by-location-indexing, respectively. All in all, it's fairly confusing… especially once you start considering slicing (which sometimes acts like closed intervals across the label locations).

A better example is xarray, where they have dramatically simplified this scheme: x[] is always location, x.loc[] is always axis-based. They've additionally added x.sel() and x.isel() APIs to enable specifying the axis name as a keyword argument, which is very clever, but has an unfortunate interaction with NumPy's heisenviews and assignment.


As far as repeated axis values go, this was implemented in #16 for a use-case in Sims.jl. It's not hard to check for multiple values upon scalar lookup and error. And to be honest, I was surprised at how little impact this had on the code.

Pandas allows duplicates but is "type-unstable" here, returning a scalar or vector if there are dupes. Interestingly, xarray also allow dupes and always returns xarrays from indexing… even scalar indexing by integer locations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants