Add a SortedVector for keyed axis indexes and hierarchical indexing #16

tshort · 2015-03-05T23:18:15Z

Add a SortedVector type that assumes sorting. Also, change the Dimensional trait to allow duplicates and add indexing by value. A SortedVector{Tuple} allows hierarchical indexing similar to pandas or data.table. SortedVector hits one of the roadmap (issue #7) bullets.

This also changes axis indexing in two ways:

Duplicates are allowed. This also helps with Sims.jl because the timescale can have repeated values before and after events.
Dimensional axes can now be indexed by value. If there are duplicates, all indexes that match are returned.
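
A minimal sketch of what "all indexes that match are returned" can look like on a sorted axis. This is not the PR's implementation: `indexes_of` is a hypothetical helper, and the sketch assumes current Julia 1.x and that a binary search via `Base.searchsorted` is an acceptable stand-in for the lookup.

```julia
# Hypothetical helper (not from the PR): return the range of positions in a
# sorted vector whose elements equal `x`. `searchsorted` does a binary search
# and returns an empty range when nothing matches.
indexes_of(v::AbstractVector, x) = searchsorted(v, x)

v = [1.0, 10.0, 10.0, 11.0, 12.0]   # sorted, with a duplicate 10.0

dup  = indexes_of(v, 10.0)   # 2:3, both duplicates are returned
one  = indexes_of(v, 1.0)    # 1:1
none = indexes_of(v, 5.0)    # an empty range, since 5.0 is not on the axis
```

Because the vector is assumed to be sorted, the lookup never has to scan the whole axis, which is the property the SortedVector type relies on.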

Here are examples of SortedVector and hierarchical indexing:

v = SortedVector(collect([1., 10., 10:15.]))
A = AxisArray(reshape(1:16, 8, 2), v, [:a, :b])
A[Interval(8.,12.), :]
A[1., :]
A[10., :]

## Hierarchical index example with three key levels

data = reshape(1.:40., 20, 2)
v = collect(zip([:a, :b, :c][rand(1:3,20)], [:x,:y][rand(1:2,20)], [:x,:y][rand(1:2,20)]))
idx = sortperm(v)
A = AxisArray(data[idx,:], SortedVector(v[idx]), [:a, :b])
A[:b, :]
A[[:a,:c], :]
A[(:a,:x), :]
A[(:a,:x,:x), :]
A[Interval(:a,:b), :]
A[Interval((:a,:x),(:b,:x)), :]

Using an Array of Tuples might not be the most efficient storage format for this, but it will allow us to play around with this feature. For analysis of time series data in R, I use both zoo (it's like AxisArrays) and data.tables. The zoo package provides nice plotting, so I tend to use that for more structured data, and I use data.tables when I need more flexibility. My motivation for this PR is to have one package/data structure that can be used for both approaches.

milktrader · 2015-03-06T13:16:12Z

Interesting comparison between R's zoo and AxisArrays. For financial time series, the xts package extends the zoo package and is the go-to data structure for anything requiring speed.

An interesting constellation of time-related data structures is beginning to evolve here in Julia and I'm curious how they all fit together.

I'm presently experimenting with Timestamps. I just wrote this yesterday after it became clear that adding and removing data inside a blotter (financial accounting object) requires more flexibility than TimeArrays offers.

Since TimeArrays are immutable, a new one needs to be generated every time a change is made, and when changes happen frequently the overhead accumulates quickly.

Timestamps also allows for a pseudo-heterogeneous data structure, where value components can be of different types. Duplicate dates are also allowed.

mbauman · 2015-03-07T23:18:11Z

Very nice. I think you've convinced me. Indexing dimensional arrays by single values does seem to make sense. The sorted vector is the perfect solution for the slow checkaxis.

I'll need to look closer at the hierarchical indexing changes. How is this different from adding another dimension?

mbauman · 2015-03-07T23:18:48Z

src/core.jl

@@ -291,7 +291,7 @@ checkaxis(ax) = checkaxis(axistrait(ax), ax)
 checkaxis(::Type{Unsupported}, ax) = nothing # TODO: warn or error?
 # Dimensional axes must be monotonically increasing
 checkaxis{T}(::Type{Dimensional}, ax::Range{T}) = step(ax) > zero(T) || error("Dimensional axes must be monotonically increasing")
-checkaxis(::Type{Dimensional}, ax) = issorted(ax, lt=(<=)) || error("Dimensional axes must be monotonically increasing")
+checkaxis(::Type{Dimensional}, ax) = issorted(ax) || error("Dimensional axes must be monotonically increasing")


s/monotonically increasing/sorted in increasing order/

According to Wikipedia, "monotonically increasing" is correct here. The prior implementation checked for "strictly increasing".

Whoops, I had those backwards in my head. Sorry about that.
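
The difference between the two checks in the diff above can be seen on a small axis with a duplicate. This sketch assumes current Julia 1.x; the names `strict` and `nondecreasing` are illustrative only.

```julia
# Old check from the diff: `lt = (<=)` rejects equal neighbours,
# so the axis must be strictly increasing.
strict(ax) = issorted(ax, lt = (<=))

# New check from the diff: plain `issorted` accepts equal neighbours,
# so duplicates are allowed.
nondecreasing(ax) = issorted(ax)

ax = [1.0, 10.0, 10.0, 11.0]

old_result = strict(ax)          # false, the duplicate 10.0 fails the old check
new_result = nondecreasing(ax)   # true, the new check allows it
no_dups    = strict([1.0, 2.0, 3.0])   # true, no duplicates present
```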

mbauman · 2015-03-07T23:21:30Z

If indexing by a missing axis value returns [], then @milktrader could index his date axis with a step range without needing to worry about the missing weekend dates.
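
A sketch of that idea, assuming current Julia 1.x with the `Dates` standard library. `lookup` is a hypothetical helper, not part of AxisArrays: it gathers the matches for each requested date, and dates that are absent from the axis simply contribute nothing.

```julia
using Dates

# Trading days only: Saturday 2015-03-07 and Sunday 2015-03-08 are absent.
days = [Date(2015, 3, 5), Date(2015, 3, 6), Date(2015, 3, 9), Date(2015, 3, 10)]

# Hypothetical lookup: concatenate the matching positions for every requested
# date. A missing date yields an empty range and therefore no positions.
lookup(axis, wanted) =
    reduce(vcat, [collect(searchsorted(axis, d)) for d in wanted]; init = Int[])

# A daily step range that spans the weekend.
hits = lookup(days, Date(2015, 3, 6):Day(1):Date(2015, 3, 9))   # [2, 3]
```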

tshort · 2015-03-07T23:56:16Z

A hierarchical index is like adding additional dimensions. It's a "long" format versus the "wide" format provided by adding another axis. It makes an AxisArray act a bit like a DataFrame. It's handy for cases where you have data with duplicate values and/or missing times or features. You don't need NA's for the missing value case because those rows just aren't there. If you have many classifier variables with duplicates, a table view is easier to deal with. For the missing data case, it'd be interesting to try sparse arrays with normal axes and compare the usability and performance to a non-sparse array with a hierarchical index.
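
To make the "long" format concrete, here is a small sketch in current Julia 1.x. The names `rowkeys`, `vals`, `matches`, and `select` are all hypothetical and the linear scan is only for illustration; it is not how the PR performs the lookup. It assumes the requested prefix is never longer than the stored keys.

```julia
# Long format: one row per observed key combination. There is no (:b, :x)
# row at all, so no NA placeholder is needed for it.
rowkeys = [(:a, :x), (:a, :y), (:b, :y)]
vals    = [1.0, 2.0, 3.0]

# A key matches when its leading levels equal the requested prefix.
matches(key::Tuple, prefix::Tuple) = key[1:length(prefix)] == prefix

# Select the values of every row whose key starts with `prefix`.
select(prefix::Tuple) = vals[findall(k -> matches(k, prefix), rowkeys)]

first_level = select((:a,))       # [1.0, 2.0]
both_levels = select((:b, :y))    # [3.0]
absent      = select((:b, :x))    # empty, that row does not exist
```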

mbauman · 2015-03-08T00:09:32Z

Sweet. Let's give it a shot. Merge at will!


mbauman · 2015-03-09T16:52:31Z

I've been thinking some more about this hierarchical indexing. It's a very interesting way to get Ragged arrays with axes to just work. If we were to punt to an actual RaggedArray implementation for the array storage, we'd need to have some way of having separate axis vectors for every column (or row, etc) that is ragged. Which is just a mess. This, on the other hand, is pretty elegant.

I think we'll want to move away from using tuples for this, though, and use a fully custom type. I think it'd be nice for them to work almost like full-featured indices. For example, instead of an Interval((:a,:x), (:b, :x)), the intervals should go on the inside: (Interval(:a,:b), :x). It'd also be nice to be able to use : instead of omitting the tuple dimension, too - (:, :x). Finally, tuples don't (currently) pack nicely into arrays, so a custom immutable should be faster, too.

The downside is that it becomes pretty verbose: A[Hierarchical(Interval(1.0,2.0), :), :b]. Is there a better name we could use? Maybe just typealias H Hierarchical?

On the implementation side of things, I think we could only support up to three or four nested dimensions, with an Abstract Hierarchical and concrete immutable Hierarchical2{T1,T2}; a::T1; b::T2; end, and so on. The abstract constructor would dispatch to the concrete type with the correct arity.
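
A rough sketch of that layout, written in current Julia 1.x syntax (`abstract type` and `struct`, rather than the `abstract`, `immutable`, and `typealias` keywords available when this was written). Only the names `Hierarchical`, `Hierarchical2`, and `H` come from the comment above; `Hierarchical3` and the field names beyond `a` and `b` are assumptions, and no indexing behavior is implemented here.

```julia
abstract type Hierarchical end

# One concrete immutable type per arity.
struct Hierarchical2{T1,T2} <: Hierarchical
    a::T1
    b::T2
end

struct Hierarchical3{T1,T2,T3} <: Hierarchical
    a::T1
    b::T2
    c::T3
end

# The abstract "constructor" dispatches on the number of arguments.
Hierarchical(a, b) = Hierarchical2(a, b)
Hierarchical(a, b, c) = Hierarchical3(a, b, c)

# Short alias to reduce verbosity at the call site.
const H = Hierarchical

h2 = H(:a, :x)          # a Hierarchical2{Symbol,Symbol}
h3 = H(1.0, :, :b)      # a Hierarchical3{Float64,Colon,Symbol}
```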

The other point that I've been thinking about is that this just doesn't quite feel consistent enough. Is there any way that we could make these hierarchical pseudo-dimensions actually behave like first-class top-level dimensions? Otherwise, we're going to be re-implementing indexing semantics within the Hierarchical types, and it's a bit of an extra hoop for users to jump through to use the custom type.

…

We can play with all of this on the master branch. This is a wonderful direction to head in. Thanks again, Tom!

tshort · 2015-03-09T17:25:27Z

I just force pushed a rebase. Hopefully, I didn't mess things up in doing that.

@mbauman, great points! I particularly like: "Is there any way that we could make these hierarchical pseudo-dimensions actually behave like first-class top-level dimensions?" No great ideas here, but we should think about it. Generality would be nice. For example, if one had a hierarchical index with dates and stocks, sometimes you might want to look up by stock, and sometimes you might want to look up by date range. With the existing approach, you could have two objects, both of which point to the same data, but each with a different hierarchical index. To have this functionality built into one axis that looks like two axes, we'd need secondary keys or database-style indexing using B-trees or bitmap indexes.
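
The "two objects over the same data" idea can be sketched with two sort permutations, assuming current Julia 1.x. Everything here is illustrative: the `(day, stock)` keys use integers for days to stay short, and `by_day` and `by_stock` are hypothetical names.

```julia
data    = [10.0, 20.0, 30.0, 40.0]
rowkeys = [(1, :ibm), (1, :aapl), (2, :ibm), (2, :aapl)]   # (day, stock)

# Two orderings of the same rows: day-major and stock-major.
by_day   = sortperm(rowkeys)                 # sorts on (day, stock)
by_stock = sortperm(rowkeys, by = reverse)   # sorts on (stock, day)

# Views share the underlying data; only the ordering differs.
day_major   = view(data, by_day)
stock_major = view(data, by_stock)

stock_keys = rowkeys[by_stock]   # keys in stock-major order
```

Each permutation plays the role of a separate hierarchical index. Keeping them in sync when rows are added is the cost that secondary keys or database-style indexes would have to address.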

I definitely agree with moving away from tuples. I'm not sure I like turning Interval inside out. It's nice to be able to select one or more values with something like A[:z, :] or A[[:a; Interval((:b,:x), (:b,:z))], :]. (I'm not sure how those would be expressed with the H type.)

tshort · 2015-03-09T17:26:53Z

The rebase seems to be good, so I'm pulling the trigger...


tshort · 2015-03-09T20:25:22Z

@mbauman, I'm coming around to your idea of moving Interval into the "inside". I think we could still retain something like the indexing I mentioned above. One option for this type is to store it like a DataFrame, where each column is a vector. When it's indexed with a single integer, we could return a tuple. That would make it easy to convert back and forth between a DataFrame.
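
A minimal sketch of that storage option in current Julia 1.x. `ColumnKeys` is a hypothetical name, not a type from the PR or from DataFrames; it only shows the column-per-level layout and the tuple returned by integer indexing.

```julia
# Hypothetical key store: one vector per key level, all of equal length.
struct ColumnKeys{C<:Tuple}
    columns::C
end

Base.length(k::ColumnKeys) = length(first(k.columns))

# Indexing with a single integer gathers that row from every column.
# Mapping over a tuple of columns returns a tuple.
Base.getindex(k::ColumnKeys, i::Integer) = map(col -> col[i], k.columns)

k = ColumnKeys(([:a, :a, :b], [:x, :y, :x]))

row = k[2]        # (:a, :y)
n   = length(k)   # 3
```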

mbauman reviewed Mar 7, 2015

Add a SortedVector type that assumes sorting. Also, change Dimensional to allow duplicates and add indexing by value. A SortedVector{Tuple} allows hierarchical indexing.

5f895d4

tshort force-pushed the sortedvector branch from 7c8e2a9 to 5f895d4 Compare March 9, 2015 16:57

tshort added a commit that referenced this pull request Mar 9, 2015

Merge pull request #16 from mbauman/sortedvector

fee36ba

Add a SortedVector for keyed axis indexes and hierarchical indexing

tshort merged commit fee36ba into master Mar 9, 2015

tshort deleted the sortedvector branch March 9, 2015 17:27

mbauman mentioned this pull request Sep 20, 2016

Move more logic into Axis type? #15

Open
