Add a SortedVector for keyed axis indexes and hierarchical indexing #16

Merged
merged 1 commit into from
Mar 9, 2015

@tshort
Collaborator

tshort commented Mar 5, 2015

Add a SortedVector type that assumes its contents are already sorted. Also, change the Dimensional trait to allow duplicates and add indexing by value. A SortedVector{Tuple} allows hierarchical indexing similar to pandas or data.table. SortedVector addresses one of the roadmap bullets (issue #7).

This also changes axis indexing in two ways:

  • Duplicates are allowed. This also helps with Sims.jl because the timescale can have repeated values before and after events.
  • Dimensional axes can now be indexed by value. If there are duplicates, all indexes that match are returned.
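
A minimal sketch of what indexing a sorted axis by value can look like when duplicates are allowed. It relies only on Base's `searchsorted`; the helper name `axis_lookup` is hypothetical and is not taken from this PR.

```julia
# Hypothetical helper (not from this PR): return the indexes of every
# element equal to `x` in an axis vector that is already sorted.
axis_lookup(axis::AbstractVector, x) = searchsorted(axis, x)

axis = [1.0, 10.0, 10.0, 11.0, 12.0]   # sorted, with a duplicate

hits = axis_lookup(axis, 10.0)   # range covering both 10.0 entries
none = axis_lookup(axis, 5.0)    # empty range: no matching value
```

Because the axis is sorted, all duplicates of a value are contiguous, so a single index range is enough to describe every match.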

Here are examples of SortedVector and hierarchical indexing:

v = SortedVector(collect([1., 10., 10:15.]))
A = AxisArray(reshape(1:16, 8, 2), v, [:a, :b])
A[Interval(8.,12.), :]
A[1., :]
A[10., :]

## Hierarchical index example with three key levels

data = reshape(1.:40., 20, 2)
v = collect(zip([:a, :b, :c][rand(1:3,20)], [:x,:y][rand(1:2,20)], [:x,:y][rand(1:2,20)]))
idx = sortperm(v)
A = AxisArray(data[idx,:], SortedVector(v[idx]), [:a, :b])
A[:b, :]
A[[:a,:c], :]
A[(:a,:x), :]
A[(:a,:x,:x), :]
A[Interval(:a,:b), :]
A[Interval((:a,:x),(:b,:x)), :]
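
One way the partial-key lookups above (such as `A[:b, :]` or `A[(:a,:x), :]`) could be resolved is a prefix search over the sorted tuples. This is only a sketch under that assumption; `prefix_lookup` and `hkeys` are hypothetical names, not the PR's implementation.

```julia
# Hypothetical helper (not from this PR): find the rows whose key starts
# with `prefix` in a lexicographically sorted vector of tuples.
function prefix_lookup(ks::AbstractVector{<:Tuple}, prefix::Tuple)
    n = length(prefix)
    # Compare only the first n levels of each key; `by` is applied to
    # both the stored keys and the search value.
    return searchsorted(ks, prefix; by = t -> t[1:n])
end

hkeys = sort([(:b, :x), (:a, :y), (:a, :x), (:c, :x), (:a, :x)])

rows_a  = prefix_lookup(hkeys, (:a,))      # every key under :a
rows_ax = prefix_lookup(hkeys, (:a, :x))   # only the (:a, :x) keys
```

Prefixes of lexicographically sorted tuples are themselves sorted, which is what makes the binary search valid at every level.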

Using an Array of Tuples might not be the most efficient storage format for this, but it will allow us to play around with this feature. For analysis of time series data in R, I use both zoo (it's like AxisArrays) and data.tables. The zoo package provides nice plotting, so I tend to use that for more structured data, and I use data.tables when I need more flexibility. My motivation for this PR is to have one package/data structure that can be used for both approaches.

@milktrader

Interesting comparison between R's zoo and AxisArrays. For financial time series, the xts package extends the zoo package and is the go-to data structure for anything requiring speed.

An interesting constellation of time-related data structures is beginning to evolve here in Julia and I'm curious how they all fit together.

I'm presently experimenting with Timestamps. I just wrote this yesterday after it became clear that adding and removing data inside a blotter (financial accounting object) requires more flexibility than TimeArrays offers.

Since TimeArrays are immutable, a new one needs to be generated every time a change is made, and when changes happen frequently the overhead accumulates quickly.

Timestamps also allows for a pseudo-heterogeneous data structure, where value components can be of different types. Duplicate dates are also allowed.

@mbauman
Member

mbauman commented Mar 7, 2015

Very nice. I think you've convinced me. Indexing dimensional arrays by single values does seem to make sense. The sorted vector is the perfect solution for the slow checkaxis.

I'll need to look closer at the hierarchical indexing changes. How is this different from adding another dimension?

@@ -291,7 +291,7 @@ checkaxis(ax) = checkaxis(axistrait(ax), ax)
 checkaxis(::Type{Unsupported}, ax) = nothing # TODO: warn or error?
 # Dimensional axes must be monotonically increasing
 checkaxis{T}(::Type{Dimensional}, ax::Range{T}) = step(ax) > zero(T) || error("Dimensional axes must be monotonically increasing")
-checkaxis(::Type{Dimensional}, ax) = issorted(ax, lt=(<=)) || error("Dimensional axes must be monotonically increasing")
+checkaxis(::Type{Dimensional}, ax) = issorted(ax) || error("Dimensional axes must be monotonically increasing")
Member

s/monotonically increasing/sorted in increasing order/

Collaborator Author

According to Wikipedia, "monotonically increasing" is correct here. The prior implementation checked for "strictly increasing".
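
The difference between the two checks can be seen with a small sketch that uses only Base's `issorted`; the variable names are illustrative.

```julia
axis = [1, 2, 2, 3]          # contains a repeated value

# Default comparison: accepts non-decreasing sequences, so duplicates pass.
allows_duplicates = issorted(axis)

# With lt=(<=), an equal neighbor counts as out of order, so only
# strictly increasing sequences are accepted.
strictly_increasing = issorted(axis, lt=(<=))
```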

Member

Whoops, I had those backwards in my head. Sorry about that.

@mbauman
Member

mbauman commented Mar 7, 2015

If indexing by a missing axis value returns [], then @milktrader could index his date axis with a step range without needing to worry about the missing weekend dates.
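
A sketch of that idea, assuming a lookup that returns an empty range for values missing from the axis. The `lookup` helper is hypothetical and is not the package API; only `Dates` and Base's `searchsorted` are real.

```julia
using Dates

# Weekday-only axis: weekend dates are simply absent.
days = Date(2015, 3, 2):Day(1):Date(2015, 3, 15)
axis = filter(d -> dayofweek(d) <= 5, collect(days))

# Hypothetical lookup (an assumption, not the package API): a missing
# value yields an empty range instead of an error.
lookup(axis, x) = searchsorted(axis, x)

# Index with a daily step range; weekend dates contribute nothing.
rows = reduce(vcat, [collect(lookup(axis, d)) for d in days])
```

Every weekday in the range resolves to one row, and the weekend dates silently drop out, so the caller never has to filter the step range first.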

@tshort
Collaborator Author

tshort commented Mar 7, 2015

A hierarchical index is like adding additional dimensions. It's a "long" format versus the "wide" format provided by adding another axis. It makes an AxisArray act a bit like a DataFrame. It's handy for cases where you have data with duplicate values and/or missing times or features. You don't need NAs for the missing value case because those rows just aren't there. If you have many classifier variables with duplicates, a table view is easier to deal with. For the missing data case, it'd be interesting to try sparse arrays with normal axes and compare the usability and performance to a non-sparse array with a hierarchical index.

@mbauman
Member

mbauman commented Mar 8, 2015

Sweet. Let's give it a shot. Merge at will!

@mbauman
Member

mbauman commented Mar 9, 2015

I've been thinking some more about this hierarchical indexing. It's a very interesting way to get Ragged arrays with axes to just work. If we were to punt to an actual RaggedArray implementation for the array storage, we'd need to have some way of having separate axis vectors for every column (or row, etc) that is ragged. Which is just a mess. This, on the other hand, is pretty elegant.

I think we'll want to move away from using tuples for this, though, and use a fully custom type. I think it'd be nice for them to work almost like full-featured indices. For example, instead of an Interval((:a,:x), (:b, :x)), the intervals should go on the inside: (Interval(:a,:b), :x). It'd also be nice to be able to use : instead of omitting the tuple dimension, too - (:, :x). Finally, tuples don't (currently) pack nicely into arrays, so a custom immutable should be faster, too.
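
To make the "intervals on the inside" idea concrete, here is a rough sketch of per-level selectors. All of the names (`ClosedRange`, `level_match`, `key_match`, `select`) are hypothetical stand-ins, and the linear scan ignores sortedness for the sake of brevity.

```julia
# Hypothetical types and helpers for illustration only; none of these
# names come from the package.
struct ClosedRange{T}
    lo::T
    hi::T
end

# One matcher per kind of per-level selector.
level_match(sel::Colon, x) = true        # `:` keeps everything at this level
level_match(sel::ClosedRange, x) = !isless(x, sel.lo) && !isless(sel.hi, x)
level_match(sel, x) = isequal(sel, x)    # plain value: exact match

# A key matches when every level matches its selector.
key_match(sels::Tuple, key::Tuple) = all(map(level_match, sels, key))

# Linear scan over the keys; returns the matching row indexes.
select(hkeys, sels::Tuple) = findall(k -> key_match(sels, k), hkeys)

hkeys = [(:a, :x), (:a, :y), (:b, :x), (:c, :x)]

rows_ab_x  = select(hkeys, (ClosedRange(:a, :b), :x))  # like (Interval(:a,:b), :x)
rows_any_x = select(hkeys, (:, :x))                    # like (:, :x)
```

Dispatching on the selector type keeps each level independent, which is what lets `:` and interval selectors mix freely within one key.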

The downside is that it becomes pretty verbose: A[Hierarchical(Interval(1.0,2.0), :), :b]. Is there a better name we could use? Maybe just typealias H Hierarchical?

On the implementation side of things, I think we could only support up to three or four nested dimensions, with an Abstract Hierarchical and concrete immutable Hierarchical2{T1,T2}; a::T1; b::T2; end, and so on. The abstract constructor would dispatch to the concrete type with the correct arity.
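
A sketch of that layout in current Julia syntax (the thread predates it and would have used `abstract` and `immutable`). The field names and the `Hierarchical3` type are assumptions extrapolated from the description.

```julia
# Abstract supertype shared by all arities.
abstract type Hierarchical end

# Concrete immutable types, one per supported number of levels.
struct Hierarchical2{T1,T2} <: Hierarchical
    a::T1
    b::T2
end

struct Hierarchical3{T1,T2,T3} <: Hierarchical
    a::T1
    b::T2
    c::T3
end

# Calling the abstract name dispatches on arity to the concrete type.
Hierarchical(a, b) = Hierarchical2(a, b)
Hierarchical(a, b, c) = Hierarchical3(a, b, c)

h2 = Hierarchical(1.0, :x)
h3 = Hierarchical(:a, :x, :y)
```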

The other point that I've been thinking about is that this just doesn't quite feel consistent enough. Is there any way that we could make these hierarchical pseudo-dimensions actually behave like first-class top-level dimensions? Otherwise, we're going to be re-implementing indexing semantics within the Hierarchical types, and it's a bit of an extra hoop for users to jump through to use the custom type.

We can play with all of this on the master branch. This is a wonderful direction to head in. Thanks again, Tom!

@tshort
Collaborator Author

tshort commented Mar 9, 2015

I just force pushed a rebase. Hopefully, I didn't mess things up in doing that.

@mbauman, great points! I particularly like: "Is there any way that we could make these hierarchical pseudo-dimensions actually behave like first-class top-level dimensions?" No great ideas here, but we should think about it. Generality would be nice. For example, if one had a hierarchical index with dates and stocks, sometimes you might want to look up by stock, and sometimes you might want to look up by date range. With the existing approach, you could have two objects, both of which point to the same data, but each with a different hierarchical index. To have this functionality built into one axis that looks like two axes, we'd need secondary keys or database-style indexing using B-trees or bitmap indexes.
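
The "two objects pointing to the same data" approach can be sketched with two sort permutations over one shared value vector. The data, the integer day numbers, and every variable name here are made up for illustration.

```julia
# Shared data: one row per (stock, day) observation. Values are made up.
obs_keys = [(:AAPL, 1), (:AAPL, 2), (:MSFT, 1), (:MSFT, 3)]
prices   = [10.0, 11.0, 20.0, 21.0]

# Two sort orders over the same rows; neither copies `prices`.
by_stock = sortperm(obs_keys)                  # (stock, day) order
by_day   = sortperm(obs_keys; by = reverse)    # (day, stock) order

# Look up by stock through the first ordering.
stock_sorted = obs_keys[by_stock]
r = searchsorted(stock_sorted, (:MSFT,); by = k -> k[1:1])
msft_prices = view(prices, by_stock[r])

# Look up the day range 1..2 through the second ordering.
day_sorted = map(k -> k[2], obs_keys[by_day])
lo = searchsortedfirst(day_sorted, 1)
hi = searchsortedlast(day_sorted, 2)
early_prices = view(prices, by_day[lo:hi])
```

Each permutation plays the role of a secondary key: it costs one integer per row and leaves the underlying values untouched.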

I definitely agree with moving away from tuples. I'm not sure I like turning Interval inside out. It's nice to be able to select one or more values with something like A[:z, :] or A[[:a; Interval((:b,:x), (:b,:z))], :]. (I'm not sure how those would be expressed with the H type.)

@tshort
Collaborator Author

tshort commented Mar 9, 2015

The rebase seems to be good, so I'm pulling the trigger...

tshort added a commit that referenced this pull request Mar 9, 2015
Add a SortedVector for keyed axis indexes and hierarchical indexing
@tshort tshort merged commit fee36ba into master Mar 9, 2015
@tshort tshort deleted the sortedvector branch March 9, 2015 17:27
@tshort
Collaborator Author

tshort commented Mar 9, 2015

@mbauman, I'm coming around to your idea of moving Interval into the "inside". I think we could still retain something like the indexing I mentioned above. One option for this type is to store it like a DataFrame, where each column is a vector. When it's indexed with a single integer, we could return a tuple. That would make it easy to convert to and from a DataFrame.
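
A minimal sketch of that column-wise storage, assuming one vector per key level. The type name `ColumnKeys` and its field are hypothetical.

```julia
# Hypothetical type for illustration; the name and layout are assumptions.
struct ColumnKeys{C<:Tuple}
    columns::C    # one vector per key level, all of the same length
end

Base.length(k::ColumnKeys) = length(first(k.columns))

# Indexing with a single integer gathers one entry from each column
# and returns them as a tuple.
Base.getindex(k::ColumnKeys, i::Integer) = map(col -> col[i], k.columns)

k = ColumnKeys(([:a, :a, :b], [:x, :y, :x]))
second_key = k[2]
```

Since each level is already a plain vector, the columns could be handed to a table type without restructuring, while callers still see tuple-shaped keys.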
