Add a SortedVector for keyed axis indexes and hierarchical indexing #16
Interesting comparison between R's An interesting constellation of time-related data structures is beginning to evolve here in Julia and I'm curious how they all fit together. I'm presently experimenting with Timestamps. I just wrote this yesterday after it became clear that adding and removing data inside a blotter (financial accounting object) requires more flexibility than TimeArrays offers. Since TimeArrays are immutable, a new one needs to be generated every time a change is made, and when changes happen frequently the overhead accumulates quickly. Timestamps also allows for a pseudo-heterogeneous data structure, where value components can be of different types. Duplicate dates are also allowed.
Very nice. I think you've convinced me. Indexing dimensional arrays by single values does seem to make sense. The sorted vector is the perfect solution for the slow I'll need to look closer at the hierarchical indexing changes. How is this different from adding another dimension?
@@ -291,7 +291,7 @@ checkaxis(ax) = checkaxis(axistrait(ax), ax)
 checkaxis(::Type{Unsupported}, ax) = nothing # TODO: warn or error?
 # Dimensional axes must be monotonically increasing
 checkaxis{T}(::Type{Dimensional}, ax::Range{T}) = step(ax) > zero(T) || error("Dimensional axes must be monotonically increasing")
-checkaxis(::Type{Dimensional}, ax) = issorted(ax, lt=(<=)) || error("Dimensional axes must be monotonically increasing")
+checkaxis(::Type{Dimensional}, ax) = issorted(ax) || error("Dimensional axes must be monotonically increasing")
s/monotonically increasing/sorted in increasing order/
According to Wikipedia, "monotonically increasing" is correct here. The prior implementation checked for "strictly increasing".
Whoops, I had those backwards in my head. Sorry about that.
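For anyone following the thread, here is a small sketch of the distinction being discussed. It uses only `Base.issorted`; the helper names `nondecreasing` and `strictly_increasing` are made up for illustration and are not part of the package, and the syntax is current Julia rather than the version the diff targets.

```julia
# Both checks use Base.issorted; only the comparison differs.
ax_unique = [1, 2, 3]
ax_dupes  = [1, 2, 2, 3]

# New check in the diff: non-decreasing order, so duplicates pass.
nondecreasing(ax) = issorted(ax)

# Prior check: lt=(<=) treats equal neighbors as out of order,
# so only strictly increasing axes pass.
strictly_increasing(ax) = issorted(ax, lt=(<=))

println(nondecreasing(ax_dupes))         # true
println(strictly_increasing(ax_dupes))   # false
println(strictly_increasing(ax_unique))  # true
```

In other words, dropping `lt=(<=)` is what lets a Dimensional axis hold duplicate values.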
If indexing by a missing axis value returns
A hierarchical index is like adding additional dimensions. It's a "long" format versus the "wide" format provided by adding another axis. It makes an AxisArray act a bit like a DataFrame. It's handy for cases where you have data with duplicate values and/or missing times or features. You don't need NAs for the missing value case because those rows just aren't there. If you have many classifier variables with duplicates, a table view is easier to deal with. For the missing data case, it'd be interesting to try sparse arrays with normal axes and compare the usability and performance to a non-sparse array with a hierarchical index.
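To make the "long" format concrete, here is a rough sketch using a plain sorted `Vector` of tuples and Base's binary-search functions. The data, the name `rows_for`, and the use of integers in place of dates are all assumptions for illustration; this is not the indexing code in the PR.

```julia
# Hypothetical long-format data: one row per (stock, day) observation.
# The ("MSFT", 1) observation is missing, so that row is simply absent.
index  = [("AAPL", 1), ("AAPL", 2), ("MSFT", 2)]
prices = [10.0, 11.0, 21.0]

# Tuples compare lexicographically, so the index is sorted as a whole.
@assert issorted(index)

# All rows whose first-level key equals `stock`, found by binary search
# between the smallest and largest possible second-level keys.
function rows_for(index, stock)
    lo = searchsortedfirst(index, (stock, typemin(Int)))
    hi = searchsortedlast(index, (stock, typemax(Int)))
    return lo:hi
end

aapl_rows = rows_for(index, "AAPL")          # rows 1 and 2
aapl_prices = prices[aapl_rows]

# Looking up a missing observation gives an empty range rather than an NA.
missing_rows = searchsorted(index, ("MSFT", 1))
```

A wide layout would need a 2x2 array with a placeholder for the missing cell; here the missing combination just has no row.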
Sweet. Let's give it a shot. Merge at will!
I've been thinking some more about this hierarchical indexing. It's a very interesting way to get Ragged arrays with axes to just work. If we were to punt to an actual RaggedArray implementation for the array storage, we'd need to have some way of having separate axis vectors for every column (or row, etc) that is ragged. Which is just a mess. This, on the other hand, is pretty elegant.

I think we'll want to move away from using tuples for this, though, and use a fully custom type. I think it'd be nice for them to work almost like full-featured indices. For example, instead of an The downside is that it becomes pretty verbose: On the implementation side of things, I think we could only support up to three or four nested dimensions, with an

The other point that I've been thinking about is that this just doesn't quite feel consistent enough. Is there any way that we could make these hierarchical pseudo-dimensions actually behave like first-class top-level dimensions? Otherwise, we're going to be re-implementing indexing semantics within the Hierarchical types, and it's a bit of an extra hoop for users to jump through to use the custom type. …

We can play with all of this on the master branch. This is a wonderful direction to head in. Thanks again, Tom!
I just force-pushed a rebase. Hopefully, I didn't mess things up in doing that.

@mbauman, great points! I particularly like: "Is there any way that we could make these hierarchical pseudo-dimensions actually behave like first-class top-level dimensions?" No great ideas here, but we should think about it. Generality would be nice. For example, if one had a hierarchical index with dates and stocks, sometimes you might want to look up by stock, and sometimes you might want to look up by date range. With the existing approach, you could have two objects, both of which point to the same data, but each with a different hierarchical index. To have this functionality built into one axis that looks like two axes, we'd need secondary keys or database-style indexing using B-trees or bitmap indexes.

I definitely agree with moving away from tuples. I'm not sure I like turning
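As a rough sketch of the "two objects, same data" idea, assuming plain vectors rather than AxisArrays: two permutations from `sortperm` give a date-major and a stock-major ordering over one shared `prices` vector. The variable names and the integer stand-ins for dates are hypothetical.

```julia
# Hypothetical unsorted observations; integers stand in for dates.
dates  = [1, 1, 2, 2]
stocks = ["MSFT", "AAPL", "MSFT", "AAPL"]
prices = [20.0, 10.0, 21.0, 11.0]

# Two orderings of the same rows: (date, stock) and (stock, date).
by_date  = sortperm(collect(zip(dates, stocks)))
by_stock = sortperm(collect(zip(stocks, dates)))

# Views share storage with `prices`; nothing is copied.
prices_by_date  = view(prices, by_date)
prices_by_stock = view(prices, by_stock)

# A write through one view is visible through the other.
prices_by_stock[1] = 10.5        # updates ("AAPL", 1)
println(prices_by_date[1])       # 10.5, since (1, "AAPL") sorts first by date
```

Each permutation plays the role of one hierarchical index. A single axis that supports both lookups would have to maintain something like these secondary orderings internally.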
The rebase seems to be good, so I'm pulling the trigger...
@mbauman, I'm coming around to your idea of moving
Add a SortedVector type that assumes sorting. Also, change the Dimensional trait to allow duplicates and add indexing by value. A SortedVector{Tuple} allows hierarchical indexing similar to pandas or data.table. SortedVector hits one of the roadmap (issue #7) bullets.
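As a minimal sketch of the general idea (a vector that is known to be sorted, so lookup by value can use binary search even with duplicates), something like the following works in current Julia. `SortedKeys` and `lookup` are hypothetical names chosen for illustration; this is not the `SortedVector` definition from the PR, and unlike a type that merely assumes sorting, this version verifies it on construction.

```julia
# Hypothetical illustration type; not the SortedVector defined in the PR.
struct SortedKeys{T}
    data::Vector{T}
    function SortedKeys(data::Vector{T}) where {T}
        # Non-decreasing order is enough, so duplicate keys are allowed.
        issorted(data) || error("keys must be sorted in increasing order")
        return new{T}(data)
    end
end

# Positions of every entry equal to `key`, found by binary search.
lookup(s::SortedKeys, key) = searchsorted(s.data, key)

ks = SortedKeys([1, 3, 3, 7])
hits = lookup(ks, 3)     # both positions holding the duplicate key 3
none = lookup(ks, 5)     # empty range: 5 is not present
```

Returning a range keeps duplicates and missing keys in one code path: duplicates give a range longer than one, and a missing key gives an empty range.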
This also changes axis indexing in two ways:
Here are examples of SortedVector and hierarchical indexing:
Using an Array of Tuples might not be the most efficient storage format for this, but it will allow us to play around with this feature. For analysis of time series data in R, I use both zoo (it's like AxisArrays) and data.tables. The zoo package provides nice plotting, so I tend to use that for more structured data, and I use data.tables when I need more flexibility. My motivation for this PR is to have one package/data structure that can be used for both approaches.