Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Replace PooledDataArray with NominalVariable/CategoricalVariable #73

Closed
simonster opened this issue Feb 8, 2014 · 66 comments
Closed

Replace PooledDataArray with NominalVariable/CategoricalVariable #73

simonster opened this issue Feb 8, 2014 · 66 comments

Comments

@simonster
Copy link
Member

In #67, @johnmyleswhite proposed replacing PooledDataArrays with OrdinalVariable/NominalVariable enums. Here are four possible approaches in order of my current preferences regarding them:

  1. Keep the concept of the PooledDataArray for efficiently storing values, but make getindex wrap extracted value as an OrdinalVariable and NominalVariable. So far, this is looking like the best approach to me. Pros: Efficient storage and avoids most of the potential performance pitfalls of the below approaches. Cons: We don't get to get rid of much code.
  2. Make OrdinalVariable and NominalVariables have a field for the array of possible values they can take on. Pros: We kill PooledDataArray. Cons: This incurs both memory and performance overhead, since the array reference means an extra 8 bytes and GC root per object. (I have no idea how much this matters. This makes garbage collection really slow.)
  3. Assign each enum an ID, keep this ID in the objects, and keep the pool in a global array. Pros: We kill PooledDataArray, and we only need an extra few bytes per object and we don't need GC roots. Cons: Pools never get garbage collected.
  4. Make OrdinalVariable and NominalVariable immutables parametrized by tuples that represent the possible values, or similarly, dynamically generate types at runtime. Pros: We kill PooledDataArray but store values equally efficiently. Cons: While this seems clever, it is really an abuse of the type system. Julia will compile specialized code for each enum, which has undesirable performance characteristics and is even worse than (3) as far as leaving around data that never gets garbage collected. The tuple approach has the additional disadvantage that type inference probably stops working if there are >8 values in the tuple.
@nalimilan
Copy link
Member

Are you aware of #50 (comment)?

@johnmyleswhite
Copy link
Member

Thanks for laying out these ideas, Simon.

I'm not clear how we handle per-variable orderings with the proposal described in (1). If I do pda[1] < pda[2], I want < to be a function defined uniquely for that specific ordinal variable. So every instance of OrdinalVariable would need to link back to its source PDA or every instance of OrdinalVariable would need to be a custom type.

How are you thinking of handling those kinds of issue?

@simonster
Copy link
Member Author

I was thinking that for (1) and (2), OrdinalVariable object would look like:

immutable OrdinalVariable{T<:Integer,S}
    index::T
    pool::Vector{S} # Or maybe a wrapper over the pool?
end

Then we'd have:

function Base.(:<)(x::OrdinalVariable, y::OrdinalVariable)
    x.pool === y.pool || error("OrdinalVariables are not comparable")
    x.index < y.index
end

@johnmyleswhite
Copy link
Member

Everything you've said, @simonster, seems completely reasonable. I agree that your first proposal would be the most straightforward way to move forward.

But I can't yet shake my sense that making everything nominal/ordinal variable into its own type is the simplest way to do things when you want to work with categorical data without using DataArrays. (This, of course, assumes that it's actually reasonable to care about ordinal factor without also caring about missingness. I think it is, but may get convinced it's not worth the effort.)

Rather than parametrize each runtime generated type with extra information, I was thinking that we would just define functions like: order(T::MyNewType). As you've already said, this function definition won't get garbage collected. For ordinal factors with >100,000 levels, this could be a problem. But I don't know how common that use case comes up in practice. When I work with categorical data, I'm typically working with a factor that has a few hundred levels at most. In that case, I'm not too worried about garbage-collection.

@simonster
Copy link
Member Author

Both (3) and (4) have the problem that the pool itself is getting leaked, but dynamically creating types has the additional problem that every function gets inferred/compiled for each type. The performance and memory characteristics depend on the number of types * the number of different functions called in addition to the size of the data. See the quick benchmark I made here. With the n_data = 1000 and n_vars = 100, the timings look like (in seconds):

option 1: 0.007096643
option 2: 0.000856045
option 3: 0.000178305
option 3.5: 7.3403e-5
option 4: 1.166516775

With n_data = 50_000_000 and n_vars = 1:

option 1: 0.755386073
option 2: 1.2290704
option 3: 0.16911997
option 3.5: 0.034059569
option 4: 0.034355416

So for some but not all cases, dynamically generating types can be far slower, since reinterpret, test_access, and getindex must be compiled for each possible type. I am not sure whether 100 enums is a good reflection of a real data set, since we would probably want to use one type for all enums with the same pool which might only mean a couple of enum types for typical cases, but this is something we should think about.

The option that I didn't consider in my original post, which I call option 3.5 above, is using an enum ID as in (3) in conjunction with a special array type as in (1) to get efficient storage but also avoid a GC root when accessing the values. If the ID is a Uint32, this easily beats all options but (4) for any choice of n_data and n_vars, beats (4) up to n_data = 50_000_000 with n_vars = 1, and is asymptotically about 1/3 slower. Moreover, unlike option (4), there is no need to dynamically generate types and so there is no overhead for compiling functions for each different pool. If we don't care about garbage collecting the pool, then this is an option to consider.

As far as categorical data without missingness with a PooledDataArray-like type, one possibility is to have a separate PooledArray type that shares most of the implementation but defines different getindex and setindex methods that don't accept NA.

@simonster
Copy link
Member Author

Those benchmarks aren't entirely fair, since some have to deal with garbage produced by preceding benchmarks and some benchmarks include copy overhead while others don't. The below timings should be fairer. With n_data = 1000; nvars = 100:

option 1: 0.000691289
option 2: 0.000917854
option 3: 0.000205124
option 3.5: 8.7721e-5
option 4: 1.307516131

With ndata = 50_000_000; nvars = 1:

option 1: 0.547448457
option 2: 1.013021837
option 3: 0.173764345
option 3.5: 0.055936094
option 4: 0.056068792

My feeling is that option 3.5 (storing raw values and the enum ID in a PooledDataArray type, and maintaining a global array of enum pools) is the way to go. Options 1 and 2 incur substantial garbage collection-related overhead for large data sets. With option 4, performance drops with the number of enums and the number of functions you plan to call on them, since each function needs to be compiled for each enum type. With option 3.5, performance depends only on the amount of data, and even if the user stores the enums in regular vectors instead of a special structure, it is not hugely inefficient (option 3 is ~3x slower).

@nalimilan
Copy link
Member

I'm not very comfortable with the idea of not being able to clear the pool from memory. Even if in most cases it shouldn't be a problem, in some cases the number of levels might be relatively high. When loading data produced by someone else (e.g. survey data), there may be many variables, some of which with many levels, and if they are marked as categorical they will waste memory which cannot be freed without restarting Julia (I need to do this with R quite often and it's painful).

Let's dream a little. In theory, it seems that the best solution would have the following properties:

  • the pool wouldn't be duplicated in each vector or element, but instead stored either via the type system (enum) or in a global pool
  • the enum/pool would be garbage-collected when no longer used anywhere
  • there would be no specialization of functions for each variable/enum, as all share a common structure and compiling several times the function is a waste of time and memory
  • indexing would be efficient (ideally, no need to create an object)
  • construction of the vector would be efficient, but that's secondary as you index a vector much more often than you create it.

Using enums and thus letting the type system handle the global pool sounds like a good solution, which would allow writing simpler code. Also, the type system is (or could be made) smart enough to avoid compiling two specialized functions for two enums which are actually Uint32 under the hood. Maybe there's a chance Julia could allow specifying that some types/enums should be garbage-collected if no object uses them anymore? Could some kind of reference counting be implemented? In any case, asking main developers wouldn't be a bad idea.

@simonster Regarding the benchmarks, I think you shouldn't build the vectors in the loops, as that's a more less common operation than indexing. Does the code in the gist correspond to your last comment?

@simonster
Copy link
Member Author

To make the Julia avoid compiling two specialized functions for two enums with the same fields, you need 1) abstract types with fields (JuliaLang/julia#4935) so that the compiler can optimize field access and 2) some way to specify that the compiler should not specialize on the subtypes (potentially related to "direct call methods specialized for non-leaf types" from JuliaLang/julia#3440, although in this case we'd have to be able to specify that on the type and not on the method). This is a little "featurey." There is also JuliaLang/julia#3080, although in that discussion it seems like people would like to be able to specialize/dispatch on the enum, which we probably do not want.

GC of types poses its own problems. Objects cannot simply hold a reference to the type without the same kind of overhead as in option 2 (which makes GC take 0.5 seconds with 50,000,000 values). I don't know if reference counting would really be much better, since we would still have to increment the reference count every time a value is retrieved from the array and decrement it when the value goes out of scope. This idea seems really featurey, since it isn't really intended that types are dynamically generated in the course of executing code in the first place.

In general, hacking this onto the type system continues to seem kind of wrong to me. The main purpose of the type system is dispatch and specialization. We don't want those things, just efficient storage.

Maybe @JeffBezanson or @StefanKarpinski have some thoughts here?

Here are the benchmarks for just array accesses. While option 2 looks better here, it is not really an option because, as noted above, with 50,000,000 data points and thus 50,000,000 references to the pool, it makes every GC take half a second. (This is not a problem with the other approaches; option 1 holds only one reference to the pool and the others hold no references.) Also the benchmarks for option 4 would obviously be worse if I were calling more than two functions that need to be specialized for each type. With n_data = 1000; nvars = 100:

option 1: 0.0007115400000000001
option 2: 0.00013821499999999994
option 3: 0.00011090599999999998
option 3.5: 0.0001345170000000001
option 4: 0.2781348129999999

With ndata = 50_000_000; nvars = 1:

option 1: 0.541569776
option 2: 0.050610905
option 3: 0.023341166
option 3.5: 0.035011226
option 4: 0.026047237

@nalimilan
Copy link
Member

Maybe using types is not the best solution after all.

Anyway the big issue with all solutions is how to keep a global pool which is not attached to any individual object, and yet could be garbage-collected when no longer needed. Indeed, for that you (or Julia) need to keep track of all objects which are using the pool, either by adapting the reference count when creating/deleting objects (and thus when indexing), or go over all existing objects at gc time.

@nalimilan
Copy link
Member

Let me add another alternative to be sure we have weighted all the advantages and drawbacks of using types to handle this.

@johnmyleswhite's comment #73 (comment) seems to point to a fifth solution, which could be more efficient (even though the garbage collection issue isn't fixed): create one type for each variable, which would only wrap a Uint32, have a order(T::MyNewType) function to get the levels, and make PDAs a simple array of these objects. The big win here is that when indexing you don't need to create objects on the fly, and you give the compiler all it needs to know to optimize operations. The handling of levels is moved to the type system.

Obviously all the issues about unwanted specialization and garbage collection do not go away in this scenario.

@simonster
Copy link
Member Author

I'm not sure if/how what you're suggesting differs from option 4 as implemented above. (I didn't implement the order method, but that's because I wasn't benchmarking it.)

One compromise between garbage collection and performance would be a variant of 3.5 where we maintain a strong reference to the pool in each PooledDataArray and a weak global reference. Then we'd avoid the overhead of maintaining references in each enum and the pool would get garbage collected after every PooledDataArray that uses it gets garbage collected. For most use cases I think this would work well, but it's certainly possible to write code where the PooledDataArray falls out of scope before the enum, which could be surprising, and resulting bugs would be nondeterministic, which is not great.

@nalimilan
Copy link
Member

Yeah, that's actually option 4), but not parametrized on a tuple of levels.

@simonster
Copy link
Member Author

@johnmyleswhite, can I bother you to get your current thinking on this?

@johnmyleswhite
Copy link
Member

Yeah, let me think about it again and get back to you. I'd really like to hear @StefanKarpinski, @JeffBezanson or @ViralBShah's opinions before we make a decision. In retrospect, I think a lot of the design of DataArrays/DataFrames was too close to R's idioms and not optimized for Julia. I'd like our decision here to be more appropriate for what Julia is going to offer when 1.0 comes out.

@johnmyleswhite
Copy link
Member

Coming back to this with fresh eyes, Option 3.5 does indeed seem best. Could we expose a mechanism for intentionally freeing the pool? This would only be used by performance buffs. Even though it's a little ugly, it does seem like the best solution.

I'm going to e-mail Jeff, Viral and Stefan offline to see if I can get them to chime in.

Once we resolve this, I'd like to also debate the remaining design issues for PDA's:

  • How do we represent the ordering? Presumably just by labeling each element of the pool with a corresponding integer.
  • How will unique/levels behave? This is Semantics of unique and levels #29.
  • Can the pool contain levels that aren't present in the data?
  • What operations on the pool are supported? And what's the interface like?

@StefanKarpinski
Copy link
Contributor

I haven't read all of the commentary on this issue (there's a lot of comments here and in the linked discussions), but I think that part of the confusion here comes from conflation of semantics and implementation. There are data that you want to represent in a DataFrame that are nominal or ordinal and you want them to behave as such. As John observed, the nominal/ordinal nature of a variable must be linked to the element type, not a property of the container – otherwise when you take the elements out of the container, they have different behavior, which is bad. The solution there seems pretty obvious: use appropriate ordinal and nominal types. This isn't really related to the data frame abstraction at all. It makes sense to have types for these kinds of things, possibly something like an enum (although the difference between ordinal and nominal, not to mention flags indicates to me that maybe we need a slightly finer gradation).

Then there's also the issue of efficient storage and operation – which should not be confused with behavior. The fact that nominal and ordinal values can often be stored and operator on particularly efficiently might be the red herring here. I'm unconvinced that they should be related. Shouldn't this be like building an index for a data frame column? The element type is still the same and it doesn't affect the behavior of its values, but it provides a different, possibly more efficient implementation?

@StefanKarpinski
Copy link
Contributor

In other words, I feel like this entire discussion assumes implicitly that nominal/ordinal values and value pooling should be linked – after all, it's often sensible to use a value pool when you have nominal/ordinal values and vice versa. But that's not necessarily the case. You could, for example, have a BigInt column that is truly integral, but because BigInts are huge, you may want to pool their storage. That data is pooled (implementation), but numerical. You might also have nominal data that you don't want to pool because the values are small and easier to simply store inline.

@simonster
Copy link
Member Author

This is basically the structure we have for PooledDataArrays right now, but a switch to enums would suggests two pools: one for the enums that contains its levels, and one for the PooledDataArrays that contains enums. In some sense this is more complicated, and I'm not convinced that the use case of storing BigInts/BigFloats in a PooledDataArray is a particularly strong one, but from a performance standpoint it might work better than the approaches I've discussed above, since we can just get the enum out of the pool instead of creating a new object on getindex, and we only need as many enums as there are levels in the PooledDataArray, so we don't have to worry about generating huge numbers of garbage-collected objects. I will try this and see how it performs.

@StefanKarpinski
Copy link
Contributor

If you're using an enum, why do you need the pool? The only cases where I can see needing the pool is a case where you have a lot of strings or something and you want to reduce storage or do faster indexing. What kind of data really benefits from pooled storage?

@simonster
Copy link
Member Author

How do we store our enums in an ordinary array? Either we potentially have many thousands of objects that each hold a reference to the enum pool, which makes GC take forever, or we have to make the enum pools un-garbage-collectable. There is an additional concern of storage efficiency, since each enum needs to hold some kind of (redundant) pool identifier in addition to the level, but I don't think that's a big deal.

@StefanKarpinski
Copy link
Contributor

I don't know what you're thinking about when you say "enum" but I'm thinking about a thin immutable types that wraps a Cint. It doesn't get any more efficient than that and there's no GC to do. Why do enums need a pool at all?

@simonster
Copy link
Member Author

Let's say you have a set of responses on a Likert scale, where each response is one of "strongly disagree," "disagree," "neutral," "agree," and "strongly agree." Those five levels are what I'm calling the enum pool, but perhaps that is a confusing name for it. We need a way to go from a given value back to the level to which it refers, or else you can't e.g. select all rows of a DataFrame where the response is "strongly agree."

@StefanKarpinski
Copy link
Contributor

I feel like this should be the same as the proposed @enum functionality in base, no? The representation is just an int. The really bare bones implementation is here. The main issue with that is that it doesn't print nicely. That could be fixed by either generating a show method that switches on the enum value and prints its name instead of enum(n).

@kmsquire
Copy link
Contributor

See JuliaLang/julia#2988 for a possible implementation.

The title is misleading, since the PR includes a non-bitmask version of @enum that's more complete. A version of just that functionality is used by Logging.jl here.

@simonster
Copy link
Member Author

This has the same problem as the other type-based approaches: functions get specialized for each enum type, which means that if you have 100 different categorical columns you are compiling a lot of functions 100 times. Also both the functions and the lists of levels stay around in memory forever. There are major differences between what we want for doing stats and what @enum is meant for: we may have hundreds of different enum types, the levels are only known at runtime, and we don't want to specialize functions on individual enum types.

@StefanKarpinski
Copy link
Contributor

Ok, these are good examples. Why not use strings or symbols? Do you have examples where pooling the storage is essential?

@nalimilan
Copy link
Member

Pooling the storage is essential in most uses; in the example of Likert scales, you don't want to repeat one million times "strongly disagree," "disagree," "neutral," "agree," and "strongly agree" as strings, as it would be terribly inefficient. And you don't want to store integers directly because it would be terribly confusing.

We seem to need a specific type of enum, which would not trigger function specialization, and which would be garbage collected when no longer used.

@StefanKarpinski
Copy link
Contributor

One option would be just autopool strings behind the user's back. This is an option because Julia treats strings as immutable, so any two strings with the same value are just as good as each other. The strings still just act as strings but you'd avoid all the storage overhead of repeating the strings over and over again.

@johnmyleswhite
Copy link
Member

After implementing some ideas for working with missing scalar values in the NullableTypes package, I decided to go ahead and implement Simon's 1st proposal at the start of this issue so that we could have some way of working with scalar values of categorical datatypes. This reflects my concession that getting an Enum type to work isn't necessary.

Based on Simon's proposal, I've built up a potential backend for working with scalar values that are instances of either categorical and ordinal variables. You can check out what I've done at https://github.com/johnmyleswhite/CategoricalData.jl.

Basically, I've:

  • Replicated some of the PooledDataArray backend functionality in a way that's not inherently linked to missingness or to any specific array storage format. This makes it easier to talk about categorical variables without any references to an underlying PDA object. As a consequence, it's a little bit easier to talk about scalar objects that are categorical variables.
  • Added some basic tools for manipulating the pools from which values are drawn with methods like levels, levels!, order and order!.
  • Added types that represent scalar objects that implement either a single value of a categorical variable or a single value of an ordinal variable.
  • Defined ordering operations on ordinal variables so that scalars can be compared based on their rank order. These operations also raise a meaningful error message when you attempt to compare the order of categorical factors.

One thing that I've done that's not strictly necessary, but I would argue is helpful, is to remove any notion of missingness from the underlying representation of categorical data. My idea is that we'd replace the current PooledDataArray type with two new types, CategoricalArray and OrdinalArray, that build on top of the functionality provided in CategoricalData.jl. Those new types would handle the problem of missingness for arrays.

At the scalar level, we'd kick things back to NullableTypes.jl. Thus, indexing a scalar element from a CategoricalArray or a OrdinalArray would produce either an object of type Nullable{CategoricalVariable} or Nullable{OrdinalVariable}. That added bit of indirection ensures that the interface for these objects would be consistent with the interface for Nullable{T} for any other type.

@garborg
Copy link
Member

garborg commented Aug 17, 2014

I like the CategoricalArray without missingness -- I've thought about that when dealing with PDAs.

@johnmyleswhite
Copy link
Member

Sorry, I was a little unclear. My idea was that CategoricalArray would support missingness just as PDA does now (by using a pointer to an invalid index like 0), but that CategoricalPool would not support a missing value as a level. (Unless, of course, the missing value were explicit like "Did not answer question.")

@johnmyleswhite
Copy link
Member

What's the use case for allowing users to say that CategoricalArray not permit NULL? To get something like the non-NULL constraint we're allowing now by inserting an Array into a DataFrame?

@garborg
Copy link
Member

garborg commented Aug 17, 2014

It would definitely be good for that.

I suppose the real goal is, that however people end up representing arrays of categorical data without missing values elsewhere (perhaps an Enum type in Base), the API for the categorical version of DataArrays should mirror the Enum-like type as DataArrays mirrors Arrays (working with either is near identical except for added code/computational complexity to deal with missingness when working with DataArrays), and, for example, a function that builds a ModelMatrix should be be happy to accept either type.

That seems like a consideration to push aside for now in favor of easier experimentation. But I really appreciate the work to make DataArrays and DataFrames consistent with Base -- I'm the kind of person who forgets a library's names, syntax quirks, etc., after a very short time away.

@johnmyleswhite
Copy link
Member

I guess the question for me is how Enum relates to CategoricalVariable. OrdinalVariable is, in my mind, clearly distinct from Enum because of the support for a custom order.

So the question is whether CategoricalVariable is redundant in the face of Enum.

I think the answer is probably no, but I'm not super sure.

Here's what I feel comfortable saying:

  • People are going to want to redefine the number of levels that a CategoricalVariable can take on. For example, given a CategoricalVariable with levels like "Far Left", "Left", "Moderate", "Right" and "Far Right", it's very natural to want to recode this variable as "Left, "Moderate" and "Right". As I understand it, this kind of recoding would introduce type-instability if the number of levels were a property of the Enum's type. (Saying this, I realize that I haven't written any code that provides a nice interface for recoding variables. I think that needs to happen at the CategoricalArray level to be really smooth.)
  • It's important that a CategoricalVariable that wraps T not behave like T by default. It should, for example, not be possible to add two CategoricalVariable{Int} objects together. I think Enum might do this, but definitely CategoricalVariable does. Not a strong point, but maybe important.

@garborg
Copy link
Member

garborg commented Aug 17, 2014

Very good points in the bullets. I remember seeing in discussion of Enums for Base that order may be handled there, so it's possible that if CategoricalVariable was redundant OrdinalVariable would be, too, but I think you're right that neither will be redundant.

If neither type is redundant, it seems increasingly desirable to for both to have mirror types that don't take NA, but I'm not ready to take a strong stance one way or the other there.

@simonster
Copy link
Member Author

I think we actually want the CategoricalPool to hold an array of CategoricalVariables so that, instead of constructing a new CategoricalVariable on every access to a CategoricalArray, we would can just pull the CategoricalVariable out of the CategoricalPool. The reason is that a CategoricalVariable is a non-pointerfree immutable (because it holds a reference to the CategoricalPool, which holds an array), and it appears that Julia presently allocates these on the heap, i.e., they have performance characteristics more like regular objects than pointerfree immutable objects. To avoid allocation on every getindex call, we could just return a CategoricalVariable that's already constructed instead of constructing a new one. The only problem is that then the definitions of CategoricalVariable and CategoricalPool are mutually circular and you'd need to use one of the workarounds from JuliaLang/julia#269 to specify them.

I still think that Enums of the type people want in Base, which are primarily for controlling what code gets executed, are different from CategoricalVariables, which are primarily for holding data, and it's important we not conflate these two.

@garborg
Copy link
Member

garborg commented Aug 17, 2014

I kind of regret referring to Enums now without knowing more about what
they'll look like -- what I really wanted to say is that it seems
reasonable to me that someone dealing with categorical data that didn't
involve missingness would prefer to work with a type that doesn't take NA,
and a shared API (apart from the some functions dealing with missingness)
would be nice. (Not having played with this package, I can't really say yet
whether I think it would work or be worth it.)

On Sun, Aug 17, 2014 at 3:04 PM, Simon Kornblith notifications@github.com
wrote:

I think we actually want the CategoricalPool to hold an array of
CategoricalVariables so that, instead of constructing a new
CategoricalVariable on every access to a CategoricalArray, we would can
just pull the CategoricalVariable out of the CategoricalPool. The reason is
that a CategoricalVariable is a non-pointerfree immutable (because it holds
a reference to the CategoricalPool, which holds an array), and it appears
that Julia presently allocates these on the heap, i.e., they have
performance characteristics more like regular objects than pointerfree
immutable objects. To avoid allocation on every getindex call, we could
just return a CategoricalVariable that's already constructed instead of
constructing a new one. The only problem is that then the definitions of
CategoricalVariable and CategoricalPool are mutually circular and you'd
need to use one of the workarounds from JuliaLang/julia#269
JuliaLang/julia#269 to specify them.

I still think that Enums of the type people want in Base, which are
primarily for controlling what code gets executed, are different from
CategoricalVariables, which are primarily for holding data, and it's
important we not conflate these two.


Reply to this email directly or view it on GitHub
#73 (comment)
.

@johnmyleswhite
Copy link
Member

Ok. It sounds like we're all in agreement that Enum's aren't quite what we want. That makes me feel better about having implemented this stuff. :)

@simonster, your note about being pointer-free is well-taken. I'll try to cycle back on this soon and experiment with an alternative implementation that includes a small set of constant CategoricalVariables in a pool that can returned to users. There's also a question of whether the CategoricalPool should be immutable or not. The levels! function would be a lot simpler if the pool were mutable.

@garborg, the package I wrote doesn't provide much usable functionality: it's more a demonstration of a potential backend than a drop-in replacement for PooledDataArray's. I can demo out an implementation of CategoricalArray's and OrdinalArray's to give a feel how these things would work out in practice.

@nalimilan
Copy link
Member

@johnmyleswhite I don't get why OrdinalPool() takes two arguments. Wouldn't it be enough to take the levels in the order they appear in the first argument?

I also have a small remark regarding the naming: CategoricalVariable evokes an array to me; maybe it would be better to call it CategoricalValue instead?

Regarding enums, I agree with @garborg that we don't know yet how they will work, so it's hard to tell whether they would make CategoricalVariable redundant or not.

@johnmyleswhite
Copy link
Member

@nalimilan You're right: there's no reason to have the two-vector constructor for OrdinalPool. I noticed that in my last pass through the code, but was too lazy to fix it.

CategoricalValue works for me.

@nalimilan
Copy link
Member

With a Ph.D. and a permanent position to start soon, I've finally found the time to take up @johnmyleswhite's work on CategoricalData.jl, and bring it to a point where I think it can constitute a replacement for PooledDataArrays: https://github.com/nalimilan/CategoricalArrays.jl

The features are essentially the same (though with additional support for ordered levels), and the API is very similar, except that instead of suffering from the NA type instability, it comes in two variants: one which does not support missing values, and one which returns Nullable objects. It should thus be a good companion to NullableArrays, but for categorical data.

Please have a look at the README (and if possible at the implementation) and tell me what you think. If you agree, I'd like to move this to JuliaStats soon, and open a branch in DataFrames which no longer uses DataArrays at all.

Cc possibly interested people: @andreasnoack @davidagold @tshort @bkamins

@andreasnoack
Copy link
Member

Looks great. A few comments

  • Carrying around a Pool reference in the values will probably be quite costly but I guess that most of the computationally intensive operations will take place in functions that take an Ordinal/NominalArray as input and, therefore, computations on Ordinal/NominalValues would be relatively rare. You have probably been using these types so how important/useful to have the Pool referenced in the values? You could think of the Pool as a property of the collection only, i.e. when extracting a value then you "loose" the categorical information. I've asked earlier but I don't think I got a reply.
  • Have you thought of having a fixed number levels in the pool such that you could use a tuple instead of a vector for storage?

@nalimilan
Copy link
Member

Carrying around a Pool reference in the values will probably be quite costly but I guess that most of the computationally intensive operations will take place in functions that take an Ordinal/NominalArray as input and, therefore, computations on Ordinal/NominalValues would be relatively rare. You have probably been using these types so how important/useful to have the Pool referenced in the values? You could think of the Pool as a property of the collection only, i.e. when extracting a value then you "loose" the categorical information. I've asked earlier but I don't think I got a reply.

I haven't done systematic benchmarks yet. You're right that the most common use case is probably to pass a whole array, not individual values.

The reference to the pool may not be required after all. As I see it, it can be useful as it would allow comparing values from two distinct pools which are !== but are == (by testing that levels and order are equal). But that comparison would be dead slow anyway WRT just testing for equality of two pointers and of two integers. So a better solution could be to store the hash of the levels array, which would allow making NominalValue/OrdinalValue pointer-free immutables. That sounds so nice that I wonder whether I have forgotten a reason why we didn't choose that solution.

Have you thought of having a fixed number levels in the pool such that you could use a tuple instead of a vector for storage?

I don't think that would be very practical, as it would introduce type instability in a lot of places. It's already quite annoying to change the integer type used to store references based on the size of the pool (which isn't done automatically for this reason). Also, do you think it would bring a significant speed improvement? Accessing an array by integer indexing is fast too.

@nalimilan
Copy link
Member

I remember the problem now: storing (a reference to) the pool is the only way to provide both very fast equality checks between values from the same pool (the most common scenario), while still supporting comparisons between different pools. In the first case (same pool), you can simply check that pools are ===, and then compare the integer indices. In the second case, you can extract the actual levels from the pool and compare them (which is slower for e.g. strings).

If we don't want to keep a reference to the pool, we need to store the actual level, which in the common case will be a string: AFAIK this wouldn't be really more efficient than keeping a reference to the pool, would it? Except if we could find a way to store strings inline, e.g. as a tuple of chars, as an implementation detail.

(While it's currently and error in the package, I think the ability to compare values from different pools is important, as when importing data arrays may have the same set of levels but not the same pool objects, and requiring the user to merge them would be painful.)

Forgot to CC @quinnj.

@quinnj
Copy link
Member

quinnj commented Jun 11, 2016

Thanks for the CC @nalimilan; I'll look more into your fork of CategoricalArrays.

I've read through a good number of the comments here, but I'd like to add one more point on @simonster 's original point 4, which would make a type definition like

# `O` = whether the Category is ordered or not
# `I` internal storage type
# `T` = tuple of symbols representing Category level values
immutable Category{O,I,T}
    value::I
end

I dislike throwing out this option simply because of the current state of code generation in Julia (i.e. over-specializing every method that operates on Category). IMO, this is the simplest, most straightforward implementation and I'd hate to see us get additional code despecialization capabilities in 0.6 and then have an entire package or two become overly complicated. It'd be nice to be able to do something like:

immutable Category{O,I,T::ANY}
    value::I
end

that would help the compiler know that it shouldn't ever specialize methods on the T parameter.

I'd much rather write up an implementation based on this approach and then provide enough whiskey to @JeffBezanson, @vtjnash, and @carnaval during JuliaCon to add the ability to despecialize a type parameter 😄

@nalimilan
Copy link
Member

The downside with that approach is that any code modifying the levels would become type unstable if levels are not know statically. I'm also not sure what's the practical advantage: why would statically knowing the possible levels make anything faster?

@quinnj
Copy link
Member

quinnj commented Jun 11, 2016

Is it really a common use case to need to modify an existing Category's levels? I can see a case where you "build up" the levels first, outside the type, but in my mind, once you have them, you then build your Category type and you're done.

The practical advantage of a definition like Category is that we don't have to define yet-another-type-of-array like NominalArray/OrdinalArray. I could just have a NullableArray{Category, 1} and everything works fine. That seems like a huge win based on what I've seen from NullableArrays/DataArrays and the amount of machinery that needs to be defined to get your own Array type working.

@nalimilan
Copy link
Member

It's quite frequent in my experience to change levels e.g. because you take a subset of the dataset and you want to drop unused levels, or even more frequently to recode data. If the levels are part of the type definition, you need to pass them in advance before assigning them based on various conditions (which violates the DRY principle), and setindex! cannot even create a new level (which is one of the annoying things with R's factors).

The machinery required to implement a custom type is significant, but that's not the end of the world either. CategoricalArrays.jl is mostly functional now, and it's only ~600 lines of code under src/ (+ 2000 for tests...). It's simpler than PDAs since it doesn't need to implement arbitrary functions like arithmetic operators (since the data is categorical). And several constructors for the internal pool aren't actually used and could be removed.

@nalimilan
Copy link
Member

See JuliaData/DataFrames.jl#994 regarding port of DataFrames to NullableArrays and CategoricaArrays.

@nalimilan
Copy link
Member

Closing now that DataFrames has been ported (JuliaData/DataFrames.jl#1008).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants