Pooling strings #895
Comments
You've probably already considered this, but Symbols are effectively globally pooled strings. OTOH, I think garbage collection may eventually become an issue for some applications. One potential scenario is a web service that uses DataFrames. You don't necessarily want to have to restart the Julia process whenever the pool gets too big. So maybe we're better off with an abstraction that leaves the door open to local string pools. |
I looked at using Symbols as a global pool for strings. That helps memory wise, but it doesn't give us a nice mapping to integers that helps speed up grouping. Garbage collection is possible. I'm planning to skip it initially to avoid the extra bookkeeping headaches. I am looking at ways to use local string pools, too. That tends to require the user to do more work to keep track of which object goes with which pool. Local pools can be more compact, so they should be faster. In my existing code, I've got an ID type parameter on pools to try to keep pooled types separate, so you don't |
This approach probably has some advantages, but would it allow ordering the levels? This small feature is essential in many applications in order to replace PDAs IMHO. Is the plan to combine Other than that, I agree starting without a GC is reasonable, as long as you know how it could integrate with this system later. |
There's no plan yet for combining/converting to categorical or ordinal variables. This still needs to be worked out. The current For ordinal types, more effort is needed. As with John's code, an |
Though even without ordinal variables, you often want an ordering, be it to present the results in a table or a plot, or to choose the default contrasts in models. When the pool is specific to a variable, you get the ordering for free. But when the pool is global, you need to store the order in another way if you want it. |
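To make the point about a global pool concrete, here is a minimal sketch of storing a per-column ordering next to references into a shared pool. Everything in it is an assumption for illustration: `pool_values`, `OrderedColumn`, `level_ranks`, and `sorted_values` are hypothetical names, not part of any package discussed here.

```julia
# Hypothetical sketch: `pool_values` stands in for a shared (global) pool, and
# `OrderedColumn` is an illustrative type, not one from any package.
pool_values = ["medium", "low", "high", "unrelated"]   # shared pool, insertion order

struct OrderedColumn
    refs::Vector{Int}     # indices into the shared pool
    order::Vector{Int}    # pool indices of this column's levels, in display order
end

# Rank of each pool index within the column's own ordering (0 if not a level).
function level_ranks(col::OrderedColumn, npool::Integer)
    ranks = zeros(Int, npool)
    for (rank, id) in enumerate(col.order)
        ranks[id] = rank
    end
    ranks
end

# Sort the column's values by the column-specific level order.
function sorted_values(col::OrderedColumn, values::Vector{String})
    ranks = level_ranks(col, length(values))
    [values[id] for id in sort(col.refs, by = id -> ranks[id])]
end

col = OrderedColumn([3, 2, 1, 2], [2, 1, 3])   # levels ordered low < medium < high
sorted = sorted_values(col, pool_values)
```

The ordering lives in the column rather than in the pool, so two columns sharing the pool can order the same strings differently. The cost is the extra `order` vector that each column has to carry around.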
The benchmarks for the pooled strings look awesome! It's definitely something that should be implemented in the long run. |
@alyst, I agree on trying to decouple DataFrames from concrete implementations. @nalimilan, I agree on wanting ordering. I think it needs to be in an extra step. For example, we can probably support this in two ways:
|
See the following for a more fleshed-out package: https://github.com/tshort/PooledElements.jl Not much for docs, but you can look at the tests to get some idea of the features. The |
Overall, things seem to be coming together well. Saving and loading with JLD works. Ordering and converting between pools is working well and seems efficient. Here is an example of converting to use another Pool, followed by an example of converting to a sorted Pool:
    x = PooledStringArray(Pool(["x", "a"]), ["b", "c", "a"])
    y = repool(x, Pool())
    z = repool(y, Pool(sort(levels(y))))
The biggest issue remains garbage collection. I've left it open to support different Pools, so a GC-enabled pool could be added. The biggest headache that would follow is that each type that stores references to a GarbageCollectedPool would have to notify the pool type of any additions or deletions to pool references, and it would need a finalizer to delete references as appropriate. That's not too bad for the couple of types we have currently, but it would make it more difficult to add types that support pools. So for now, I think it's best to leave it out and do our best to support non-global pools and conversion between pools. |
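For readers who want to see what converting between pools involves, here is a rough sketch in plain Julia. It is not the PooledElements.jl implementation: `remap_refs` is a hypothetical helper, pools are represented as plain vectors of strings, and the assumption that the first pool ends up holding "x", "a", "b", "c" in that order is mine.

```julia
# Hypothetical helper; the real `repool` in PooledElements.jl may work differently.
# Given references into `old_levels`, return references into `new_levels`
# that point to the same strings.
function remap_refs(refs::Vector{Int}, old_levels::Vector{String}, new_levels::Vector{String})
    new_index = Dict(s => i for (i, s) in enumerate(new_levels))
    # Translate each old index once, then apply the translation to every reference.
    translation = [new_index[s] for s in old_levels]
    [translation[r] for r in refs]
end

old_levels = ["x", "a", "b", "c"]   # assumed pool contents, in insertion order
refs = [3, 4, 2]                    # stands for "b", "c", "a"
new_levels = sort(old_levels)       # sorted pool: "a", "b", "c", "x"
new_refs = remap_refs(refs, old_levels, new_levels)
```

With a sorted pool, comparing two references as integers gives the same result as comparing the strings they refer to, which is one way to get ordering as an extra step.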
@nalimilan, is there anything else to do here? |
I think this still needs considering/discussing. Let me restate the idea. In the majority of cases, data frame columns contain categorical data, defined in the broad sense as a vector with a small number of unique values. It would be silly not to take into account this situation, in particular when importing data from CSV, since storing references to a pool instead of the actual string radically improves performance regarding memory use and efficiency of processing (e.g. for merging, joining and grouping). That said, we already have
So overall the difference is quite limited. I think point 1. could be further alleviated by making The remaining question is: what should we do by default e.g. in CSV.jl? As I said, in most cases pooling strings should increase performance a lot. But should we do that by default? Probably. For example, R does that under the hood for all strings. On the other hand, R also creates categorical arrays (
Finally, pooling strings in general (whether as categorical or not) may not be a good idea in all cases: if lots of strings are unique (e.g. short texts, person names...) and you modify them, the pool will keep growing without any performance gain. Dropping unused strings or using reference counting will only add overhead. So overall this is a complex issue. I think we should at the minimum add a Overall I think that's not really an issue in DataFrames itself (except maybe for adding support for efficient |
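A small sketch of the mostly-unique case described above, using a hypothetical `pool_strings` helper (a plain `Dict` plus a `Vector`, not any package's pool type). It only illustrates how the pool size tracks the number of distinct strings; it does not measure performance.

```julia
# Hypothetical helper: intern every string in `strings` and return the
# integer references together with the pool contents.
function pool_strings(strings)
    index = Dict{String,Int}()   # string -> position in `levels`
    levels = String[]            # pool contents
    refs = Int[]
    for s in strings
        id = get!(index, s) do
            push!(levels, s)
            length(levels)
        end
        push!(refs, id)
    end
    refs, levels
end

few_levels = pool_strings(repeat(["yes", "no"], 500))      # 1000 values, 2 distinct
mostly_unique = pool_strings(["name$i" for i in 1:1000])   # 1000 values, all distinct

pool_sizes = (length(few_levels[2]), length(mostly_unique[2]))
```

In the second case the pool holds as many strings as the column itself, so the references are pure overhead on top of the strings.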
After JuliaData/CSV.jl#102 CSV.jl will return So let's close this; we can always revisit it if experience shows we need yet another type for pooled strings. |
As part of experiments in speeding up grouping and joining (#894), using a string pool speeds up several operations. Plus less memory is generally allocated. I've written basic code to do this; see here. The general idea is to keep strings in a pool, so a string is only stored once. From the user's perspective, a global string pool is the easiest, but I have support for custom pools. The other advantage is that there is a mapping to integers which helps for grouping and joining.
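As a rough illustration of the idea (not the code linked above), a pool can be sketched as a dictionary from strings to integer IDs plus a vector for the reverse lookup. The names `SketchPool`, `intern!`, and `group_counts` are hypothetical and chosen only for this example.

```julia
# Hypothetical names throughout; this is not the API of any package.
struct SketchPool
    index::Dict{String,UInt32}   # string -> integer ID
    values::Vector{String}       # integer ID -> string
end

SketchPool() = SketchPool(Dict{String,UInt32}(), String[])

# Return the ID of `s`, adding it to the pool the first time it is seen.
function intern!(pool::SketchPool, s::AbstractString)
    get!(pool.index, String(s)) do
        push!(pool.values, String(s))
        UInt32(length(pool.values))
    end
end

# Count occurrences per group by indexing a plain vector with the integer IDs,
# instead of hashing each string again.
function group_counts(ids::Vector{UInt32}, npool::Integer)
    counts = zeros(Int, npool)
    for id in ids
        counts[id] += 1
    end
    counts
end

pool = SketchPool()
ids = [intern!(pool, s) for s in ["b", "c", "a", "b", "b"]]
counts = group_counts(ids, length(pool.values))
```

Each distinct string is stored once in `pool.values`, and the column itself only holds the small integer IDs. Grouping then reduces to array indexing on those IDs.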
I'm considering writing a package to fully implement this and integrate it into DataFrames. Before I do that, does anyone have any issues with the general approach?
Here are some possible stumbling blocks I've thought of:
isnull for a PooledString and for a PooledStringArray. It's less clear how this fits in with the NullableArray ecosystem. This is an area where a Trait indicating support for the Null concept would be really nice.
One huge plus is that pooling should reduce the use of PooledDataArrays in DataFrames (Replace PooledDataArray with NominalVariable/CategoricalVariable JuliaStats/DataArrays.jl#73). In fact, if we extend pooling to general types, we could probably get rid of PDAs in DataFrames.
A pooled type is closely related to the topic of Categorical and Ordinal types (https://github.com/johnmyleswhite/CategoricalData.jl).