Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pooling strings #895

Closed
tshort opened this issue Dec 9, 2015 · 12 comments
Closed

Pooling strings #895

tshort opened this issue Dec 9, 2015 · 12 comments

Comments

@tshort
Copy link
Contributor

tshort commented Dec 9, 2015

As part of experiments in speeding up grouping and joining (#894), using a string pool speeds up several operations. Plus less memory is generally allocated. I've written basic code to do this; see here. The general idea is to keep strings in a pool, so a string is only stored once. From the user's perspective, a global string pool is the easiest, but I have support for custom pools. The other advantage is that there is a mapping to integers which helps for grouping and joining.

I'm considering writing a package to fully implement this and integrate it into DataFrames. Before I do that, does anyone have any issues with the general approach?

Here are some possible stumbling blocks I've thought of:

  • Garbage collection--With my current implementation, unused strings in the global pool never get freed up. A reference-counting system could be used to re-use parts of the global pool. I'm not sure it's worth it, at least at the beginning.
  • Input/output--When reading in objects that refer to the global pool, the pool that's been saved must be synced with the stored pool. This is also an issue when serializing between processors. For JLD, I think it's straightforward to handle this, but we'd have to add JLD.jl as a package dependency. Another IO issue is reading CSV files. We would probably need an option to determine whether or not strings are pooled.
  • NA's/Nullables--A pooled string can easily have a built-in notion of Null just by storing an index of zero. It's easy to define isnull for a PooledString and for a PooledStringArray. It's less clear how this fits in with the NullableArray ecosystem. This is an area where a Trait indicating support for the Null concept would be really nice. One huge plus is that pooling should reduce the use of PooledDataArrays in DataFrames (Replace PooledDataArray with NominalVariable/CategoricalVariable JuliaStats/DataArrays.jl#73). In fact, if we extend pooling to general types, we could probably get rid of PDA's in DataFrames.
  • Precompiling--I think we could set up a global pool that doesn't interfere with the ability to precompile DataFrames. But, this is an area with some gotchas that I may have missed.

A pooled type is closely related to the topic of Categorical and Ordinal types (https://github.com/johnmyleswhite/CategoricalData.jl).

@simonster
Copy link
Contributor

You've probably already considered this, but Symbols are effectively globally pooled strings. ptr = pointer_from_objref(symbol(x)) should give you a unique pointer corresponding to a specific string, and unsafe_pointer_to_objref(ptr) can convert that pointer back to a Symbol. I'm curious how this would compare in terms of performance.

OTOH, I think garbage collection may eventually become an issue for some applications.. One potential scenario is a web service that uses DataFrames. You don't necessarily want to have to restart the Julia process whenever the pool gets too big. So maybe we're better off with an abstraction that leaves the door open to local string pools.

@tshort
Copy link
Contributor Author

tshort commented Dec 9, 2015

I looked at using Symbols as a global pool for strings. That helps memory wise, but it doesn't give us a nice mapping to integers that helps speed up grouping.

Garbage collection is possible. I'm planning to skip it initially to avoid the extra bookkeeping headaches. I am looking at ways to use local string pools, too. That tends to require the user to do more work to keep track of which object goes with which pool. Local pools can be more compact, so they should be faster. In my existing code, I've got an ID type parameter on pools to try to keep pooled types separate, so you don't vcat two pooled arrays with different pools and get nonsense answers.

@nalimilan
Copy link
Member

This approach probable has some advantages, but would it allow ordering the levels? This small feature is essential in many applications in order to replace PDAs IMHO. Is the plan to combine PooledString and CategoricalVariable when needed?

Other than that, I agree starting without a GC is reasonable, as long as you know how it could integrate with this system later.

@tshort
Copy link
Contributor Author

tshort commented Dec 9, 2015

There's no plan yet for combining/converting to categorical or ordinal variables. This still needs to be worked out. The current PooledString is very similar to @johnmyleswhite's CategoricalVariable. The type definitions are almost identical (code reused, actually). I tend to think of a PooledString as a looser version of a CategoricalVariable, particularly the versions that use the global pool. We could provide conversions to CategoricalVariables or maybe we could just share API or traits.

For ordinal types, more effort is needed. As with John's code, an OrdinalVariable could wrap a PooledString or a CategoricalVariable.

@nalimilan
Copy link
Member

Though even without ordinal variables, you often want an ordering, be it to present the results in a table or a plot, or to choose the default contrasts in models. When the pool is specific to a variable, you get the ordering for free. But when the pool is global, you need to store the order in another way if you want it.

@alyst
Copy link
Contributor

alyst commented Dec 9, 2015

The benchmarks for the pooled strings looks awesome! It's definitely something that should be implemented in the long run.
IIUC, from the DataFrame's perspective it only matters if the column is of categorical type or not (to optimize joining/grouping). So maybe something like AbstractCategoricalArray type could be defined to decouple DataFrame from the concrete implementation and value management policy.

@tshort
Copy link
Contributor Author

tshort commented Dec 10, 2015

@alyst, I agree on trying to decouple DataFrames from concrete implementations.

@nalimilan, I agree on wanting ordering. I think it needs to be in an extra step. For example, we can probably support this in two ways:

  1. Use orderby to change the order of a grouping. You can do this now in DataFramesMeta.
  2. Convert a globally pooled column to something else. That could be a CategoricalVariable, an OrdinalVariable, or even another PooledStringArray. Those could be pretty quick operations. For the last one, we could provide a utility to compress the pool to unique values and reorder the levels.

@tshort
Copy link
Contributor Author

tshort commented Dec 11, 2015

See the following for a more fleshed-out package:

https://github.com/tshort/PooledElements.jl

Not much for docs, but you can look at the tests to get some idea of the features. The Nullables features have been fleshed out more, trying to be compatible with NullableArrays.

@tshort
Copy link
Contributor Author

tshort commented Dec 14, 2015

Overall, things seem to be coming together well. Saving and loading with JLD works. Ordering and converting between pools is working well and seems efficient. Here is an example of converting to use of another Pool and then an example of converting to a sorted Pool:

x = PooledStringArray(Pool(["x", "a"]), ["b", "c", "a"])
y = repool(x, Pool())
z = repool(y, Pool(sort(levels(y))))

The biggest issue remains garbage collection. I've left it open to support different Pools, so a GC-enabled pool could be added. The biggest headache that would follow is that each type that stored references to a GarbageCollectedPool would have to notify the pool type of any additions or deletions to pool references, and it would need a finalizer to delete references as appropriate. That's not too bad for the couple of types we have currently, but it would make it more difficult to add types that support pools. So for now, I think it's best to leave it out and do our best to support non-global pools and conversion between pools.

@quinnj
Copy link
Member

quinnj commented Sep 7, 2017

@nalimilan, is there anything else to do here?

@nalimilan
Copy link
Member

nalimilan commented Sep 7, 2017

I think this still needs considering/discussing. Let me restate the idea. In the majority of cases, data frame columns contain categorical data, defined in the broad sense as a vector with a small number of unique values. It would be silly not to take into account this situation, in particular when importing data from CSV, since storing references to a pool instead of the actual string radically improves performance regarding memory use and efficiency of processing (e.g. for merging, joining and grouping).

That said, we already have CategoricalArray, which mostly fits that use case. The only differences between CategoricalArray and PooledArray are:

  1. getindex on CategoricalArrays returns a CategoricalValue{String} rather than a plain String
  2. CategoricalArrays store an ordered set of levels, which can be a superset of the values actually present in the data (this is really not an issue since unique still behaves the same as for other arrays, as opposed to the special levels function)

So overall the difference is quite limited. I think point 1. could be further alleviated by making CategoricalValue <: AbstractString and implementing the full interface, so that the difference is mostly invisible for users except for < and isless (which reflect the order of levels, which is by default that of sort, i.e. it's consistent with string comparisons). I've been willing to do that for some time anyway since that can be convenient.

The remaining question is: what should we do by default e.g. in CSV.jl? As I said, in most cases pooling strings should increase performance a lot. But should we do that by default? Probably. For example, R does that under the hood for all strings. On the other hand, R also creates categorical arrays (factor) by default when creating data frames, and that's considered as one of the biggest flaws in the language (1, 2, 3, 4...). So I wouldn't want to repeat the same mistake in Julia. However, there are notable differences from R:

  • in R, any operation on data frame will convert your string column to a factor; we would never do that
  • R factors are inconvenient/buggy in many ways; in particular, you cannot assign a value which is not already in the levels, and the underlying implementation as integer codes is leaky (sometimes you get an array of integers rather than of strings)

Finally, pooling strings in general (whether as categorical or not) may not be a good idea in all cases: if lots of strings are unique (e.g. short texts, person names...) and you modify them, the pool will keep growing without any performance gain. Dropping unused strings or using reference counting will only add overhead.

So overall this is a complex issue. I think we should at the minimum add a categorical::Bool keyword argument to CSV.read, which when true will turn all non-numeric columns into CategoricalArray. Whether to use true or false as the default is debatable. false is certainly the safest choice, but that would mean performance wouldn't be optimal by default for mainly categorical data sets; maybe that's fine. A more general solution would be to have a stringtype argument which could be WeakRefString (replacing weakrefstrings=true), String, CategoricalValue or PooledString. Users could also choose the type of individual columns using the existing types argument.

Overall I think that's not really an issue in DataFrames itself (except maybe for adding support for efficient PooledArray columns hashing just like for CategoricalArray), but rather in DataStreams, CSV and other data import packages. But it's worth coordinating.

@nalimilan
Copy link
Member

After JuliaData/CSV.jl#102 CSV.jl will return CategoricalArrays for string columns with less than 66% of unique values. To make this more practical, JuliaData/CategoricalArrays.jl#77 makes CategoricalValue an AbstractString, so that the difference between an Array{String} and a CategoricalArray{String} should be limited to levels, their ordering, and of course efficiency.

So let's close this, we can always revisit this if experience shows we need yet another type for pooled strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants