-
Notifications
You must be signed in to change notification settings - Fork 87
Suffix Array thoughts
Suffix Array is a sorted array of all suffixes of a string. This data structure is useful for string algorithms including lossless data compression and string sorting Example
Given a string "banana$", the suffixes are as follows.
s[0]="banana$"
s[1]="anana$"
s[2]="nana$"
s[3]="ana$"
s[4]="na$"
s[5]="a$"
s[6]="$"
The suffix array of string "banana$" is the array of indices of sorted suffixes.
s[6]="$"
s[5]="a$"
s[3]="ana$"
s[1]="anana$"
s[0]="banana$"
s[4]="na$"
s[2]="nana$"
so sa=[6,5,3,1,0,4,2]
Arkouda has pull requests (See PR #865 and #627) which implement suffix arrays with linear time construction
The current implementation has suffix array as a standalone object with the client having disjoint SArrays
and Strings
classes. I've started thinking that suffix array construction should be framed as an operation performed on a Strings
object.
My thought process is a suffix array is an intermediary representation of a string useful for some algorithms and doesn't make sense without an associated string. So instead of converting Strings
objects into SArrays
, we could have the suffix array construction methods reside in the strings class.
There are 2 approaches that come to mind when making this happen:
- Have all algorithms which require suffix arrays to create them on the fly as necessary
- This shouldn't be an algorithmic bottle neck since the construction is linear
- Automatically create the suffix array for all strings objects and store this on the server side
- Increased memory required to avoid repeated construction of suffix arrays if multiple algorithm requiring them are called on the same SegString
- This extra memory is already happening with building
SArrays
objects fromStrings
I am in favor of building suffix arrays as needed while we are implementing the algorithms which require them. We can then do a bake-off to see how the performance compares with our current non-suffix array methods. If these perform better, we can look into having more permanent storage for suffix arrays as an additional component in SegStrings on the server to see if this is worthwhile