Skip to content

Suffix Array thoughts

pierce314159 edited this page Jul 19, 2021 · 5 revisions

Quick introduction

Suffix Array is a sorted array of all suffixes of a string. This data structure is useful for string algorithms including lossless data compression and string sorting Example

Given a string "banana$", the suffixes are as follows.
    s[0]="banana$"
    s[1]="anana$"
    s[2]="nana$"
    s[3]="ana$"
    s[4]="na$"
    s[5]="a$"
    s[6]="$"
The suffix array of string "banana$"  is the array of indices of sorted suffixes.
    s[6]="$"
    s[5]="a$"
    s[3]="ana$"
    s[1]="anana$"
    s[0]="banana$"
    s[4]="na$"
    s[2]="nana$"
so sa=[6,5,3,1,0,4,2]

Arkouda has pull requests (See PR #865 and #627) which implement suffix arrays with linear time construction

Suffix Array construction as an operation on Strings

The current implementation has suffix array as a standalone object with the client having disjoint SArrays and Strings classes. I've started thinking that suffix array construction should be framed as an operation performed on a Strings object.

My thought process is a suffix array is an intermediary representation of a string useful for some algorithms and doesn't make sense without an associated string. So instead of converting Strings objects into SArrays, we could have the suffix array construction methods reside in the strings class.

There are 2 approaches that come to mind when making this happen:

  • Have all algorithms which require suffix arrays to create them on the fly as necessary
    • This shouldn't be an algorithmic bottle neck since the construction is linear
  • Automatically create the suffix array for all strings objects and store this on the server side
    • Increased memory required to avoid repeated construction of suffix arrays if multiple algorithm requiring them are called on the same SegString
    • This extra memory is already happening with building SArrays objects from Strings

I am in favor of building suffix arrays as needed while we are implementing the algorithms which require them. We can then do a bake-off to see how the performance compares with our current non-suffix array methods. If these perform better, we can look into having more permanent storage for suffix arrays as an additional component in SegStrings on the server to see if this is worthwhile