Skip to content

Suffix Array thoughts

pierce314159 edited this page Jul 19, 2021 · 5 revisions

Quick introduction

A Suffix Array is a sorted array of all suffixes of a string. This data structure is useful for certain string algorithms including lossless data compression and string sorting

Example:
Given a string "banana$", the suffixes are as follows
    s[0]="banana$"
    s[1]="anana$"
    s[2]="nana$"
    s[3]="ana$"
    s[4]="na$"
    s[5]="a$"
    s[6]="$"
The suffix array of string "banana$" is the array of indices of sorted suffixes
    s[6]="$"
    s[5]="a$"
    s[3]="ana$"
    s[1]="anana$"
    s[0]="banana$"
    s[4]="na$"
    s[2]="nana$"
so sa=[6,5,3,1,0,4,2]

Arkouda has pull requests (See PR #865 and #627) which implement suffix arrays with linear time construction

Integrating Suffix Arrays with Strings

In the current implementation, suffix array is a standalone object with the client having disjoint SArrays and Strings classes. I've started thinking that suffix array construction should be an operation performed on a Strings object.

My reasoning is a suffix array is an intermediary representation which doesn't make sense without an associated string. There isn't a use case for having a suffix array without a string, so it doesn't need to be it's own class. Instead of converting Strings objects into SArrays, we could have the suffix array construction methods reside in the strings class.

There is the added benefit of reducing duplicated code since much of the suffix array class code (suffix_array.py and SegmentedSuffixArray.chpl) is very similar to strings class code (strings.py and SegmentedArray.chpl) with a few added functions

Approach

  • Have all algorithms which require suffix arrays to create them on the fly as necessary
    • This shouldn't be an algorithmic bottle neck since the construction is linear

We build suffix arrays as needed while we are implementing the algorithms which require them. We can then do a bake-off to see how the performance compares with our current non-suffix array methods. If these perform better, we can look into whether having more permanent storage for suffix arrays as an additional component in SegStrings on the server is worthwhile

  • Automatically create the suffix array as an additional component of SegStrings on the server
    • Increased memory required to avoid repeated construction of suffix arrays if multiple algorithms requiring them are called on the same SegString
    • Can be done using lazy initialization, only constructing the suffix arrays the first time a function is called.
    • This extra memory is already happening when building SArrays objects from Strings