Suffix tree

Image:Suffix tree BANANA.png
Suffix tree for the string "BANANA$". Suffix links drawn dashed.

A suffix tree for an string S of n characters is a Patricia trie containing all n suffixes of S. It was first described by Weiner in 1973. This data structure that can built in Θ(n) time, and gives you all occ occurrences of a pattern P of length m in O(m + occ) time, which is optimal. You can also extract a lot of information about your text data with a suffix tree. It was one of the first linear-time solutions for the longest common substring problem.

Contents

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the alphabet is viewed as a constant. Otherwise, the construction time depends on the implementation. Assume that a suffix tree has been built for the string S, or that a generalised suffix tree has been built for the set of strings D={S1, S2, ..., Sd} of total length n = |S1| + |S2| + ... + |Sd|. Assume that the size of the alphabet Σ is σ.

You can:

  • Check if a string P of length m is a substring of any of the strings in D in O(m) time.
  • Find all occ occurrences of P in D in O(m + occ) time.
  • Search for a regular expression P in time expected sublinear on D.
  • Search for a pattern P in D with x mismatches in O(m σ x + occ) time.
  • Find the longest common substrings of all strings in D in O(n) time.
  • Find the most frequently occurring substrings of a minimum length in O(n) time.
  • Find the shortest strings from Σ that do not occur in D, in O(n) time. (There might be O(n σ) such strings, so it takes O(n σ) time to report them explicitly.

Uses

Suffix trees are mainly used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in data compression, where it is used to find repeated data. Variants of the LZW compression schemes use it (LZSS).

Description

Before a string is indexed with a suffix tree, it is usually padded with a unique terminator, to make sure no suffix is a substring of another. The suffix tree for the string S is defined as the tree which has the property that edges spell non-empty strings, the paths from the root to the leaves have a one-to-one relationship with the suffixes of S, and all internal nodes have at least two children. There are n+1 suffixes of the padded string, and hence n+1 leaf nodes. Since all internal nodes are branching, there can be at most 2n+1 nodes. If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of the edges in the tree is O(n2), but each edge can be stored as the position and length of a substring of S.

The main choice when making a suffix tree implementation is the parent--child relationships between nodes. The most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in the child list it is a part of. Suffix tree construction time is O(n σ), and search time is O(m σ + occ). Hash maps and arrays may also be used, giving different running time properties.

Suffix links are a key feature for linear-time construction of the tree, since they allow changes to propagate to all suffixes quickly. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string ANAN, it has a suffix link to the internal node representing NAN. These suffix links are used both in construction and in some algorithms running on the tree.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming some twenty times the memory size of the source text in common implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

References

  • E.M. McCreight. (1976). A space-economical suffix tree construction algorithm. Journal of the ACM 23 262-272.
  • E. Ukkonen. (1995). On-line construction of suffix trees. Algorithmica 14(3):249-260. PDF

External links


de:Suffixbaum