String Searching

When using a word-processor, it is often necessary to search for a word or a phrase in the text we have written so far. Such a search usually takes time proportional to the length of the text. This makes intuitive sense. If we wanted to find all occurrences of, say, Voldemort Tolstoy's Harry Potter, we'd need to scan all the seven volumes. Or would we? We might be reading an annotated edition that includes an exhaustive index. In that case it would only take time proportional to the length of Voldemort to find all of his appearances in Harry Potter. CONT
Figure: Suffix tree of CGTACGCGGTGGG.
1#1

A suffix tree is a data structure for finding patterns, such as Voldemort, in a text, such as Harry Potter, in time proportional to the length of the pattern [12,1]. The texts we tend to be most interested in as bioinformaticians are DNA or protein sequences, which in many cases surpass the length of Harry Potter by orders of magnitude. Figure [*] shows a suffix tree for the short DNA sequence CGTACGCGGTGGG taken from the 3.2 billion base pairs that make up the human genome. Because the last character in a string is invisible, it is usually visualized by a $ in suffix trees. A suffix is the “end” of a word. For example GGG as well as TGGG are suffices of the string CGTACGCGGTGGG. A suffix tree consists of nodes and edges. There are three kinds of nodes:

  1. the rood node at the top;
  2. internal nodes have one parent and two or more children;
  3. leaf nodes have one parent and no children.
The edges of a suffix tree are labeled with substrings of the input string and the leaf nodes are labeled by numbers. When read along the path leading from the root to a leaf, the edge labels yield the suffix starting at the leaf's label. For example, the path leading from the root to leaf #11 is labeled by GGG, which is the suffix starting at position 11 of the input string.

Given a suffix tree, we can search for a pattern in the following manner: Start at the root and match the first character of the pattern with the first letter of one of the edges emerging from the root. If no match is found, we are done and know that the pattern does not appear in the entire text. If a match is found, take the second character of the pattern and match it to the second character along the edge identified in the first step. Continue in this way until the entire pattern is found in the tree or a mismatch occurs. If the entire pattern is found, its positions in the text are looked up by visiting the leaves below the last match. The labels of these leaves are the starting positions of the pattern in the text. For example, if we search for the pattern 2#2, we start at the root and are done after two matching steps. The leaves below the last match tell us that CG occurs at positions 5, 7, and 1.

A program for interactively drawing suffix trees can be accessed here.

We are have used suffix trees and their compact cousins suffix arrays to estimate evolutionary distances [8], detect recombination [3], and search for diagnostic markers [2].