String Searching

When using a word-processor, it is often necessary to search for a specific word in the text we have written so far. Such a search tends to take time proportional to the length of the text. This makes intuitive sense. If we wanted to find all occurrences of the word Napoleon in Tolstoy's War and Peace, we would need to scan all the pages of his mighty novel. Or would we? We might be reading an annotated edition that includes an exhaustive index. In this case it would only take time proportional to the length of Napoleon to find all of his appearances in War and Peace.
Figure: Three-dimensional structure of hemoglobin and, to the right, a suffix tree of the string PEEK, which is a subsequence of the beta-globin amino acid sequence. The table below the protein structure displays the positions in the string PEEK.
45
\begin{tabular}{\vert c\vert c\vert c\vert c\vert}
\hline
1 & 2 & 3 & 4 \hli...
...thtt{E}$& $\mathtt{K}$\\
\hline
\end{tabular} \vspace{0.5cm}\hspace{0.7cm}

A suffix tree is a data structure for finding Patterns, such as Napoleon, in a text, such as War and Peace, in time proportional to the length of the pattern [10,1]. The texts we tend to be most interested in as bioinformaticians are DNA or protein sequences, which in many cases surpass the length of War and Peace by orders of magnitude. Figure [*] shows a suffix tree for a short fragment of the beta globin amino acid sequence, PEEK. A suffix is the ``end'' of a word. For example EEK as well as PEEK are suffices of the string PEEK. The suffix tree itself consists of nodes and edges. There are three kinds of nodes:

  1. the rood node: topmost node in tree, has no parent node; brown node in Figure [*];
  2. internal nodes: nodes connected to one parent and two or more child nodes; blue node in Figure [*]
  3. leaf nodes: nodes connected to one parent and no child nodes; green nodes in Figure [*].
The edges are labeled with substrings of the input string and the leaf nodes are labeled by numbers. The defining feature of a suffix tree is that the edge labels read along the path leading from the root to a leaf yield the suffix starting at the leaf's label. For example, the path leading from the root to leaf #3 is labeled by EK, which is the suffix starting at position three of the input string. Now, a string represented by a suffix tree can be searched for a pattern in the following manner: Start at the root and match the first character of the pattern with the first letter of one of the edges emerging from the root. If no match is found, we are done and know that the pattern does not appear in the entire text. If a match is found, take the second character of the pattern and match it to the second character along the edge identified in the first step. Continue in this way until the entire pattern is found in the tree or a mismatch occurs. If the entire pattern is found, its positions in the text are looked up by visiting the leaves in the subtree rooted on the node connected to the edge on which the last letter of the pattern matched. The labels of the leaves thus uncovered are the starting positions of the pattern in the text. For example, if we search for the pattern $\mathtt{E}$ in the string represented by the suffix tree shown in Figure [*], we start at the root and are done after one matching step. The leaves of the subtree rooted on the single internal node in the suffix tree are labeled by 2 and 3, corresponding to the starting positions of our pattern in the text.

A program for interactively drawing suffix trees can be accessed here.

We are currently using suffix trees in comparative genome analysis [3,5,,].

Bernhard Haubold 2016-04-14