When using a word-processor, it is often necessary to search for a
specific word in the text we have written so far. Such a search tends to
take time proportional to the length of the text. This makes
intuitive sense. If we wanted to find all occurrences of the word
Napoleon in Tolstoy's War and Peace, we would need to
scan all the pages of his mighty novel. Or would we? We might be
reading an annotated edition that includes an exhaustive index. In
this case it would only take time proportional to the length of
Napoleon to find all of his appearances in War and
Peace.
Figure 1:
Three-dimensional structure of hemoglobin and, to the
right, a suffix tree
of the string PEEK, which is a subsequence of the beta-globin
amino acid sequence. The table below the protein structure displays
the positions in the string PEEK.
|
A suffix tree is a data structure for finding patterns, such
as Napoleon, in a text, such as War and
Peace, in time proportional to the length of the pattern
[12,1]. The texts
we tend to be most interested in as bioinformaticians are DNA or
protein
sequences, which in many cases surpass the length of War and
Peace by orders of magnitude. Figure 1 shows a suffix
tree for a short fragment of the beta globin amino acid sequence,
PEEK. A suffix is the “end” of a
word. For example EEK as well as
PEEK are suffices of the string
PEEK. The suffix tree itself consists of nodes
and edges. There are three kinds of nodes:
- the rood node: topmost node in tree, has no parent
node; brown node in Figure 1;
- internal nodes: nodes connected to one parent and two or more
child nodes; blue node in Figure 1
- leaf nodes: nodes connected to one parent and no child nodes;
green nodes in Figure 1.
The edges are labeled with substrings of the input string
and the leaf nodes are labeled by numbers. The defining feature of
a suffix tree is that the edge labels read along the path leading
from the root to a leaf yield
the suffix starting at the leaf's label. For example, the path leading
from the root to leaf #3 is labeled by EK, which is the
suffix starting at position three of the input string. Now, a string
represented by a suffix tree can be searched for a pattern in the
following manner: Start at the root and match the first character of the
pattern with the first letter of one of the edges emerging from the
root. If no match is found, we are done and know that the pattern does
not appear in the entire text. If a match is found, take the second
character of the pattern and match it to the second character along
the edge identified in the first step. Continue in this way until the
entire pattern is found in the tree or a mismatch occurs. If the
entire pattern is found, its positions in the text are looked up by
visiting the leaves in the subtree rooted on the node connected to the
edge on which the last letter of the pattern matched. The labels of
the leaves thus uncovered are the starting positions of the pattern in
the text. For example, if we search for the pattern in
the string represented by the suffix tree shown in Figure
1, we start at the root and are done after one matching
step. The leaves of the subtree rooted on the single internal node in
the suffix tree are labeled by 2 and 3, corresponding to the starting
positions of our pattern in the text.
A program for interactively drawing suffix trees can be accessed
here.
We are have used suffix trees and their compact cousins suffix arrays
to estimate evolutionary distances [8], detect
recombination [3], and search for diagnostic markers [2].