A suffix tree is a data structure for finding patterns, such as Voldemort, in a text, such as Harry Potter, in time proportional to the length of the pattern [12,1]. The texts we tend to be most interested in as bioinformaticians are DNA or protein sequences, which in many cases surpass the length of Harry Potter by orders of magnitude. Figure shows a suffix tree for the short DNA sequence CGTACGCGGTGGG taken from the 3.2 billion base pairs that make up the human genome. Because the last character in a string is invisible, it is usually visualized by a $ in suffix trees. A suffix is the “end” of a word. For example GGG as well as TGGG are suffices of the string CGTACGCGGTGGG. A suffix tree consists of nodes and edges. There are three kinds of nodes:
Given a suffix tree, we can search for a pattern in the following manner: Start at the root and match the first character of the pattern with the first letter of one of the edges emerging from the root. If no match is found, we are done and know that the pattern does not appear in the entire text. If a match is found, take the second character of the pattern and match it to the second character along the edge identified in the first step. Continue in this way until the entire pattern is found in the tree or a mismatch occurs. If the entire pattern is found, its positions in the text are looked up by visiting the leaves below the last match. The labels of these leaves are the starting positions of the pattern in the text. For example, if we search for the pattern 2#2, we start at the root and are done after two matching steps. The leaves below the last match tell us that CG occurs at positions 5, 7, and 1.
A program for interactively drawing suffix trees can be accessed here.
We are have used suffix trees and their compact cousins suffix arrays to estimate evolutionary distances [8], detect recombination [3], and search for diagnostic markers [2].