Finding the optimal alignment of two sequences

The optimal alignment of two protein sequences is the alignment that maximises the sum of pair-scores less any penalty for introduced gaps. The problem is to find an efficient way of locating an alignment that satisfies these conditions. If no gaps are allowed, then the task is simple, just slide one sequence over the other and for each position, sum the pair-scores from the chosen matrix (e.g. BLOSUM62). This simple alignment algorithm requires of the order of 200 alternative alignments to be evaluated for two sequences of length 100. In contrast, if gaps are allowed, then there are >10⁷⁵ alternative alignments that must be considered! An elegant solution to this intractable search problem is given by a class of computer algorithm known as dynamic programming. Dynamic programming allows the optimal alignment of two sequences to be found in of the order of mnsteps, where m and n are the lengths of the sequences. Dynamic programming for sequence comparison was independetly invented in several fields, many of which are discussed in Sankoff and Kruskal's book [Sankoff & Kruskal, 1983]. An introduction to dynamic programming in the wider context of string comparison can be found in Gusfield (1997) . Needleman and Wunsch (1970) are often attributed as the first application of dynamic programming in molecular biology, while slightly different formulations of the same algorithm were described by Sellers [Sellers, 1974] and Waterman et al [Waterman et al., 1976]. Here, the basic Sellers [Sellers, 1974] dynamic programming algorithm for two sequences is described, with a length-dependent gap-penalty. A simple example of this algorithm is illustrated and explained in Figure 1.

At each aligned position between two sequences $A = (A_{1}, A_{2},\ldots A_{m})$ , and $B = (B_{1}, B_{2}, \ldots B_{n})$ of length m and n, there are three possible events:

substitution of A_i by B_j (scores w_{A_i, B_j}),

deletion of B_j (scores $w_{\Delta,B_{j}}$ ),

and deletion of A_i (scores $w_{A_{i},\Delta}$ ).

where $\Delta$ is the symbol for a single residue gap. The substitution score ( w_{A_i, B_j}) is taken from the amino-acid pair-score matrix such as BLOSUM62 [Henikoff & Henikoff, 1992]. The two deletion scores are the penalty for aligning a single residue with a single gap.

The total score M for the alignment of A with B can be represented as $s(A_{1 \ldots m},B_{1 \ldots n})$ . This is found by working along each protein sequence from the N to C terminus, successively finding the best score for aligning $A_{1 \ldots i}$ with $B_{1 \ldots j}$ for all i, j where $1 \leq i \leq m$ and $1 \leq j \leq n$ . The values of $s(A_{1 \ldots i},B_{1 \ldots j})$ are stored in a matrix H where each element of H is calculated as follows:

$\begin{displaymath} H_{i,j} = \max \left\{ \begin{array}{c} H_{i-1,j-1} + w_{A... ...j}}\\ H_{i-1,j} + w_{A_{i},\Delta}\\ \end{array} \right\} \end{displaymath}$

(1)

Once the matrix is complete, the element H_m,n contains the total score for the alignment of the complete sequences.

The processing steps described above determine the best score possible for the alignment of the two sequences given the gap-penalty and pair-score matrix. However, they do not give the alignment. In order to generate an alignment of A and B that gives the best score, it is necessary to trace back through H. An efficient way of completing the trace-back is to save an array of pointers that records which of the three possibilities was the maximum at each H_i,j. It is possible for there to be ties for the maximum chosen at any H_i,j. Ties represent equally valid alternative alignments and the manner in which ties are resolved is dependent on the program implementation. For this reason, two computer programs that correctly claim to implement the same dynamic programming algorithm and give the same total score for the alignment of two sequences, may produce subtly different alignments.

Calculation of the best score is more complex if gap-penalties of the form ul + v are employed since it is necessary to determine for each cell whether a gap is starting at that cell, or continuing from an earlier cell. This algorithm typically runs a factor of 3 slower than the simple algorithm.