The early personal computers had insufficient memory and were too slow
to carry out a database scan using dynamic programming. Accordingly, Wilbur and
Lipman [63] developed a fast procedure for DNA scans that
in concept searches for the most significant diagonals in a dot-plot.
The initial step in the algorithm is to identify all exact matches of
length (
-tuples) or greater between the two sequences. Speed
is achieved by employing a look up procedure. For example, for
proteins, if
then there are 8,000 (
) possible
-tuples and each element of an array
of length 8,000 is set to
represent one of these
-tuples. Sequence
is scanned once and
the location of each
-tuple in
is recorded in the corresponding
element of
. Sequence
is then scanned and by reference to
the location of all
-tuple matches common to
and
may be
identified. If two
-tuples are present on the same diagonal then
the difference between their starting position (offset) is also the
same, thus the diagonals with the most significant number of matches
may be identified. Since runs of identity are relatively rare even
between related proteins, Lipman and Pearson [64] first
identified the five diagonals of highest similarity with
set to 1,
or 2. They then applied Dayhoff's scoring scheme
(Section 2.4.1) to the amino acid pairs over these regions.
The region giving the highest score for the protein comparison was
used to rank order the sequences located in the databank for further
study by more rigorous procedures. Pearson and Lipman
[65] have refined these ideas in the program FASTA. FASTA
saves the 10 highest regions of identity which are then re-scored with
the PAM250 matrix (see Section 2.4.1). If there are
several initial regions above a pre set cutoff score then those that
could form a longer alignment are joined, allowing for gaps and a
score initn is calculated by subtracting a penalty for each gap.
initn is used to rank the database sequences by similarity.
Finally, dynamic programming is used over a narrow region of the high
scoring diagonal to produce an alignment with score opt. These steps
are illustrated in Figure 9.
FASTA only shows the top scoring region, it does not locate all high scoring alignments between two sequences. As a consequence FASTA may not identify directly repeats or multiple domains that are shared between two proteins. The FASTA software can be obtained by anonymous ftp from virginia.edu and a number of sites offer searching facilities with FASTA (e.g. EMBL).