Next: Approximations: BLAST Basic Up: Database scanning Previous: Indexing with gaps

Approximations: The FASTP and FASTA algorithm

The early personal computers had insufficient memory and were too slow to carry out a database scan using dynamic programming. Accordingly, Wilbur and Lipman [63] developed a fast procedure for DNA scans that in concept searches for the most significant diagonals in a dot-plot. The initial step in the algorithm is to identify all exact matches of length (-tuples) or greater between the two sequences. Speed is achieved by employing a look up procedure. For example, for proteins, if then there are 8,000 () possible -tuples and each element of an array of length 8,000 is set to represent one of these -tuples. Sequence is scanned once and the location of each -tuple in is recorded in the corresponding element of . Sequence is then scanned and by reference to the location of all -tuple matches common to and may be identified. If two -tuples are present on the same diagonal then the difference between their starting position (offset) is also the same, thus the diagonals with the most significant number of matches may be identified. Since runs of identity are relatively rare even between related proteins, Lipman and Pearson [64] first identified the five diagonals of highest similarity with set to 1, or 2. They then applied Dayhoff's scoring scheme (Section 2.4.1) to the amino acid pairs over these regions. The region giving the highest score for the protein comparison was used to rank order the sequences located in the databank for further study by more rigorous procedures. Pearson and Lipman [65] have refined these ideas in the program FASTA. FASTA saves the 10 highest regions of identity which are then re-scored with the PAM250 matrix (see Section 2.4.1). If there are several initial regions above a pre set cutoff score then those that could form a longer alignment are joined, allowing for gaps and a score initn is calculated by subtracting a penalty for each gap. initn is used to rank the database sequences by similarity. Finally, dynamic programming is used over a narrow region of the high scoring diagonal to produce an alignment with score opt. These steps are illustrated in Figure 9.

FASTA only shows the top scoring region, it does not locate all high scoring alignments between two sequences. As a consequence FASTA may not identify directly repeats or multiple domains that are shared between two proteins. The FASTA software can be obtained by anonymous ftp from virginia.edu and a number of sites offer searching facilities with FASTA (e.g. EMBL).



Next: Approximations: BLAST Basic Up: Database scanning Previous: Indexing with gaps


geoff.barton@ox.ac.uk