BLAST [18] is a heuristic method to find the highest scoring locally optimal alignments between a query sequence and a database. The important simplification that BLAST makes is not to allow gaps, but the algorithm does allow multiple hits to the same sequence. The BLAST algorithm and family of programs rely on work on the statistics of ungapped sequence alignments by Karlin and Altschul [66]. The statistics allow the probability of obtaining an ungapped alignment (MSP - Maximal Segment Pair) with a particular score to be estimated. The BLAST algorithm permits nearly all MSP's above a cutoff to be located efficiently in a database.
The algorithm operates in three steps:
The steps involved in the BLAST algorithm are illustrated in Figure 10.
A low value for reduces the possibility of missing MSPs with the
required
score, however lower
values also increase the size of
the hit list generated in step 2 and hence the execution time and
memory required. In practice, the BLASTP program used for protein
searches sets compromise values of
and
to balance the
processor requirements and sensitivity.
BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found. The program was developed at the NCBI and benefits from strong technical support and continuing refinement. For example, filters have recently been developed to exclude automatically regions of the query sequence that have low compositional complexity, or short periodicity internal repeats. The presence of such sequences can yield extremely large numbers of statistically significant but biologically uninteresting MSPs. For example, searching with a sequence that contains a long section of hydrophobic residues will find many proteins with transmembrane helices.
BLAST runs on virtually all types of computer (the program may be obtained by anonymous ftp from ncbi.nlm.nih.gov). Parallel processing is also supported on multi-processor computers such as the 4 or 8-processor Silicon Graphics POWER series. Searches using BLAST may be conducted by email at NCBI (Send the text HELP to blast@ncbi.nlm.nih.gov for instructions). Recently, specialist hardware has been developed to implement the BLAST algorithm at even higher speed. A network service based on these developments is available from the University of North Carolina at Chapel Hill (Send the text HELP on the subject line to bioscan@cs.unc.edu).