Next: Guidelines for Database Up: Database scanning Previous: Approximations: The FASTP

Approximations: BLAST Basic Local Alignment Search Tool

BLAST [18] is a heuristic method to find the highest scoring locally optimal alignments between a query sequence and a database. The important simplification that BLAST makes is not to allow gaps, but the algorithm does allow multiple hits to the same sequence. The BLAST algorithm and family of programs rely on work on the statistics of ungapped sequence alignments by Karlin and Altschul [66]. The statistics allow the probability of obtaining an ungapped alignment (MSP - Maximal Segment Pair) with a particular score to be estimated. The BLAST algorithm permits nearly all MSP's above a cutoff to be located efficiently in a database.

The algorithm operates in three steps:

  1. For a given word length (usually 3 for proteins) and score matrix (see Section 2) a list of all words (-mers) that can can score when compared to -mers from the query is created.

  2. The database is searched using the list of -mers to to find the corresponding -mers in the database (hits).

  3. Each hit is extended to determine if an MSP that includes the -mer scores , the preset threshold score for an MSP. Since pair score matrices typically include negative values, extension of the initial -mer hit may increase or decrease the score. Accordingly, a parameter defines how great an extension will be tried in an attempt to raise the score above .

    The steps involved in the BLAST algorithm are illustrated in Figure 10.

A low value for reduces the possibility of missing MSPs with the required score, however lower values also increase the size of the hit list generated in step 2 and hence the execution time and memory required. In practice, the BLASTP program used for protein searches sets compromise values of and to balance the processor requirements and sensitivity.

BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found. The program was developed at the NCBI and benefits from strong technical support and continuing refinement. For example, filters have recently been developed to exclude automatically regions of the query sequence that have low compositional complexity, or short periodicity internal repeats. The presence of such sequences can yield extremely large numbers of statistically significant but biologically uninteresting MSPs. For example, searching with a sequence that contains a long section of hydrophobic residues will find many proteins with transmembrane helices.

BLAST runs on virtually all types of computer (the program may be obtained by anonymous ftp from ncbi.nlm.nih.gov). Parallel processing is also supported on multi-processor computers such as the 4 or 8-processor Silicon Graphics POWER series. Searches using BLAST may be conducted by email at NCBI (Send the text HELP to blast@ncbi.nlm.nih.gov for instructions). Recently, specialist hardware has been developed to implement the BLAST algorithm at even higher speed. A network service based on these developments is available from the University of North Carolina at Chapel Hill (Send the text HELP on the subject line to bioscan@cs.unc.edu).



Next: Guidelines for Database Up: Database scanning Previous: Approximations: The FASTP


geoff.barton@ox.ac.uk