BLAST(1L) MISC. REFERENCE MANUAL PAGES BLAST(1L)

NAME
blastp, blastn, blastx, tblastn, tblastx - rapid sequence database search programs utilizing the BLAST algorithm

SYNOPSIS
blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [H=#] [V=#] [B=#] [-sort_by...]

blastn ntdb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [ [[M=matchscore][N=mismatchpenalty]] [-matrix scorefile] ] [Y=#] [Z=#] [H=#] [V=#] [B=#] [[-top][-bottom]] [-sort_by...]

blastx aadb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [C=#] [H=#] [V=#] [B=#] [[-top][-bottom]] [-sort_by...]

tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [-dbgcode #] [H=#] [V=#] [B=#] [[-dbtop][-dbbottom]] [-sort_by...]

tblastx ntdb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [C=#] [-dbgcode #] [H=#] [V=#] [B=#] [[-top][-bottom]] [[-dbtop][-dbbottom]] [-sort_by...]

DESCRIPTION
This document describes the BLAST version 1.4 programs. BLAST (Basic Local Alignment Search Tool) is the heuristic search algorithm employed by the programs blastp, blastn, blastx, tblastn, and tblastx; these programs ascribe significance to their findings using the statistical methods of Karlin and Altschul (1990, 1993) with a few enhancements. The BLAST programs were tailored for sequence similarity searching -- for example to identify homologs to a query sequence. The programs are not generally useful for motif-style searching. For a discussion of basic issues in similarity searching of sequence databases, see Altschul et al. (1994).
The five BLAST programs described here perform the following tasks:

blastp compares an amino acid query sequence against a protein sequence database;

blastn compares a nucleotide query sequence against a nucleotide sequence database;

blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands);

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Sun Release 4.1 Last change: 20 October 1994

The fundamental unit of BLAST algorithm output is the High-scoring Segment Pair (HSP). An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score. A set of HSPs is thus defined by two sequences, a scoring system, and a cutoff score; this set may be empty if the cutoff score is sufficiently high. In the programmatic implementations of the BLAST algorithm described here, each HSP consists of a segment from the query sequence and one from a database sequence. The sensitivity and speed of the programs can be adjusted via the standard BLAST algorithm parameters W, T, and X (Altschul et al., 1990); selectivity of the programs can be adjusted via the cutoff score.

A Maximal-scoring Segment Pair (MSP) is defined by two sequences and a scoring system and is the highest-scoring of all possible segment pairs that can be produced from the two sequences.
The statistical methods of Karlin and Altschul (1990, 1993) are applicable to determining the significance of MSP scores in the limit of long sequences, under a random sequence model that assumes independent and identically distributed choices for the residues at each position in the sequences. In the programs described here, Karlin-Altschul statistics have been extrapolated to the task of assessing the significance of HSP scores obtained from comparisons of potentially short, biological sequences.

SEARCH STRATEGY
The approach to similarity searching taken by the BLAST programs is first to look for similar segments (HSPs) between the query sequence and a database sequence, then to evaluate the statistical significance of any matches that were found, and finally to report only those matches that satisfy a user-selectable threshold of significance. Findings of multiple HSPs involving the query sequence and a single database sequence may be treated statistically in a variety of ways. By default the programs use "Sum" statistics (Karlin and Altschul, 1993). As such, the statistical significance ascribed to a set of HSPs may be higher than that ascribed to any individual member of the set. Only when the ascribed significance satisfies the user-selectable threshold (E parameter) will the match be reported to the user.

The task of finding HSPs begins with identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., 1990). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them.
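As a rough illustration of the seeding step, the following Python sketch enumerates every word of length W that scores at least T against a single query word. The four-letter alphabet, the +5/-4 match/mismatch scores, and the function names are illustrative assumptions only; they are not the programs' internal data structures or default settings.

```python
from itertools import product

def word_score(a, b, match=5, mismatch=-4):
    """Score two equal-length words with a toy match/mismatch scheme (assumed values)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(query_word, alphabet, T):
    """Return, sorted, all words of len(query_word) that score >= T against query_word."""
    W = len(query_word)
    return sorted(
        "".join(w)
        for w in product(alphabet, repeat=W)
        if word_score(query_word, "".join(w)) >= T
    )

# Hypothetical example: a 2-letter query word over a nucleotide alphabet.
words = neighborhood("AC", "ACGT", T=1)
print(len(words), words)
```

Raising T in this sketch shrinks the returned list, which mirrors the trade-off between neighborhood size and sensitivity.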
The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached.

SETTING PARAMETERS
Many of the BLAST program parameters have one- or two-letter names and default values that can be modified using a name=value syntax on the command line, e.g., E=0.05 or S2=35. Other command line options are flags that appear alone on the command line (e.g., -span). Still other options are expected to be followed by a new value, separated from the option name by white space, as in -filter seg or -dbrecmax 10500. An alternative parameter-value syntax supported by the programs is illustrated in these examples: filter=seg and dbrecmax=10500.

SELECTIVITY IN REPORTING MATCHES
The parameter E establishes a statistical significance threshold for reporting database sequence matches. E is interpreted as the upper bound on the expected frequency of chance occurrence of an HSP (or set of HSPs) within the context of the entire database search. Any database sequence whose matching satisfies E is subject to being reported in the program output. If the query sequence and database sequences follow the random sequence model of Karlin and Altschul (1990), and if sufficiently sensitive BLAST algorithm parameters are used, then E may be thought of as the number of matches one expects to observe by chance alone during the database search. The default value for E is 10, while the permitted range for this real-valued parameter is 0 < E <= 1000.

The parameter S represents the score at which a single HSP would by itself satisfy the significance threshold E.
Higher scores -- higher values for S -- correspond to increasing statistical significance (lower probability of chance occurrence). Unless S is explicitly set on the command line, its default value is calculated from the value of E. If both S and E are set on the command line, whichever is more restrictive is used. When neither parameter is specified on the command line, the default value for E is used to calculate S.

The values for E and S are interconvertible, given the context of the search, which includes: the length and residue composition of the query sequence; the length of the database; a fixed, hypothetical residue composition for the database; and the scoring system employed. The scoring system used by the BLAST programs consists of a scoring matrix, wherein a score is ascribed to the alignment of each letter (residue) in the alphabet with every other letter in the alphabet as well as to itself.

The significance of an alignment score depends intimately upon the specific scoring matrix employed and the length and residue composition of the query sequence and database, all of which may vary with each search performed. Instead of having the user guess at an appropriate value for the cutoff score S for each search, an intuitive, general way to set thresholds for reporting matches is via the E parameter, which has the direct statistical interpretation mentioned above.
KARLIN-ALTSCHUL STATISTICS
From Karlin and Altschul (1990), the principal equation relating the score of an HSP to its expected frequency of chance occurrence is:

     E = K N exp(-Lambda S)

where E is the expected frequency of chance occurrence of an HSP having score S (or one scoring higher); K and Lambda are Karlin-Altschul parameters; N is the product of the query and database sequence lengths, or the size of the search space; and exp is the exponentiation function.

Lambda may be thought of as the expected increase in reliability of an alignment associated with a unit increase in alignment score. Reliability in this case is expressed in units of information, such as bits or nats, with one nat being equivalent to 1/ln(2) (roughly 1.44) bits.

The expectation E (range 0 to infinity) calculated for an alignment between the query sequence and a database sequence can be extrapolated to an expectation over the entire database search, by converting the pairwise expectation to a probability (range 0-1) and multiplying the result by the ratio of the entire database size (expressed in residues) to the length of the matching database sequence. In detail:

     E_database = (1 - exp(-E)) D / d

where D is the size of the database; d is the length of the matching database sequence; and the quantity (1 - exp(-E)) is the probability, P, corresponding to the expectation E for the pairwise sequence comparison. Note that in the limit of infinite E, P approaches 1; and in the limit as E approaches 0, E and P approach equality. Due to inaccuracy in the statistical methods as they are applied in the BLAST programs, whenever E and P are less than about 0.05, the two values can be practically treated as being equal.
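The two equations above can be evaluated directly. The Python sketch below is only a worked illustration: the function names and the numeric values of K, Lambda, and the sequence lengths are hypothetical, and Lambda is assumed to be expressed in nats per unit score so that it can be used inside exp() unchanged. The inverse function is simply the first equation solved for S.

```python
import math

def expect_pairwise(S, K, lam, N):
    """E = K N exp(-Lambda S) for one query/database-sequence comparison."""
    return K * N * math.exp(-lam * S)

def score_for_expect(E, K, lam, N):
    """Solve E = K N exp(-Lambda S) for S, i.e. S = ln(K N / E) / Lambda."""
    return math.log(K * N / E) / lam

def expect_database(E, D, d):
    """E_database = (1 - exp(-E)) D / d."""
    P = -math.expm1(-E)  # 1 - exp(-E), computed accurately for small E
    return P * D / d

# Hypothetical values, chosen only for illustration.
K, lam = 0.1, 0.3            # Karlin-Altschul parameters (assumed)
N = 250 * 400                # query length times database sequence length
E = expect_pairwise(60, K, lam, N)
print(E, score_for_expect(E, K, lam, N), expect_database(E, D=10_000_000, d=400))
```

For small E the factor (1 - exp(-E)) is nearly E itself, so E_database is then close to E multiplied by D / d.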
In contrast to the random sequence model used by Karlin-Altschul statistics, biological sequences are often short in length -- an HSP may involve a relatively large fraction of the query or database sequence, which reduces the effective size of the 2-dimensional search space defined by the two sequences. To obtain more accurate significance estimates, the BLAST programs compute effective lengths for the query and database sequences that are their real lengths minus the expected length of the HSP, where the expected length for an HSP is computed from its score. In no event is an effective length for the query or database sequence permitted to go below 1. Thus, the effective length of either the query or the database sequence is computed according to the following:

     Length_eff = MAX( Length_real - Lambda S / H , 1)

where H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990), one of the statistics reported by the BLAST programs. H may be thought of as the information expected to be obtained from each pair of aligned residues in a real alignment that distinguishes the alignment from a random one.

HSP SCORE THRESHOLDS
Using the default parameters, many more aligned segment pairs are typically found by the BLAST programs than are ultimately reported. First, only those segment pairs scoring at or above a selectable cutoff score are saved as bona fide HSPs for further consideration of their statistical significance. And second, any HSPs that are found may not satisfy the significance threshold for reporting.

The cutoff score which defines HSPs is parameterized as S2. A value for S2 can be set on the command line, or its value can be set indirectly via the command line parameter E2.
E2 is interpreted as the expected number of HSPs that will be found when comparing two sequences that each have the same length -- either 300 amino acids or 1000 nucleotides, whichever is appropriate for the particular program being used. S2 may be thought of as the score expected for the MSP between two such sequences. The default value for E2 is typically about 0.15 but may vary from version to version of each program. The default value for S2 will be calculated from E2 and, like the relationship between E and S, is dependent on the residue composition of the query sequence and the scoring system employed, as conveyed by the Karlin-Altschul K and Lambda statistics.

SEARCH SENSITIVITY
Sensitivity of the BLAST programs should be considered in two areas. First, there is the question of how well ungapped alignments (HSPs) can capture or represent the similarity between two biological sequences that may have evolved independently and/or contain sequencing errors. Particularly in the presence of insertions/deletions or frameshifts, it may be necessary to increase E2 (or lower S2), in order to detect the remnants of extended similarity. The amount of evidence or information to support the hypothesis that a given alignment is real and not random decreases with each mutation or sequencing error (States et al., 1991; Gish and States, 1993). As a corollary of this, the expected length of a statistically significant HSP increases with each mutation or sequencing error. At some point, accumulated mutations and errors completely obscure the presence of a relationship between two sequences; the BLAST programs' focus on ungapped alignments may cause this point to be reached sooner than for other alignment methods.

The second area where sensitivity may be of concern is in the heuristic nature of the BLAST algorithm for finding HSP alignments.
With this algorithm, and with a scoring scheme properly composed for Karlin-Altschul statistics to apply, the lower the score of an HSP, the higher the probability that the HSP will go undetected. At the user's discretion, the speed of the BLAST algorithm and the programs can be sacrificed in exchange for increased sensitivity of detecting these lower significance HSPs, and vice versa; however, the default parameters for all of the programs except blastn have already been chosen to generally obtain moderate (blastx, tblastn, and tblastx) or high (blastp) sensitivity. If sensitivity is not an issue but speed is, then one should consider adjusting the BLAST algorithm parameters to achieve higher speed (e.g., increase W by one and T by 10-50%).

Raising E2 or lowering S2 can improve the apparent sensitivity of the BLAST programs by permitting them to assess larger sets of HSPs for statistical significance; but lower-scoring HSPs are more difficult to detect, due to the heuristic nature of the BLAST algorithm. Therefore, merely adjusting E2 or S2 may not significantly increase sensitivity -- it may also be necessary to adjust the BLAST algorithm's W, T, and X parameters to increase the true sensitivity of the programs.

If E2 and S2 are adjusted much from their default values to observe even lower-scoring HSPs, search speed may suffer significantly because the computational complexity of the statistical methods is nonlinear in the number of HSPs that are found. For Sum statistics, the complexity is a quadratic function of the number of HSPs; for Poisson statistics, the complexity is even worse, a cubic function. Furthermore, as more HSPs are considered, fuzziness in the HSP consistency rules yields more reports of false positives.
Without varying the scoring scheme employed, the probability that the BLAST algorithm can detect an HSP having any particular score can be increased by: lowering the neighborhood word score threshold, T, while keeping the word size, W, constant; lowering both W and T appropriately (see Altschul et al., 1990); and/or raising the word hit extension drop-off score X (described earlier).

The default value for W is 3 amino acids for blastp, blastx, tblastn, and tblastx, and 11 nucleotides for blastn. For the four programs that perform comparisons of amino acid sequences (blastp, blastx, tblastn, and tblastx), W should usually be restricted to values less than 5, unless the value for T is specified disproportionately larger, to avoid consuming too much memory for the neighborhood word list (see below and Altschul et al., 1990).

X is a positive integer representing the maximum permissible decay of the cumulative segment score during word hit extension. Raising X may decrease the chance that the BLAST algorithm overlooks an HSP, but it may significantly increase the search time, as well. If computation time is of little concern, X might be increased a few points from its default value, but often little or no increase in sensitivity is observed by increasing this parameter from its default value.

For blastp, blastx, tblastn, and tblastx, the default value for X is calculated to be the minimum integral score representing 10 bits of information, or a decay in the statistical significance of the alignment by a factor of 2 to the tenth power (or about 1,000). Since the X parameter is used to terminate extensions independently in both directions, about 1 in 500 alignments are expected to be terminated prematurely that would have attained a higher score had termination not come so soon.
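To make the arithmetic concrete, here is a small Python sketch of one plausible way to turn a number of bits into a minimum integral drop-off score. The formula is an assumption on our part (Lambda taken in nats per unit score, so one unit of score is worth Lambda/ln(2) bits); the manual does not spell out the exact computation, and the function name and example Lambda are hypothetical.

```python
import math

def dropoff_score(bits, lam):
    """Smallest integer score worth at least `bits` bits, assuming Lambda is in nats per unit score."""
    units = bits * math.log(2) / lam
    # The small epsilon guards against floating-point noise pushing an
    # exact integer result just past itself before rounding up.
    return math.ceil(units - 1e-9)

# Hypothetical Lambda of ln(2)/2, i.e. 0.5 bits per unit score.
lam = math.log(2) / 2
x_10_bits = dropoff_score(10, lam)

# Two independent extension directions, each with a 2**-10 chance of
# stopping too early, give roughly 1 premature termination in 512.
premature_fraction = 2 * 2.0 ** -10
print(x_10_bits, 1 / premature_fraction)
```

The second calculation shows where the "about 1 in 500" figure comes from: 2 times 1/1024 is 1/512.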
For blastn, the default value of X is the minimum integral score that represents at least 20 bits of information, or a reduction in the statistical significance of the alignment by a factor of 2 to the twentieth power (or about one million).

THE NEIGHBORHOOD
T is the neighborhood word score threshold for generating all words of length W that yield a score of at least T when aligned with some word of length W from the query sequence. The list of words so generated is called the neighborhood (Altschul et al., 1990). The size of the neighborhood can be increased, thus improving sensitivity, by lowering T. Conversely, raising the value of T decreases the size of the neighborhood and decreases the likelihood of detecting HSPs. Generally, the larger the neighborhood (the lower T is), the slower the programs run, as well.

The default value for the neighborhood word score threshold is calculated at run-time from the residue composition and length of the query sequence and the scoring matrix employed, using an ad hoc equation that is a function of Lambda and H. Occasionally it may be necessary to manually set the neighborhood word score threshold via the command line, for which 13 may be a good value to try, but a good choice is highly dependent on the particular scoring matrix and word length used.

The PAM120 amino acid scoring matrix supplied with the BLAST programs, produced to a scale of ln(2)/2, yields values for Lambda that are expected to be close to 0.5 bits per unit score for query sequences of typical residue compositions. Under these conditions, an increase in an alignment score by 2 units is expected to increase the reliability or informativeness of the alignment by 2 times 0.5 = 1 bit, corresponding to an increase in its statistical significance by a factor of 2.
The supplied PAM250 matrix was produced to a scale of ln(2)/3, suggesting that an increase in alignment score by 3 units will be required to increase statistical significance by a factor of 2. These are rules of thumb for the matrices mentioned. Generally, the significance of an alignment score is indeterminate without specific knowledge of the scoring matrix employed. If one communicates scores in a report, it may be useful to attach the values for the Karlin-Altschul parameters Lambda and K, so that statistical significance can be properly ascribed to the scores.

MORE OPTIONS
Except where noted, all of the BLAST programs accept the following command line options:

-altscore score_specification
This option can be used to alter entire rows, columns, or just individual scores in a scoring matrix. score_specification is a (quoted) character string consisting of three components each separated by at least one space: a letter in the query sequence alphabet (amino acid or nucleotide); a letter in the database sequence alphabet (amino acid or nucleotide); the new pairwise score (integer) to be assigned to the alignment of these two letters. If either character is specified as any, then the altered score will be assigned to the entire row or column in the scoring matrix. If the new score is given as min (max) then the new score assigned will be the minimum (maximum) observed score overall in the matrix; if the new score is given as na, then the alignment of the two characters will not be allowed (effectively an infinite negative score is assigned to the alignment of the two letters). Multiple -altscore options can be specified on the command line, with each one applying to the scoring matrix last specified in a -matrix option, or to the default scoring matrix if no -matrix option has been used.
As an example of this option's use, to assign an alignment score of zero (0) to the presence of a stop codon in either the query sequence or database sequence, these two specifications can be used together: -altscore "* any 0" -altscore "any * 0".

-asn1
This option causes the programs to produce printable, structured output (not for human consumption, but for accurate automated parsing) in conformance with specifications written in the ISO 8824 standard ASN.1 language.

-asn1bin
This option causes the programs to produce binary-encoded, structured output (not for human consumption, but for accurate automated parsing) in conformance with specifications written in the ISO 8824 standard ASN.1 language and encoded according to the rules established by ISO 8825.

-bottom
See the -top option.

-codoninfo codoninfofile
This (blastx version 1.3 only) option is used to specify a file containing codon usage or codon bias information to be used in concert with a traditional scoring matrix to score alignments. The file containing codon usage information must have a .cdi extension on its name, but this extension should be omitted from the codoninfofile argument specified on the command line. Codon usage information should be expressed in units that coincide with the scale of the scoring matrix employed, and the scoring matrix employed must also have a .cdi extension to its name. A few such pairs of scoring matrix and codon usage files are provided in the BLAST software distribution. blastx expects to find the codon usage files in the /usr/ncbi/blast/cdi directory, or the program can be directed to look in another directory by setting the BLASTCDI environment variable. NOTE: this option is presently supported only by the previous version 1.3 of blastx.
-compat1.3
This option is used to invoke behavior from the BLAST version 1.4 programs that is very similar to that of the previous version 1.3 programs. This option affects the -poissonp, -span1, -olfraction 0.5, -ctxfactor, E, and E2 settings.

-consistency
This option turns off both the determination of the number of HSPs that are consistent with each other in a gapped alignment and an adjustment that is made to the Sum and Poisson statistics to account for the consistency.

-dbbottom
See -dbtop.

-dbgcode genetic_code_ID
For the tblastx program, which translates both the query sequence and the database, this option permits the genetic code used to translate the database to be set separately from the genetic code used to translate the query sequence. This option may also be used to set the genetic code used by tblastn to translate the database. See the list of genetic code identifiers later in this document. See also the -gcode option.

-dbrecmax last_record_number
By default the BLAST programs search the entire database. Using the -dbrecmax option, the record number of the last database sequence to search can be specified. See also the -dbrecmin option.

-dbrecmin first_record_number
By default the BLAST programs search the entire database. Using the -dbrecmin option, the record number of the first database sequence to search can be specified. Searching will continue from that point on, until the end of the database is reached or until the sequence is reached whose record number corresponds to that specified in a -dbrecmax option. Record numbers are one-based (i.e., 1 is the first record, 2 is the second record, and so on). Statistics are computed using the complete database length, not the length of the subset selected. See also the -dbrecmax option.
-dbtop
For those programs that translate a nucleotide sequence database (tblastn and tblastx), the -dbtop and -dbbottom options can be specified to restrict the search to a particular strand of each database sequence. The top strand consists of the database sequence as stored in the database; the bottom strand refers to the reverse complement of the database sequence.

-echofilter
This option causes the filtered query sequence to be displayed in the output. Any masked letters are typically indicated with X's (protein) or N's (nucleic acid).

-filter filtermethod
This option activates filtering or masking of segments of the query sequence based on a potentially wide variety of criteria. The usual intent of filtering is to mask regions that are non-specific for protein identification using sequence similarity. For instance, it may be desired to mask acidic or basic segments that would otherwise yield overwhelming amounts of uninteresting, non-specific matches against a wide array of protein families from a comprehensive database search. The BLAST programs have internally-coded knowledge of the specific command line options needed to invoke the SEG and XNU programs as query sequence filters, but these two filter programs are not included in the BLAST software distribution and must be independently installed. All filter programs must reside in the /usr/ncbi/blast/filter directory, or the BLASTFILTER environment variable must be set to point to the directory containing the desired filter programs. The SEG program (Wootton and Federhen, 1993) masks low compositional complexity regions, while XNU (Claverie and States, 1993) masks regions containing short-periodicity internal repeats. The BLAST programs can pipe the filtered output from one program into another.
For instance, XNU+SEG or SEG+XNU can be specified as the filtermethod to have each program filter the query sequence in succession. Note that neither SEG nor XNU is suitable for filtering untranslated nucleotide sequences for use by blastn.

-gapdecayrate rate
This parameter defines the common ratio of the terms in a geometric progression used in normalizing probabilities across all numbers of Poisson events (typically the number of "consistent" HSPs). A Poisson probability for N segments is weighted by the reciprocal of the Nth term in the progression, where the first term has a value of (1-rate), the second term is (1-rate)*rate, the third term is (1-rate)*rate*rate, and so on. The default rate is 0.5, such that the probability assigned to a single HSP is discounted by a factor of 2, the Poisson probability of 2 HSPs is discounted by a factor of 4, for 3 HSPs the discount factor is 8, and so on. The rate essentially defines a penalty imposed on the gap between each HSP, where the default penalty is equivalent to 1 bit of information. The suggestion to normalize Poisson probabilities was made by Phil Green (University of Washington, Seattle, WA).

-gcode genetic_code_ID
This parameter permits the genetic code used in translating nucleotide query sequences to be changed from its default value of the Standard genetic code (sometimes erroneously called the "Universal" genetic code). See the available list of genetic code identifiers below. Note: the C parameter is a synonym for the -gcode parameter.

-gi
When GenInfo gi identifiers are available for the database sequences (in their deflines), this option can be specified to have these identifiers reported in the program output.

-hspmax max_hsps_per_dbseq
This option can be used to limit the number of HSPs reported per database sequence. The default limit is 100, which is ample leeway for most searches.
Notable exceptions are when numerous, significant repetitive regions exist in the query or database sequences, such as the hundreds of copies of human Alu repeats that exist in some longer database sequences.

-matrix matrixfile
This option is used to specify the name of a file containing an alternate or user-defined scoring matrix. Most of the programs will accept only one -matrix option at a time, but blastp currently accepts as many as eight (8) on a single command line, all of which are used simultaneously during the database search for increased sensitivity.

-nwlen length
See -nwstart.

-nwstart start_coord
blastp and blastx support this option and the -nwlen option, for restricting BLAST neighborhood word generation to a specific segment of the query sequence that begins at start_coord and continues for length residues or until the end of the query sequence is reached. HSP alignments may extend outside the region of neighborhood word generation but the alignments can only be initiated by word hits occurring within the region. Through the use of these options, a very long query sequence can be searched piecemeal, using short, overlapping segments each time. The amount of overlap from one neighborhood region to the next need only be the BLAST wordlength W minus 1, in order to be assured of detecting all HSPs; however, to provide greater freedom for statistical interpretation of multiple HSP findings -- e.g., matches against exons -- more extensive overlapping is recommended, with the extent to be chosen based on the expected gene density and length of introns.
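The piecemeal strategy just described can be planned with a short helper. The Python sketch below computes (start_coord, length) pairs whose consecutive windows overlap by W - 1 residues plus an optional extra margin. The function name, its parameters, and the use of one-based coordinates are assumptions made for illustration; check the coordinate convention of your installation before relying on it.

```python
def neighborhood_windows(query_len, window_len, W, extra_overlap=0):
    """Yield (start_coord, length) pairs covering a query of query_len residues.

    Consecutive windows overlap by W - 1 + extra_overlap residues.
    Coordinates are one-based (an assumption, not confirmed by the manual).
    """
    overlap = W - 1 + extra_overlap
    if window_len <= overlap:
        raise ValueError("window_len must exceed the overlap")
    step = window_len - overlap
    start = 1
    while True:
        length = min(window_len, query_len - start + 1)
        yield start, length
        if start + length - 1 >= query_len:
            break
        start += step

# Hypothetical example: a 10-residue query, 5-residue windows, W = 3.
plan = list(neighborhood_windows(10, 5, W=3))
for start, length in plan:
    print(f"-nwstart {start} -nwlen {length}")
```

With the minimal overlap of W - 1, every word of length W in the query falls entirely inside at least one window, which is the condition the text gives for detecting all HSPs.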
-olfraction overlap_fraction
This parameter (with default value of 0.125) allows the user to define the maximum fractional length of an HSP that can overlap another HSP and still have the two HSPs be considered to be consistent with one another, for the purpose of evaluation with Karlin-Altschul Sum statistics or Poisson statistics.

-outblk
This option causes ASN.1 output to be encapsulated in a BLAST0-Outblk structure. For a description of this structure, see the ASN.1 message specifications accompanying the BLAST program source code.

-poissonp
This option causes Poisson statistics, instead of the default Sum statistics, to be used in assessing the statistical significance of multiple HSPs.

-prune
This option causes HSPs that are not involved in achieving statistical significance to be eliminated from the program output. When Sum statistics are used, the pruning is robust; when Poisson statistics are used, some HSPs may be reported that were not involved in achieving statistical significance.

-qoffset offset
This option permits query sequence coordinate numbers to be adjusted by the value of offset, through simple addition. This may be useful when a query sequence must be split into short, overlapping segments in order to complete individual searches within a restrictive time period.

-qres
This option causes the BLAST programs to exit with a non-zero status if the query sequence contains an invalid letter code for the type of query sequence expected (amino acid or nucleic acid).

-qtype
This option causes the BLAST programs to exit with a non-zero status if the query sequence appears to be of the wrong type (either amino acid or nucleic acid) for the particular program invoked.

-span
This option turns off entirely the feature of detecting and discarding spanned HSPs. Voluminous output often results from its use.
Note: this option was previously called -overlap in the BLAST version 1.3 programs.

-span1
This option relaxes the criteria for judging whether an HSP spans another, prior to discarding one of them if spanning is detected. With this option, it suffices that either the query segment or the database segment (or both) spans the corresponding segment in the other HSP, whereas the -span2 option requires that both segments be spanned. The -span1 option may be useful in suppressing reports of HSPs when the query or a database sequence contains internal repeats. Note: this option was previously called -overlap1 in the BLAST version 1.3 programs.

-span2
While examining each database sequence, the programs use a greedy algorithm to discard any HSP they find which is spanned from start to end by a previously found HSP. When this option is invoked (the default), an HSP is deemed to be spanning another when both the query and database segments from the first HSP completely cover the corresponding segments in the other HSP. When an HSP spans another, the higher scoring one is retained and the lower scoring one is discarded; if their scores are equal, the longer, less information-dense HSP is discarded. Note: this option was previously called -overlap2 in the BLAST version 1.3 programs.

-sump
This option (the default) causes Karlin and Altschul (1993) "Sum" statistics to be used in assessing the statistical significance of multiple HSPs. See also -poissonp.

-top
Whenever a nucleotide query sequence is used (blastn, blastx and tblastx), both strands or all 6 reading frames are searched by default. The -top and -bottom options may be used to restrict a search to the specified strand or set of 3 reading frames.
If both -top and -bottom are specified, both strands will be searched. In the case of the tblastx program, which translates both the query and the database, the -top and -bottom options refer to strands in the query sequence only. See -dbtop and -dbbottom.

-warnings
This option turns off the reporting of all WARNING messages.

SORT OPTIONS

The default sort order for reporting database sequences is by increasing probability (P-value). The following sort options are available and may be combined together in the same search:

-sort_by_pvalue
Sort from most statistically significant (lowest P-value) to least statistically significant (highest P-value), the default sort order.

-sort_by_count
Sort from highest to lowest by the number of HSPs found for each database sequence.

-sort_by_highscore
Sort from highest to lowest by the score of the highest scoring HSP for each database sequence.

-sort_by_totalscore
Sort from the highest to the lowest by the sum total score of all HSPs for each database sequence.

SCORING SCHEMES

The default scoring matrix used by blastp, blastx, tblastn, and tblastx is the BLOSUM62 matrix (Henikoff and Henikoff, 1992). The -matrix option can be used to select an alternate scoring matrix file (e.g., one of the PAM matrices described below). In version 1.4, the -matrix option can also be used with blastn to define a scoring matrix, in addition to supporting the traditional M and N parameters of this program.

Several PAM (point accepted mutations per 100 residues) amino acid scoring matrices are provided in the BLAST software distribution, including the PAM40, PAM120, and PAM250.
While the BLOSUM62 matrix is a good general purpose scoring matrix and is the default matrix used by the BLAST programs, if one is restricted to using only PAM scoring matrices, then the PAM120 is recommended for general protein similarity searches (Altschul, 1991). The pam(1) program can be used to produce PAM matrices of any desired iteration from 2 to 511. Each matrix is most sensitive at finding similarities at its particular PAM distance. For more thorough searches, particularly when the mutational distance between potential homologs is unknown and the significance of their similarity may be only marginal, Altschul (1991, 1993) recommends performing at least three searches, one each with the PAM40, PAM120 and PAM250 matrices.

When multiple scoring matrices are used in searches with the same query sequence, additional degrees of freedom for optimizing alignment scores are available, which reduces each score's statistical significance. The reduction may be by a factor that is as large as the number of matrices employed; however, the potential loss of sensitivity from using a suboptimal matrix is typically much greater, suggesting that the use of multiple matrices remains advantageous (Altschul, 1993). Altschul (1993) has shown that, because PAM matrices are related to one another through a common mutational model and set of initial conditions, statistical significance is reduced by a factor of no more than 4.6 (just over 2 bits of information) regardless of how many PAM matrices are employed.

In blastn, the M parameter sets the reward score for a pair of matching residues; the N parameter sets the penalty score for mismatching residues. M and N must be positive and negative integers, respectively. The relative magnitudes of M and N determine the number of nucleic acid PAMs (point accepted mutations per 100 residues) for which they are most sensitive at finding homologs.
Higher ratios of M:N correspond to increasing nucleic acid PAMs (increased divergence). The default values for M and N, respectively 5 and -4, having a ratio of 1.25, correspond to about 47 nucleic acid PAMs, or about 58 amino acid PAMs; an M:N ratio of 1 corresponds to 30 nucleic acid PAMs or 38 amino acid PAMs. At higher than about 40 nucleic acid PAMs, or 50 amino acid PAMs, better sensitivity at detecting similarities between coding regions is expected by performing comparisons at the amino acid level (States et al., 1991), using conceptually translated nucleotide sequences (re: blastx, tblastn, and tblastx).

Independent of the values chosen for M and N, the default wordlength W=11 used by blastn restricts the program to finding sequences that share at least an 11-mer stretch of 100% identity with the query. Under the random sequence model, stretches of 11 consecutive matching residues are unlikely to occur merely by chance even between only moderately diverged homologs. Thus, blastn with its default parameter settings is poorly suited to finding anything but very similar sequences. If better sensitivity is needed, one should use a smaller value for W.

For the blastn program, it may be easy to see how multiplying both M and N by some large number will yield proportionally larger alignment scores with their statistical significance remaining unchanged. This scale-independence of the statistical significance estimates from blastn has its analog in the scoring matrices used by the other BLAST programs: multiplying all elements in a scoring matrix by an arbitrary factor will proportionally alter the alignment scores but will not alter their statistical significance (assuming numerical precision is maintained). From this it should be clear that raw alignment scores are meaningless without specific knowledge of the scoring matrix that was used.
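
The scale-independence described above can be checked numerically. The Python sketch below is an illustration only, not code from the BLAST programs. It assumes the equal 25% A/C/G/T composition, so that a random pair of residues matches with probability 0.25, and it solves the standard Karlin-Altschul condition 0.25*exp(Lambda*M) + 0.75*exp(Lambda*N) = 1 for its positive root by bisection. The function name blastn_lambda is hypothetical.

```python
import math


def blastn_lambda(match, mismatch, iterations=200):
    """Positive root of 0.25*exp(L*M) + 0.75*exp(L*N) = 1 (uniform composition).

    Illustration only; the BLAST programs may compute Lambda differently.
    """
    if match <= 0 or mismatch >= 0:
        raise ValueError("M must be positive and N must be negative")
    if 0.25 * match + 0.75 * mismatch >= 0:
        # No positive root exists unless the expected pair score is negative.
        raise ValueError("expected score per aligned pair must be negative")

    def f(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    # f is negative just above zero and positive for large lam: bracket the root.
    hi = 1.0 / match
    while f(hi) <= 0.0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:
        lo /= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    lam_default = blastn_lambda(5, -4)     # default M and N
    lam_scaled = blastn_lambda(50, -40)    # both multiplied by 10
    raw = 100                              # some raw score under M=5, N=-4
    # Lambda * score is the same in both scales, so significance is unchanged.
    print(lam_default * raw, lam_scaled * (10 * raw))
```

Multiplying M and N by 10 divides Lambda by 10 while multiplying every raw score by 10, so the product Lambda x score, on which the significance estimates depend, does not change.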
SCORING REQUIREMENTS

Regardless of the scoring scheme employed, two stringent criteria must be met in order to be able to calculate the Karlin-Altschul parameters Lambda and K. First, given the residue composition for the query sequence and the residue composition assumed for the database, the alignment score expected for any randomly selected pair of residues (one from the query sequence and one from the database) must be negative. Second, given the sequence residue compositions and the scoring scheme, a positive score must be possible to achieve. For instance, the match reward score of blastn must have a positive value; and given the assumption made by blastn that the 4 nucleotides A, C, G and T are represented at equal 25% frequencies in the database, a wide range of value combinations for M and N are precluded from use -- namely those combinations where the magnitude of the ratio M:N is greater than or equal to 3.

SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE

For the purpose of calculating significance levels, Y is the effective length of the query sequence and Z is the effective length of the database, both measured in residues. The default values for these parameters are the actual lengths of the query sequence and database, respectively. Larger values signify more degrees of freedom for aligning the sequences and reduced statistical significance for an alignment of any given score. To normalize the statistics reported when databases of different lengths are searched, the parameter Z may be set to a constant value for all database searches. Similarly, when querying with sequences of different lengths, the parameter Y can be used to normalize over all searches.

GENETIC CODES

The parameter C can be set to a positive integer to select the genetic code that will be used by blastx and tblastx to translate the query sequence.
The -dbgcode parameter can be used to select an alternate genetic code for translation of the database by the programs tblastn and tblastx. In each case, the default genetic code is the so-called "Standard" or "Universal" genetic code. To obtain a listing of the genetic codes available and their associated numerical identifiers, invoke blastx or tblastx with the command line parameter C=list. Note: the numerical identifiers used here for genetic codes parallel those defined in the NCBI software Toolbox; hence some numerical values will be skipped as genetic codes are updated.

The list of genetic codes available and their associated values for the parameters C and -dbgcode are:

 1  Standard or Universal
 2  Vertebrate Mitochondrial
 3  Yeast Mitochondrial
 4  Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
 5  Invertebrate Mitochondrial
 6  Ciliate Macronuclear
 9  Echinodermate Mitochondrial
10  Alternative Ciliate Macronuclear
11  Eubacterial
12  Alternative Yeast
13  Ascidian Mitochondrial
14  Flatworm Mitochondrial

SUM STATISTICS

Whereas the version 1.3 BLAST programs use Poisson statistics to ascribe significance to multiple HSPs, the version 1.4 programs retain Poisson statistics as an option, but use Karlin and Altschul (1993) "Sum" statistics by default instead. Sum statistics tend to rank database matches in a more intuitive order than Poisson statistics and, in many cases, yield markedly increased sensitivity. The Sum P-value for a set of HSPs is a function of the sum of the information scores of the HSPs (expressed in bits) and the number of HSPs in the set.

POISSON STATISTICS

The occurrence of two or more HSPs involving the query sequence and the same database sequence can be modeled as a Poisson process by specifying the -poissonp option.
An important result of applying Poisson statistics is that an HSP having a low score and high Expect value (low statistical significance) may be ascribed a statistically significant Poisson P-value when the HSP appears in the context of additional match(es) of equal or greater score with the same database sequence. The Poisson P-value for any given HSP is a function of its expected frequency of occurrence and the number of HSPs observed against the same database sequence with scores at least as high. The Poisson P-value for a group of HSP events is the probability that at least as many HSPs would occur by chance alone, each with a score at least as high as the lowest-scoring member of the group. HSPs which appear on opposite strands of a nucleotide query or database sequence are considered to be independent, distinguishable events, and are counted separately.

P-VALUES, ALIGNMENT SCORES, AND INFORMATION

The Expect and P-values reported for HSPs are dependent on several factors including: the scoring system employed, the residue composition of the query sequence, an assumed residue composition for a typical database sequence, the length of the query sequence, and the total length of the database. HSP scores from different program invocations are appropriate for comparison even if the databases searched are of different lengths, as long as the other factors mentioned here do not vary. For example, alignment scores from searches with the default BLOSUM62 matrix should not be directly compared with scores obtained with the PAM120 matrix; and scores produced using two versions of the same PAM matrix, each created to different scales (see above), cannot be meaningfully compared without conversion to the same scale.
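
To make the Poisson P-value described under POISSON STATISTICS concrete, the Python sketch below evaluates the textbook Poisson tail probability: given the Expect value E of the lowest-scoring HSP in a group, it returns the chance of seeing at least n such HSPs. Whether the BLAST programs evaluate exactly this expression is an assumption; the function names expect_to_p and poisson_p are hypothetical.

```python
import math


def expect_to_p(expect):
    """P-value for at least one chance occurrence: 1 - exp(-E)."""
    return -math.expm1(-expect)


def poisson_p(expect_lowest, n_hsps):
    """Probability of at least n_hsps events from a Poisson process whose mean
    is the Expect value of the lowest-scoring HSP in the group.

    ASSUMPTION: this is the standard Poisson tail, used for illustration; it
    loses precision when the result is extremely close to zero.
    """
    term = math.exp(-expect_lowest)      # probability of exactly 0 events
    cumulative = 0.0
    for i in range(n_hsps):              # sum P(0) + P(1) + ... + P(n-1)
        cumulative += term
        term *= expect_lowest / (i + 1)
    return 1.0 - cumulative


if __name__ == "__main__":
    # A single HSP with Expect = 0.5 is unremarkable; three such HSPs against
    # the same database sequence are far less likely to arise by chance.
    for n in (1, 2, 3):
        print(n, poisson_p(0.5, n))
```

The sketch shows why an HSP with a high Expect value can still receive a small Poisson P-value when it is accompanied by other HSPs of equal or greater score.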
Some isolation from the many factors involved in assessing the statistical significance of HSPs can be attained by observing the information content reported (in bits) for the alignments. While the information content of an HSP may change when different scoring systems are used (e.g., with different PAM matrices), the number of bits reported for an HSP will at least be independent of the scale to which the scoring matrix was generated. (In practice, this statement is not quite true, because the alignment scores used by the BLAST programs are integers that lack much precision.) In other words, when conveying the statistical significance of an alignment, the alignment score itself is not useful unless the specific scoring matrix that was employed is also provided, but the informativeness of an alignment is a meaningful statistic that can be used to ascribe statistical significance (a P-value) to the match independently of specific knowledge about the scoring matrix.

GOVERNING OUTPUT

BLAST program output is organized into three independently governed sections: a histogram of the statistical significance of the matches found; one-line descriptions of the database sequences that satisfied the statistical significance threshold (E parameter); and the high-scoring segment pairs themselves. Each section of the output can be selectively suppressed by setting the parameters H, V, and B to 0 (zero).

The H parameter regulates the display of a histogram of the expected frequency of chance occurrence of the database matches found. If H is assigned a non-zero value, a histogram will be displayed. The default value for H is 0 (no histogram displayed).

Parameter V is the maximum number of database sequences for which one-line descriptions will be reported. The default value for V is 500. A bold warning message is displayed at the end of the one-line descriptions section when more than V sequences yield HSPs satisfying the significance threshold.
When V is zero, no one-line descriptions are reported and no warning is given. Negative values for V are undefined and disallowed.

As an example of how V can be used advantageously, if a high value for E is desired to virtually assure in all cases that at least one HSP will be found, selecting a small value for V will ensure that the output will not be overly voluminous; only the most statistically significant matches will be reported.

Parameter B regulates the display of the high-scoring segment pairs (alignments). For positive values, B is the maximum number of database sequences for which high-scoring segment pairs will be reported. This may be much smaller than the actual number of high-scoring segment pairs reported, since any given database sequence may yield several HSPs. The default value for B is 250. Negative values for B are undefined and disallowed.

ENVIRONMENT VARIABLES

The environment variables BLASTDB, BLASTMAT, BLASTFILTER, and BLASTCDI may be set by the user to override the default directories in which the programs look to find database files, scoring matrix files, filtering programs, and codon usage information files, respectively. The default directories are /usr/ncbi/blast/db, /usr/ncbi/blast/matrix, /usr/ncbi/blast/filter, and /usr/ncbi/blast/cdi.

SUPPORT UTILITIES

Databases to be searched by the BLAST programs must first be formatted by the setdb program for protein sequence databases (re: blastp and blastx) or the pressdb program for nucleotide sequence databases (re: blastn and tblastn). The input database files read by setdb and pressdb must be in FASTA/Pearson format. For each input file, three output files are created for searching by the BLAST programs.

Point accepted mutation (PAM) matrices of various generations can be produced automatically with the pam program.
The output can be saved in a file whose name can then be specified with the -matrix option of a blastp, blastx, or tblastn query.

SAMPLE OUTPUT

The BLAST programs all provide information in roughly the same format. First comes (A) an introduction to the program; (B) a histogram of expectations (see above) if one was requested; (C) a series of one-line descriptions of matching database sequences; (D) the actual sequence alignments; and finally the parameters and other statistics gathered during the search. Sample blastp output from comparing pir|A01243|DXCH against the SWISS-PROT database is presented below.

A. Program Introduction

The introductory output provides the program name (BLASTP in this case), the version number (1.4.6MP in this case), the date the program source code last changed substantially (June 13, 1994), the date the program was built (Sept. 22, 1994), and a description of the query sequence and database to be searched. These may all be important pieces of information if a bug is suspected or if reproducibility of results is important.

The "Searching..." indicator shows the progress that the program made in searching the database. A complete database search will yield 50 periods (.), or one period per database sequence, whichever number is smaller. When searching a database consisting of 50 sequences or more, if fewer than 50 periods are displayed and the program aborted for some reason, dividing the number of periods by 0.5 will yield the approximate percentage (0-100%) of the database that was searched before the program died. If the program had difficulty making progress through the database, one or more asterisks (*) may be interspersed between the periods at one-minute intervals.

B. Histogram of Expectations

Shown in the output below is a histogram of the lowest (most significant) Expect values obtained with each database sequence. This information is useful in determining the numbers of database sequences that achieved a particular level of statistical significance. It indicates the number of database matches that would be reportable at various settings for the expectation threshold (E parameter).

C. One-line Summaries

The one-line sequence descriptions and summaries of results are useful for identifying biologically interesting database matches and correlating this interest with the statistical significance estimates. Unless otherwise requested, the database sequences are sorted by increasing P-value (probability). Identifiers for the database sequences appear in the first column; then come brief descriptions of each sequence, which may need to be truncated in order to fit in the available space. The "High Score" column contains the score of the highest-scoring HSP found with each database sequence. The "P(N)" column contains the lowest P-value ascribed to any set of HSPs for each database sequence; and the "N" column displays the number of HSPs in the set which was ascribed the lowest P-value. The P-values are a function of N, as used in Karlin-Altschul "Sum" statistics or Poisson statistics, to treat situations where multiple HSPs are found. It should be noted that the highest-scoring HSP whose score is reported in the "High Score" column is not necessarily a member of the set of HSPs which yields the lowest P-value; the highest-scoring HSP may be excluded from this set on the basis of consistency rules governing the grouping of HSPs (see the -consistency option). Numbers of the form "7.7e-160" are in scientific notation. In this particular example, the number being represented is 7.7 times 10 to the minus 160th power,
which is astronomically close to zero.

D. Alignments

Alignments found with the BLAST algorithm are ungapped. Several statistics are used to describe each HSP: the raw alignment Score; the raw score converted to bits of information by multiplying by Lambda (see the Statistics output) and dividing by the natural logarithm of 2; the number of times one might Expect to see such a match (or a better one) merely by chance; the P-value (probability in the range 0-1) of observing such a match; the number and fraction of total residues in the HSP which are identical; the number and fraction of residues for which the alignment scores have positive values. When Sum statistics have been used to calculate the Expect and P-values, the P-value is qualified with the word "Sum" and the N parameter used in the Sum statistics is provided in parentheses to indicate the number of HSPs in the set; when Poisson statistics have been used to calculate the Expect and P-values, the P-value is qualified with the word "Poisson".

Between the two lines of Query and Subject (database) sequence is a line indicating the specific residues which are identical, as well as those which are non-identical but nevertheless have positive alignment scores defined in the scoring matrix that was used (the BLOSUM62 matrix in this case). Identical letters or residues, when paired with each other, are not highlighted if their alignment score is negative or zero. Examples of this would be an X juxtaposed with an X in two amino acid sequences, or an N juxtaposed with another N in two nucleotide sequences. Such ambiguous residue-residue pairings may be uninformative and thus lend no support to the overall alignment being either real or random; however, the informativeness of these pairings is left up to the user of the BLAST programs to decide, because any values desired can be specified in a scoring matrix of the user's own making.
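
The bit scores printed in the sample output that follows can be reproduced from the raw scores and the reported Lambda of 0.316. The Python sketch below is a worked check, not code from the BLAST programs; the function name raw_score_to_bits is hypothetical. Because the printed Lambda is rounded to three decimals, the largest scores in the sample differ from this calculation in the first decimal place.

```python
import math


def raw_score_to_bits(score, lam):
    """Convert a raw alignment score to bits: Lambda * S / ln(2)."""
    return lam * score / math.log(2)


if __name__ == "__main__":
    lam = 0.316  # Lambda reported for BLOSUM62 in the sample Parameters output
    # Raw scores of the four HSPs reported for sp|P05120|PAI2_HUMAN.
    for score in (176, 165, 144, 61):
        print(score, round(raw_score_to_bits(score, lam), 1))
```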
BLASTP 1.4.6MP [13-Jun-94] [Build 13:58:36 Sep 22 1994]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.
215:403-10.

Query=  pir|A01243|DXCH 232 Gene X protein - Chicken (fragment)
        (232 letters)

Database:  SWISS-PROT Release 29.0
           38,303 sequences; 13,464,008 total letters.
Searching..................................................done

     Observed Numbers of Database Sequences Satisfying
    Various EXPECTation Thresholds (E parameter values)

        Histogram units:      = 31 Sequences     : less than 31 sequences

 EXPECTation Threshold
 (E parameter)
    |
    V   Observed Counts-->
  10000 4863 1861 |============================================================
   6310 3002  782 |=========================
   3980 2220  812 |==========================
   2510 1408  303 |=========
   1580 1105  393 |============
   1000  712  179 |=====
    631  533  161 |=====
    398  372   80 |==
    251  292   73 |==
    158  219   50 |=
    100  169   32 |=
   63.1  137   18 |:
   39.8  119    9 |:
   25.1  110    6 |:
   15.8  104    9 |:
 >>>>>>>>>>>>>>>>>>>>>  Expect = 10.0, Observed = 95  <<<<<<<<<<<<<<<<<
   10.0   95    4 |:
   6.31   91    3 |:
   3.98   88    1 |:
   2.51   87    3 |:
   1.58   84    0 |
   1.00   84    2 |:

                                                                    Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED) (...  1191  7.7e-160  1
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED).       949  7.0e-127  1
sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN).                  645  3.4e-100  2
sp|P19104|OVAL_COTJA OVALBUMIN.                                626  1.2e-96   2
sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI).       216  3.7e-71   3
sp|P80229|ILEU_PIG   LEUKOCYTE ELASTASE INHIBITOR (LEI) (...   325  4.0e-71   2
sp|P29508|SCCA_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN (SCC...   439  3.5e-70   2
sp|P30740|ILEU_HUMAN LEUKOCYTE ELASTASE INHIBITOR (LEI) (...   211  1.3e-66   3
sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, P...   176  1.8e-65   4
sp|P35237|PTI_HUMAN  PLACENTAL THROMBIN INHIBITOR.             473  1.3e-61   1
sp|P29524|PAI2_RAT   PLASMINOGEN ACTIVATOR INHIBITOR-2, T...   183  9.4e-61   4
sp|P12388|PAI2_MOUSE PLASMINOGEN ACTIVATOR INHIBITOR-2, M...   179  1.8e-60   4
sp|P36952|MASP_HUMAN MASPIN PRECURSOR.                         198  2.6e-58   4
sp|P32261|ANT3_MOUSE ANTITHROMBIN-III PRECURSOR (ATIII).       142  4.0e-48   5
sp|P01008|ANT3_HUMAN ANTITHROMBIN-III PRECURSOR (ATIII).       122  7.5e-48   5

WARNING:  Descriptions of 80 database sequences were not reported due to the
          limiting value of parameter V = 15.

... alignments with the top 8 database sequences deleted ...

>sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (PAI-2)
            (MONOCYTE ARG-SERPIN).
            Length = 415

 Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
 Identities = 38/89 (42%), Positives = 50/89 (56%)

Query:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60
             +I +LL   S D DT +VLVNA+YFKG WKT F  +     PF V   +  PVQMM +
Sbjct:   180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSAQRTPVQMMYLRE 239

Query:    61 SFNVATLPAEKMKILELPFASGDLSMLVL 89
               N+  +   K +ILELP+A      L+L
Sbjct:   240 KLNIGYIEDLKAQILELPYAGDVSMFLLL 268

 Score = 165 (75.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
 Identities = 33/78 (42%), Positives = 47/78 (60%)

Query:   155 ANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFL 214
             AN +G+S    L +S+  H A ++++E+G E A  TG +   +      QF ADHPFLFL
Sbjct:   338 ANFSGMSERNDLFLSEVFHQAMVDVNEEGTEAAAGTGGVMTGRTGHGGPQFVADHPFLFL 397

Query:   215 IKHNPTNTIVYFGRYWSP 232
             I H  T  I++FGR+ SP
Sbjct:   398 IMHKITKCILFFGRFCSP 415

 Score = 144 (65.6 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
 Identities = 26/62 (41%), Positives = 41/62 (66%)

Query:    90 LPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTD 149
             + D  + LE +E  I ++KL +WT+ + M +  V+VY+PQ K+EE Y L S+L ++GM D
Sbjct:   272 IADVSTGLELLESEITYDKLNKWTSKDKMAEDEVEVYIPQFKLEEHYELRSILRSMGMED 331

Query:   150 LF 151
              F
Sbjct:   332 AF 333

 Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
 Identities = 10/17 (58%), Positives = 16/17 (94%)

Query:    81 SGDLSMLVLLPDEVSDL 97
             +GD+SM +LLPDE++D+
Sbjct:   259 AGDVSMFLLLPDEIADV 275

WARNING:  HSPs involving 86 database sequences were not reported due to the
          limiting value of parameter B = 9.

Parameters:
  V=15
  B=9
  H=1

  -ctxfactor=1.00
  E=10

  Query                        -----  As Used  -----    -----  Computed  ----
  Frame  MatID Matrix name     Lambda    K       H      Lambda    K       H
   +0      0   BLOSUM62        0.316   0.132   0.370    same    same    same

  Query
  Frame  MatID  Length  Eff.Length     E    S  W   T   X   E2    S2
   +0      0      232      232        10.  57  3  11  22  0.22   33

Statistics:
  Query          Expected          Observed           HSPs        HSPs
  Frame  MatID  High Score        High Score       Reportable   Reported
   +0      0    62 (28.2 bits)  1191 (542.5 bits)      330          24

  Query         Neighborhd    Word     Excluded    Failed    Successful  Overlaps
  Frame  MatID    Words       Hits       Hits    Extensions  Extensions  Excluded
   +0      0       4988     5661199    1146395     4504598      10187        13

  Database:  SWISS-PROT Release 29.0
    Release date:  June 1994
    Posted date:  1:29 PM EDT Jul 28, 1994
  # of letters in database:  13,464,008
  # of sequences in database:  38,303
  # of database sequences satisfying E:  95
  No. of states in DFA:  561 (55 KB)
  Total size of DFA:  110 KB (128 KB)
  Time to generate neighborhood:  0.03u 0.01s 0.04t  Real: 00:00:00
  No. of processors used:  8
  Time to search database:  32.27u 0.78s 33.05t  Real: 00:00:04
  Total cpu time:  32.33u 0.91s 33.24t  Real: 00:00:05

WARNINGS ISSUED:  2

BUGS

The statistics are not fully worked out yet for blastp when multiple -matrix options are specified in a single command.

blastn by default uses a large value of 11 for the wordlength, W, which severely reduces the program's sensitivity but provides for high speed searches. Consequently, the program with its default parameter values is well suited to finding nearly identical sequences rapidly, but poorly suited to finding moderately- or distantly-related sequences. The value for W may be reduced to increase the sensitivity (at the expense of speed), but to identify weak similarity between coding regions, greater sensitivity is obtained by comparing translation products (States et al., 1991); one should use blastx, tblastn, or tblastx.

blastn is poorly suited to characterizing PCR primers.

In the protein-comparing programs blastp, blastx, tblastn, and tblastx, ad hoc equations are used to calculate a default value for the neighborhood word score threshold T when the word length W has a value of 3 (the default) or 4. Equations have not been implemented for calculating a default value of T when W has any value other than 3 or 4.

When nucleotide sequence databases are compressed into searchable form by the pressdb program, IUPAC ambiguity letters are replaced by an appropriate random selection from the list A, C, G and T. For example, an R (purine) would be replaced on the average half of the time by an A (adenosine) and the remainder of the time by a G (guanosine). If the original database in FASTA format is not available to blastn and tblastn at the time of the search, then the original locations and identities of the ambiguity codes can not be determined by these programs and the alignments and alignment scores may be in error with respect to the original sequences.

tblastn and tblastx use only one genetic code to translate the entire nucleotide sequence database, although the code that is used is selectable via the -dbgcode option.

blastn, blastx, tblastn, and tblastx treat U and T residues in nucleotide sequences as being the same residue (i.e., they match perfectly or translate in exactly the same manner).
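
The replacement of ambiguity letters described above for pressdb, together with the treatment of U as T, can be pictured with the short Python sketch below. It is an illustration of the idea only and does not reproduce the pressdb source code; the table name IUPAC_CHOICES and the function resolve_ambiguity are hypothetical, and a uniform random choice among the permitted bases is assumed.

```python
import random

# Standard IUPAC nucleotide ambiguity codes and the bases each one permits.
IUPAC_CHOICES = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_ambiguity(sequence, rng=None):
    """Return a copy of sequence with every ambiguity letter replaced by one
    randomly chosen permitted base, and U rewritten as T.

    ASSUMPTION: each permitted base is equally likely; letters outside the
    table (A, C, G, T and the gap character) pass through unchanged.
    """
    rng = rng or random.Random()
    resolved = []
    for base in sequence.upper():
        if base == "U":
            base = "T"
        if base in IUPAC_CHOICES:
            base = rng.choice(IUPAC_CHOICES[base])
        resolved.append(base)
    return "".join(resolved)


if __name__ == "__main__":
    # The R becomes A or G, the N becomes any of A, C, G, T.
    print(resolve_ambiguity("ACGURYN", random.Random(0)))
```

Once the substitution has been made, the original ambiguity letters cannot be recovered from the resolved sequence alone, which is why the original FASTA file is needed at search time.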
The amino acid alphabet used by the BLAST programs consists of the IUB and IUPAC amino acid codes (ABCDEFGHIKLMNPQRSTVWXYZ), plus asterisk (*) and hyphen (-). An asterisk signifies a stop codon; and a hyphen signifies a gap of indeterminate length through which BLAST alignments are never permitted to extend. Any letter which is not a member of this alphabet will be stripped from an amino acid query sequence on input and will not contribute to the query sequence coordinate numbers displayed in program output. In protein sequence databases that are processed into searchable form by the setdb program, any non-alphabetic letters are also stripped.

The nucleotide alphabet used by the BLAST programs consists of the IUB and IUPAC nucleotide codes (ACGTRYMKWSBDHVNU), plus hyphen (-) to signify a gap of indeterminate length. U (uracil) is treated like a T (thymidine). When non-alphabetical codes appear in the FASTA-format input database to the pressdb program, the program complains about their appearance and then halts with a non-zero exit status.

Unlike its version 1.3 predecessor, blastn version 1.4 can employ a concept of partial matching, such as might be used when two Rs (purines) are aligned with each other. When the blastn scoring system is defined using the M and N parameters, the scoring matrix constructed by the program accounts for partial matching of nucleotide ambiguity codes. If the -matrix option is used instead, the user has complete freedom to decide how to score alignments involving ambiguity codes.

When calculating the Sum and Poisson statistics, some HSPs may be inconsistent or incompatible with one another in the same gapped alignment, and yet the programs will count them as independent, consistent events, leading to false positives being reported in the output. See the -olfraction option.
(However, HSPs appearing on opposite strands of the query or database sequence, or in reading frames on opposite strands, are considered separately in all cases.)

The nucleotide composition of a blastn query sequence is irrelevant to the values reported for the Karlin-Altschul Lambda and K parameters. This is due to the equi-probable 0.25/0.25/0.25/0.25 A/C/G/T residue distribution assumed by blastn for the database sequences. The values of the Karlin-Altschul parameters are still affected by the scoring system employed (defined by the parameters M and N, or the -matrix option).

On multiprocessor platforms, blastn restricts itself by default to using 3 processors maximum, due to the relatively high initialization cost per processor when the database contains long sequences, as compared to the brief CPU time required for searches that use the default wordlength of 11. More than 3 processors can be recruited using the P command line option.

SEE ALSO
blast3(1).

COPYRIGHT
This work is in the public domain.

REFERENCES
Altschul, Stephen F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-65.

Altschul, S. F. (1993). A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 36:290-300.

Altschul, S. F., M. S. Boguski, W. Gish and J. C. Wootton (1994). Issues in searching molecular sequence databases. Nature Genetics 6:119-129.

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Claverie, J.-M. and D. J. States (1993).
Information enhancement methods for large scale sequence analysis. Computers in Chemistry 17:191-201.

Gish, W. and D. J. States (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-72.

Henikoff, Steven and Jorja G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-19.

Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

States, D. J. and W. Gish (1994). Combined use of sequence similarity and codon bias for coding region identification. J. Comput. Biol. 1:39-50.

States, D. J., W. Gish and S. F. Altschul (1991). Improved sensitivity of nucleic acid database similarity searches using application specific scoring matrices. Methods: A companion to Methods in Enzymology 3:66-70.

Wootton, J. C. and S. Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry 17:149-163.