The scoring scheme, substitution or pair-score matrix



The scoring scheme is a $20 \times 20$ matrix of numbers that defines the value for aligning each of the possible amino acid pairs. The term substitution is often used for the alignment of two amino acid residues, since scoring schemes are frequently derived from a model of evolution that considers two protein sequences to be related via a series of point mutations. The pair-score matrix is usually symmetrical, since Ala aligned with Gly has the same meaning as Gly aligned with Ala. The simplest scoring scheme is the identity matrix, which scores 1 for an exact match of two amino acids and 0 for a mismatch. Although the identity matrix is appealing in its simplicity, it does not adequately reflect the similarities observed between proteins that have similar three-dimensional structures. More sophisticated schemes take conservative substitutions into account. For example, Val aligned with Leu might score +4, but Glu aligned with Leu, -3.

Until recently, matrices referred to as PAM or Dayhoff were the most widely used. PAM matrices were derived by first aligning a small number of families of protein sequences by eye, then counting the observed amino acid substitutions within the families, normalising the counts, and finally extrapolating the observed substitutions to those expected at different evolutionary distances [Dayhoff et al., 1978]. The measure of evolutionary distance used was the Percentage of Accepted Mutations, or PAM, where 1 PAM corresponds to one accepted point mutation per 100 residues. The most commonly applied matrix was that at 250 PAMs, normally known as PAM250.
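As a minimal sketch of how a pair-score matrix is used, the Python fragment below builds the identity matrix and a toy variant that adopts the two illustrative scores quoted above (Val/Leu +4, Glu/Leu -3), then sums the pair scores over an ungapped alignment. The function names and the toy matrix are assumptions made for illustration; the toy matrix is not a real PAM or BLOSUM table, and every entry other than the two quoted pairs is simply the identity score.

```python
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix():
    """Return a 20 x 20 pair-score table: 1 for a match, 0 for a mismatch."""
    return {(a, b): int(a == b) for a, b in product(AMINO_ACIDS, repeat=2)}


def with_overrides(matrix, overrides):
    """Copy a matrix and set each listed pair score symmetrically."""
    result = dict(matrix)
    for (a, b), score in overrides.items():
        result[(a, b)] = score
        result[(b, a)] = score  # keep the matrix symmetrical
    return result


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


identity = identity_matrix()
# Hypothetical toy matrix: only the two scores quoted in the text are changed.
toy = with_overrides(identity, {("V", "L"): 4, ("E", "L"): -3})

print(score_ungapped("AVE", "ALL", identity))
print(score_ungapped("AVE", "ALL", toy))
```

Under the identity matrix only the Ala/Ala column contributes to the score, whereas the toy matrix rewards the conservative Val/Leu pairing and penalises Glu/Leu, which is the distinction the more sophisticated schemes are designed to capture.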

In spite of its small training-set size, the PAM250 matrix captures the principal physico-chemical properties of the amino acids [Taylor, 1986a]. Furthermore, updates to the PAM matrix obtained from much larger data sets, for example the PET92 matrix [Jones et al., 1992], differ little from PAM250, except for substitutions involving the less common amino acids such as tryptophan.

The BLOSUM series of matrices [Henikoff & Henikoff, 1992] is also derived from an analysis of observed substitutions in protein families. Unlike PAM matrices, the starting point for BLOSUM is a set of alignments without gaps, obtained by the BLAST [Altschul et al., 1990] algorithm [Henikoff & Henikoff, 1992]. The alignments include sequences that share much lower sequence similarity than those used in the Dayhoff studies. In extensive tests of sequence database searching [Henikoff & Henikoff, 1993], pairwise alignment [Vogt et al., 1995] and multiple sequence alignment (Raghava & Barton, 1998, submitted), the BLOSUM series of matrices on average gives results superior to the PAM matrices and most other matrices. For this reason, BLOSUM matrices are now the general matrices of choice for protein sequence alignment, and are the defaults used by most popular sequence alignment and database searching software.
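To make "derived from observed substitutions" concrete, the sketch below counts residue pairs in the columns of ungapped alignment blocks and converts the counts into rounded log-odds scores in half-bit units, following the general recipe described by Henikoff & Henikoff (1992). It is a simplified illustration, not the published procedure: the clustering of similar sequences that distinguishes, say, BLOSUM62 from BLOSUM80 is omitted, the function names are assumptions, and the three-sequence block is invented toy data.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(blocks):
    """Count unordered residue pairs in every column of every ungapped block."""
    counts = Counter()
    for block in blocks:
        for column in zip(*block):  # one tuple of residues per alignment column
            for a, b in combinations(column, 2):
                counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(counts):
    """Turn pair counts into rounded log-odds scores (half-bit units)."""
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background residue frequencies implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Expected frequency if residues paired at random.
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # an unordered mismatch can occur in two orders
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented toy block: three sequences, two columns, no gaps.
toy_blocks = [["AV", "AL", "AL"]]
toy_counts = pair_counts(toy_blocks)
toy_scores = log_odds(toy_counts)
print(toy_scores)
```

Pairs that occur more often than chance predicts receive positive scores and pairs that occur less often receive negative ones. Pairs never observed in the input are absent from the result, so a real derivation needs far more data than this toy block.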

Rather than starting from alignments generated by sequence comparison, Overington et al. (1992) consider only proteins for which an experimentally determined three-dimensional structure is available. They then align similar proteins on the basis of their structure rather than sequence, and use the resulting sequence alignments as the database from which to gather substitution statistics. In principle, the Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures currently limits the reliability of their statistics. Overington et al. (1992) also develop further matrices that consider the local environment of the amino acids.


geoff@ebi.ac.uk