The scoring scheme is a
matrix of numbers that defines
the value for aligning each of the possible amino acid pairs.
The term substitution is often used for the alignment
of two amino acid residues, since scoring schemes are frequently derived
from a model of evolution that considers two protein sequences to be
related via a series of point mutations. The pair-score matrix
is usually symmetrical, since Ala aligned with Gly has the same
meaning as Gly aligned with Ala. The simplest scoring scheme is the
identity matrix. This scores 1 for an exact match of two amino acids,
and 0 for a mismatch. Although the identity matrix is appealing in
its simplicity, it does not adequately reflect the similarities observed
between proteins that have similar three-dimensional structures. More
sophisticated schemes take into account conservative substitutions.
For example, Val aligned with Leu might score +4, but Glu with Leu,
-3. Until recently, matrices referred to as PAM or Dayhoff were the
most widely used. PAM matrices were derived by first aligning a small
number of families of protein sequences by eye, then counting the
observed amino acid substitutions within the families and normalising
the counts before extrapolating the observed substitutions to those
expected at different evolutionary distances [Dayhoff et al., 1978]. The
measure of evolutionary distance used was the Percentage of Accepted
Mutations, or PAM (1 PAM corresponds to an average of one accepted
mutation per 100 residues), and the most commonly applied matrix was
that at 250 PAMs, normally known as PAM250.
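The difference between the identity matrix and a scheme that rewards conservative substitutions can be made concrete with a minimal sketch. Only the two pair scores quoted above (Val/Leu +4, Glu/Leu -3) come from the text; the match score of 5, the default of 0 for unlisted pairs, and all function names are assumptions chosen for illustration. Gaps are not handled.

```python
def identity_score(a, b):
    """Identity matrix: 1 for an exact match, 0 for a mismatch."""
    return 1 if a == b else 0


# Hypothetical fragment of a substitution matrix. Keys are unordered pairs,
# so score(V, L) == score(L, V): the matrix is symmetric by construction.
TOY_SCORES = {
    frozenset(("V", "L")): 4,   # conservative substitution (value from the text)
    frozenset(("E", "L")): -3,  # non-conservative substitution (value from the text)
}


def toy_score(a, b, match=5, default=0):
    """Look up a pair score; `match` and `default` are illustrative assumptions."""
    if a == b:
        return match
    return TOY_SCORES.get(frozenset((a, b)), default)


def alignment_score(seq1, seq2, score):
    """Sum the pair scores over the columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length (gaps are not handled)")
    return sum(score(a, b) for a, b in zip(seq1, seq2))


# One-letter codes: V = Val, E = Glu, L = Leu.
identity_total = alignment_score("VEL", "LLL", identity_score)
toy_total = alignment_score("VEL", "LLL", toy_score)
```

Under the identity matrix, the Val/Leu and Glu/Leu columns are indistinguishable (both score 0), whereas the toy matrix separates the conservative substitution from the non-conservative one.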
In spite of its small training-set size, the PAM250 matrix captures the principal physico-chemical properties of the amino acids [Taylor, 1986a]. Furthermore, updates to the PAM matrix obtained from much larger data sets, for example, the PET92 matrix [Jones et al., 1992], differ little from PAM250, except for substitutions involving the less common amino acids such as tryptophan.
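The extrapolation of observed substitutions to larger evolutionary distances, as used to obtain matrices such as PAM250, can be illustrated by repeatedly applying a one-step mutation probability matrix. The sketch below is not the Dayhoff data: it assumes a hypothetical three-letter alphabet, made-up one-step probabilities, and uniform background frequencies. The conversion to a log-odds score (10 times the base-10 logarithm of the ratio of the extrapolated probability to the chance frequency) follows the usual description of the Dayhoff procedure.

```python
import math

ALPHABET = "ABC"  # hypothetical 3-letter alphabet, not real amino acids

# Hypothetical one-step mutation probabilities; each row sums to 1.
# A and B interchange more readily than either does with C.
STEP = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

# Background frequencies, assumed uniform for this symmetric toy matrix.
FREQ = [1 / 3, 1 / 3, 1 / 3]


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def log_odds(m, i, j):
    """Log-odds score: extrapolated probability relative to chance."""
    return 10 * math.log10(m[i][j] / FREQ[j])


# Extrapolate the one-step matrix to 250 steps, by analogy with PAM250.
M250 = mat_pow(STEP, 250)
```

After 250 steps the diagonal entries have decayed well below their one-step values, but the pair that interchanges readily (A, B) still scores higher than the pair that rarely does (A, C), which is the kind of pattern a matrix at a large evolutionary distance is meant to retain.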
The BLOSUM series of matrices [Henikoff & Henikoff, 1992] is also derived from an analysis of observed substitutions in protein families. Unlike PAM matrices, the starting point for BLOSUM is a set of alignments without gaps, obtained by the BLAST [Altschul et al., 1990] algorithm [Henikoff & Henikoff, 1992]. The alignments include sequences that share much lower sequence similarity than those used in the Dayhoff studies. In extensive tests of sequence database searching [Henikoff & Henikoff, 1993], pairwise alignment [Vogt et al., 1995] and multiple sequence alignment [Raghava & Barton, 1998, submitted], the BLOSUM series of matrices gives, on average, results superior to those from the PAM matrices and most other matrices. For this reason, BLOSUM matrices are now the general matrices of choice for protein sequence alignment, and are the defaults in most popular sequence alignment and database searching software.
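Counting observed substitutions in ungapped alignments, the step shared by this family of approaches, can be sketched as follows. The toy block, the function names, and the half-bit log-odds scaling are assumptions for illustration; the sketch omits the clustering of closely related sequences that the published BLOSUM procedure applies, and it scores only pairs that actually occur in the block.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """Turn pair counts into rounded log-odds scores (half-bit units assumed)."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}  # observed pair frequencies

    # Residue frequencies implied by the pairs: a mixed pair contributes
    # half of its frequency to each of its two residues.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Chance frequency of the unordered pair given the residue frequencies.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical ungapped block of four short sequences (one-letter codes).
BLOCK = ["VLE", "VLE", "LVE", "VLD"]
COUNTS = pair_counts(BLOCK)
SCORES = log_odds_scores(COUNTS)
```

With so few sequences the resulting scores are dominated by sampling noise, which is why the reliability of any such matrix depends on the number and diversity of the alignments it is built from.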
Rather than starting from alignments generated by sequence comparison, Overington et al. (1992) consider only proteins for which an experimentally determined three-dimensional structure is available. They then align similar proteins on the basis of their structure rather than sequence and use the resulting sequence alignments as the database from which to gather substitution statistics. In principle, the Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures currently limits the reliability of their statistics. Overington et al. (1992) also develop further matrices that take into account the local environment of the amino acids.