
The Dayhoff mutation data matrix

Possibly the most widely used scheme for scoring amino acid pairs is that developed by Dayhoff and co-workers [1]. The system arose out of a general model for the evolution of proteins. Dayhoff and co-workers examined alignments of closely similar sequences where the likelihood of a particular mutation (e.g. A-D) being the result of a set of successive mutations (e.g. A-x-y-D) was low. Since relatively few families were considered, the resulting matrix of accepted point mutations included a large number of entries equal to 0 or 1. A complete picture of the mutation process, including those amino acids which did not change, was determined by calculating the average ratio of the number of changes a particular amino acid type underwent to the total number of amino acids of that type present in the database. This was combined with the point mutation data to give the mutation probability matrix (M), where each element M_ij gives the probability of the amino acid in column j mutating to the amino acid in row i after a particular evolutionary time, for example after 2 PAM (Percentage of Acceptable point Mutations per 10^8 years).
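The way the two ingredients combine can be sketched in a few lines of Python. This is only an illustration on a three-letter toy alphabet: the counts, the mutabilities and the scaling constant `scale` are invented, and the formula used (off-diagonal elements proportional to mutability times the share of accepted mutations, diagonal elements equal to one minus the scaled mutability) is the commonly quoted form of the Dayhoff construction rather than something spelled out in the text above.

```python
# Toy three-letter alphabet. All numbers are invented for illustration
# and are NOT Dayhoff's data.
ALPHABET = ["A", "B", "C"]

# accepted[i][j]: symmetric count of accepted point mutations between i and j.
accepted = [
    [0, 30, 10],
    [30, 0, 20],
    [10, 20, 0],
]

# mutability[j]: (changes observed for type j) / (occurrences of type j).
mutability = [0.8, 1.0, 0.6]


def mutation_probability_matrix(accepted, mutability, scale):
    """Return M where M[i][j] is the probability that j (column) becomes i (row).

    `scale` is a hypothetical constant that sets the evolutionary distance.
    """
    n = len(accepted)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Total accepted mutations involving amino acid j.
        column_total = sum(accepted[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                # Probability that j is left unchanged.
                matrix[i][j] = 1.0 - scale * mutability[j]
            else:
                # Share of j's mutations that lead to i.
                matrix[i][j] = scale * mutability[j] * accepted[i][j] / column_total
    return matrix


M = mutation_probability_matrix(accepted, mutability, scale=0.01)

# Each column is a probability distribution over outcomes, so it sums to 1.
column_sums = [sum(M[i][j] for i in range(len(M))) for j in range(len(M))]
```

Choosing a larger `scale` corresponds to a larger evolutionary distance, provided every diagonal element stays positive.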

The mutation probability matrix is specific for a particular evolutionary distance, but may be used to generate matrices for greater evolutionary distances by multiplying it repeatedly by itself. At the level of 2,000 PAM, Schwartz and Dayhoff suggest that all the information present in the matrix has degenerated, except that the matrix element for Cys-Cys is 10% higher than would be expected by chance. At the evolutionary distance of 256 PAMs one amino acid in five remains unchanged, but the amino acids vary in their mutability: 48% of the tryptophans, 41% of the cysteines and 20% of the histidines would be unchanged, but only 7% of the serines would remain.
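Repeated self-multiplication is easy to demonstrate on a toy matrix. The sketch below assumes a hypothetical two-letter alphabet and an invented base matrix for one unit of evolutionary distance; the helper names `mat_mul` and `extrapolate` are illustrative only. It shows the qualitative behaviour described above: the unchanged fraction on the diagonal shrinks with distance, and at very large distances every column converges to the same composition, so the matrix no longer carries information.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(base, steps):
    """Return base multiplied by itself so that it appears `steps` times."""
    result = base
    for _ in range(steps - 1):
        result = mat_mul(result, base)
    return result


# Invented base matrix: column j gives the probabilities of j becoming row i.
base = [[0.99, 0.02],
        [0.01, 0.98]]

m_256 = extrapolate(base, 256)
m_2000 = extrapolate(base, 2000)

# Fraction of each letter left unchanged at the intermediate distance.
unchanged_256 = [m_256[k][k] for k in range(2)]
```

For this toy matrix both columns of `m_2000` approach (2/3, 1/3), the composition at which the two mutation rates balance.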

When used for the comparison of protein sequences, the mutation probability matrix is usually normalised by dividing each element M_ij by the relative frequency of exposure to mutation of the amino acid i. This operation results in the symmetrical "relatedness odds matrix", with each element giving the probability of amino acid replacement per occurrence of i per occurrence of j. The logarithm of each element is taken so that the scores can be summed over a series of amino acids rather than requiring the probabilities to be multiplied. The resulting matrix is the "log-odds matrix", which is frequently referred to as "Dayhoff's matrix" and is often used at a distance of close to 256 PAM, since this lies near to the limit of detection of distant relationships, where approximately 80% of the amino acid positions are observed to have changed [2].
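The normalisation and the log transform can be illustrated with a minimal Python sketch. The two-letter alphabet, the matrix `M` and the frequencies `freq` are invented, and the use of 10 times the base-10 logarithm is an assumption that follows a common convention; the text above only says that a logarithm is taken. The frequencies are chosen so that the toy odds matrix comes out symmetrical.

```python
import math

ALPHABET = "AB"

# Invented mutation probability matrix (column j mutates to row i) and the
# frequencies it is normalised by. Neither is real amino acid data.
M = [[0.99, 0.02],
     [0.01, 0.98]]
freq = [2 / 3, 1 / 3]


def relatedness_odds(matrix, frequencies):
    """Divide each element in row i by the frequency of letter i."""
    n = len(matrix)
    return [[matrix[i][j] / frequencies[i] for j in range(n)] for i in range(n)]


def log_odds(odds):
    """Take 10 * log10 of every element (scaling factor is an assumption)."""
    return [[10 * math.log10(x) for x in row] for row in odds]


def score_alignment(seq_a, seq_b, scores):
    """Sum the log-odds scores over the aligned positions of two sequences."""
    index = {letter: k for k, letter in enumerate(ALPHABET)}
    return sum(scores[index[a]][index[b]] for a, b in zip(seq_a, seq_b))


R = relatedness_odds(M, freq)
S = log_odds(R)

# Summing log-odds is equivalent to multiplying the underlying odds.
total = score_alignment("AAB", "ABB", S)
```

Here the mismatched pair A-B has odds below 1 and therefore a negative score, while the two identical pairs score positively.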





geoff.barton@ox.ac.uk