The procedures described in the previous section are a straightforward extension of the principles described by Zvelebil et al. [Zvelebil et al., 1987] and Taylor [Taylor, 1986]. Here we extend the set-based method to identify conserved features of sequence sub-groups within larger protein sequence alignments.
The starting point for hierarchical conservation analysis is the identification of two or more subsets of sequences within a multiple sequence alignment. The subsets may be defined by grouping on the basis of overall sequence similarity, functional similarity, origin, or other criteria. Given such groupings, the aim is to highlight which residue positions define the unique properties of each group.
Figures 3 and 4 illustrate the result of applying hierarchical conservation analysis to a nine-residue fragment of a 26-sequence multiple alignment using the 10-property index shown in Figure 1. The dendrogram at the left of Figure 3 shows the overall similarity between the sequences (i.e. not just the 9 residues) and clearly splits the sequences into three sub-groups labelled A, B and C.
Conservation numbers are calculated for each alignment position in each sub-group and a conservation threshold is set. This threshold is used to put each position within a sub-group into one of three classes: (1) identical positions; (2) conserved positions, where the conservation number is greater than or equal to the threshold; and (3) unconserved positions, where the conservation number is less than the threshold. The choice of threshold depends upon the particular conservation index being used. For the index shown in Figure 1, a threshold between 6 and 8 normally gives the most informative results.
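The three-way classification can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the function name `classify_position`, its arguments, and the default threshold of 8 (taken from the 6 to 8 range suggested above) are assumptions, and the conservation number is taken as already computed for the column.

```python
def classify_position(column, conservation_number, threshold=8):
    """Place one sub-group column in one of the three classes.

    column: residues observed at this position within the sub-group.
    conservation_number: value already computed for the column.
    threshold: assumed cut-off, chosen from the 6 to 8 range.
    """
    # Class 1: every sequence in the sub-group has the same residue.
    if len(set(column)) == 1:
        return "identical"
    # Class 2: residues differ but the conservation number reaches the threshold.
    if conservation_number >= threshold:
        return "conserved"
    # Class 3: the conservation number falls below the threshold.
    return "unconserved"


# Hypothetical columns and conservation numbers, for illustration only.
print(classify_position("LLLL", 10))
print(classify_position("DEDE", 9))
print(classify_position("DRQE", 6))
```

Testing for identity before applying the threshold keeps class (1) distinct from class (2), since an identical column would otherwise also satisfy the threshold test.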
In Figure 3, the different classifications using a threshold of 8 are illustrated by shading and font changes. For example, in sub-group A, identities are shown in white on dark grey at positions 2 and 4, conserved positions are in black on light grey (positions 6-9), and unconserved positions are shown in italics on a white background (positions 3 and 5). At position 1, the identity in all sequences is marked by white-on-black lettering, whilst at position 10 chancery script lettering is used to highlight the lack of conservation within all sub-groups.
Having classified the conservation within each sub-group, all pairs of sub-families are compared and conservation numbers calculated for each position in the pairs. In the calculation of conservation for a pair of sub-families, the residues from the pair are considered as members of a single group. The conservation number is then calculated, as described above, for the composite group according to whichever method was chosen. The change in conservation value that occurs when each pair of sub-families is brought together reflects the similarities or differences in physico-chemical properties seen in each sub-group at that position. For example, at position 7 of sub-families A and B the conservation values in A, B and A + B are all 9, showing that the properties are conserved within each family, and across both families, at this position. This is, therefore, a location that exhibits common physico-chemical properties between A and B, yet these properties are not conserved within group C. Accordingly, this may indicate a tertiary structural feature shared between A and B, but not C.
In contrast, at position 8 of sub-groups A and C, in order to "visit" all members of the combined set of amino acids from A + C (DEQR) a minimum of 4 set borders must be crossed, giving a conservation number of 10 - 4 = 6. The conservation values for A, C and A + C are, therefore, 9, 8 and 6 respectively. Thus, although properties are conserved within each sub-group at this position, the properties that are conserved differ between the sub-groups. This type of conservation pattern might highlight a position in the protein structure that defines the specificity for a substrate. For example, the switch from a predominantly negative to a predominantly positive charge between groups A and C may signal increased binding of a negatively charged moiety by the group C sequences when compared to group A.
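The pooling step and the border-counting arithmetic can be sketched as follows. This is a hedged illustration only: the property names and the per-residue assignments in `RESIDUE_PROPERTIES` are an assumed stand-in for the Figure 1 index (which is not reproduced here) and cover only the four residues in the worked example; the split of DEQR into columns "DE" for sub-group A and "QR" for sub-group C is likewise an assumption, and all function names are hypothetical.

```python
# Assumed 10-property index; a stand-in for Figure 1, not a copy of it.
PROPERTIES = (
    "hydrophobic", "polar", "small", "tiny", "aliphatic",
    "aromatic", "positive", "negative", "charged", "proline",
)

# Illustrative property sets for the four residues in the worked example.
RESIDUE_PROPERTIES = {
    "D": {"polar", "small", "negative", "charged"},
    "E": {"polar", "negative", "charged"},
    "Q": {"polar"},
    "R": {"polar", "positive", "charged"},
}


def borders_crossed(residues):
    """Count properties held by some, but not all, of the residues."""
    distinct = set(residues)
    crossed = 0
    for prop in PROPERTIES:
        holders = sum(prop in RESIDUE_PROPERTIES[r] for r in distinct)
        if 0 < holders < len(distinct):
            crossed += 1
    return crossed


def conservation_number(residues):
    """Number of properties minus the number of set borders crossed."""
    return len(PROPERTIES) - borders_crossed(residues)


def compare_pair(group_a, group_b):
    """Conservation numbers for each group and for the pooled group."""
    pooled = list(group_a) + list(group_b)
    return (
        conservation_number(group_a),
        conservation_number(group_b),
        conservation_number(pooled),
    )


# Assumed column contents for sub-groups A and C at the position discussed.
print(compare_pair("DE", "QR"))  # (9, 8, 6)
```

Under these assumed assignments the pooled set DEQR crosses four borders (small, positive, negative and charged), so the sketch returns 9, 8 and 6, in line with the values quoted in the text. A different property table would give different numbers.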
General rules for linking such substitution patterns to changes in three-dimensional structure or function are as yet unknown. However, changes in conservation of charge, hydrophobicity or amino acid size are likely to be of importance in all protein families.
The results of the pairwise comparison of sub-families are summarised below the alignment in Figure 3. The conservation values for the pairs of sub-groups are displayed as either similarities or differences according to the rules shown in Table I. The similarity and difference sections are also summarised as histograms.
The hierarchical clustering approach addresses the problem of how to weight the information content of each sequence in an alignment. At the simplest level, each sequence would be treated equally, but this relies on the sequences being equally diverse throughout the alignment. The use of clustering to derive conservation patterns ensures that equal weight is given to different groups of proteins irrespective of the number of examples of each type. Inevitably, this process involves the loss of information about the minor sequence variation that is responsible for subtle differences in character between similar proteins in a sub-group. This loss is balanced by the ability to detect the more substantial changes in conservation that determine the differences in properties between the separate sub-groups.