
Introduction

A protein that performs key biological functions will commonly have homologues sequenced from many different tissues and organisms. Accurate multiple sequence alignment of such a protein family can highlight the residues of common functional and structural importance. The location of identities and conservative substitutions may be used to guide the design of site-directed mutagenesis experiments, whilst the identification of subtle patterns of residue conservation can improve the accuracy of secondary and tertiary structure predictions [Benner & Gerloff, 1990][Crawford et al., 1987][Russell et al., 1992][Barton et al., 1991][Zvelebil et al., 1987]. Such analyses of multiple sequence alignments have traditionally been performed by eye. However, for large alignments, only the most obvious patterns of residue conservation can be easily identified in this way. When many long sequences are to be scrutinised, the task becomes unmanageable, and the risk of missing interesting residue substitutions is great.

A number of computer programs have been developed to aid the interpretation of multiple sequence alignments. The programs PRETTY and PRETTYPLOT from the GCG package [Devereux et al., 1984] derive consensus amino acid sequences and box the largest group of similar residues at each position of an alignment. ALSCRIPT [Barton, 1993] allows shading, boxing and colouring to be applied to an alignment. Colour is also exploited by the SOMAP program [Parry-Smith & Attwood, 1991], which colours residues according to the user-defined set to which they belong (e.g. hydrophobic, charged). The amino acid variation at a position in an alignment is reduced to a single figure of "variability" by Kabat [Kabat, 1976], "entropy" or "variation" by Sander & Schneider [Sander & Schneider, 1991], "information" by Smith & Smith [Smith & Smith, 1990] and "evolutionary divergence" by Brouillet et al. [Brouillet et al., 1992]. In contrast, the novel set-based approach described by Taylor [Taylor, 1986] defines the minimal set of physico-chemical properties that represent any group of amino acids. This principle has been developed by Zvelebil et al. [Zvelebil et al., 1987] so that the minimal set of amino acids could be encoded as a single "conservation number" at each position in the alignment. Although very effective at highlighting the overall similarity at each position in an alignment, none of these methods addresses the problem of quantifying similarities between sub-families within a larger multiple sequence alignment.
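To make the idea of reducing each alignment position to a single figure concrete, the following minimal sketch computes a Shannon entropy for every column of a toy alignment. It is only an illustration: the exact formulae used by the methods cited above are not given here, the function names (column_entropy, alignment_entropy) are hypothetical, and the treatment of gaps (ignored) is an assumption.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Assumption: gap characters ('-') are simply ignored.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Sum of p * log2(1/p) over the residue types observed in the column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropy(sequences):
    """Return one entropy value per column of an alignment.

    `sequences` is a list of equal-length aligned strings.
    """
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment: column 0 is invariant, column 1 has two residue types,
# column 2 has four different residue types.
toy_alignment = ["ACD", "ACE", "AGF", "AGG"]
entropies = alignment_entropy(toy_alignment)
```

A fully conserved column gives an entropy of zero, and the value rises as the residues become more varied. Like the other single-figure measures, it says nothing about which sequences carry which residue, which is the limitation noted above.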

It is frequently desirable to sub-divide a protein family on the basis of function, origin, sequence similarity or other criteria. Indeed, most multiple alignment methods (e.g. [Higgins & Sharp, 1989][Feng & Doolittle, 1987][Barton & Sternberg, 1987][Barton, 1990]) first compare all sequences pairwise, then automatically cluster the sequences into sub-families on the basis of sequence similarity. Such cluster analysis can readily identify the gross similarities between sequences but does not pinpoint the residue positions that are responsible for the clustering pattern. It may also be difficult to reconcile the clusters identified by overall sequence similarity with those implied by functional similarity, since functional differences may reside in a few key residues. Although all previous methods for characterising residue conservation (e.g. [Taylor, 1986][Parry-Smith & Attwood, 1991][Devereux et al., 1984][Brouillet et al., 1992][Smith & Smith, 1990][Sander & Schneider, 1991][Kabat, 1976]) provide a clear overview of conservation across an alignment, they do not allow the automatic identification of residue positions specific to sub-groups of sequences within the alignment.

In this paper we describe an algorithm for the systematic identification of residue conservation within aligned protein sequences. The algorithm operates in a hierarchical manner, first characterising conservation on a residue-by-residue basis within pre-defined sub-families, then between all pairs of sub-families. This hierarchical approach highlights positions that may be responsible for conferring the specific structural and functional properties of the sub-families.
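The two-level idea can be illustrated with a deliberately simplified sketch. It is not the algorithm described in this paper: strict residue identity stands in for the conservation measure, and the function names (conserved_residues, subfamily_specific_positions) and the dictionary layout of the sub-families are assumptions made purely for illustration.

```python
from itertools import combinations


def conserved_residues(sequences):
    """Within one sub-family, return the shared residue at each column.

    A column yields None if the sequences differ there or if it is a gap.
    Assumption: 'conserved' here means strictly identical.
    """
    result = []
    for column in zip(*sequences):
        first = column[0]
        if first != "-" and len(set(column)) == 1:
            result.append(first)
        else:
            result.append(None)
    return result


def subfamily_specific_positions(subfamilies):
    """Compare every pair of sub-families, column by column.

    `subfamilies` maps a sub-family name to a list of aligned sequences,
    all of equal length. For each pair, return the 0-based columns where
    both sub-families are internally conserved but differ from each other.
    """
    # Level 1: conservation within each sub-family.
    within = {name: conserved_residues(seqs) for name, seqs in subfamilies.items()}
    # Level 2: comparison between all pairs of sub-families.
    between = {}
    for a, b in combinations(sorted(within), 2):
        between[(a, b)] = [
            i
            for i, (ra, rb) in enumerate(zip(within[a], within[b]))
            if ra is not None and rb is not None and ra != rb
        ]
    return between


# Toy example with two hypothetical sub-families.
toy = {
    "A": ["MKLV", "MKIV"],
    "B": ["MRLV", "MRLV"],
}
specific = subfamily_specific_positions(toy)
```

In this toy example only column 1 is reported for the pair ("A", "B"): each sub-family is internally invariant there (K in one, R in the other) while the two differ from each other. Such positions are the kind of candidate a hierarchical analysis is intended to bring out.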


cdl@bioch.ox.ac.uk