
Introduction

Proteins having no detectable sequence similarity can adopt similar 3D structures. Similarities are often observed for proteins having no functional similarity, or from different kingdoms or tissues (e.g., Holm and Sander, 1993a; Swindells et al., 1993; Russell and Barton, 1993a). Despite the possibility for almost infinite variation at the level of the gene, Nature is apparently restricted to a limited number of protein folds.

Currently there are approximately proteins of known 3D structure, which can be classified further into approximately unique fold families [Orengo et al., 1993]. A fold family is a collection of proteins having similar 3D structures, but not necessarily any sequence or functional similarity. Many families contain members with no common features across their sequences (for example, the -barrel, greek key -barrel and jelly-roll folds).

Here, to simplify discussion, we introduce a three-state classification of protein 3D structural similarities. At one extreme (type ) are pairs of proteins sharing sequence, structural and (usually) functional similarity. Type similarities include the globins, mammalian serine proteinases, Ig variable domains and cytochromes . In the middle (type ) are those proteins having structural and functional similarity, but little sequence similarity, such as the mammalian and bacterial serine proteinases, azurin/plastocyanin, the Rossmann fold dehydrogenases (e.g., lactate, alcohol, etc.), Ig domains and CD4, aspartic proteinase lobes, rhodanese domains, and the heat shock protein/actin fold. Finally, at the other extreme (type ), are proteins with only 3D structural similarity, such as the Rossmann fold domains (e.g., lactate DH and glycogen phosphorylase), barrels, and greek key barrels (e.g., Ig domains, azurin, superoxide dismutase, etc.). Families of types ( and ) often contain members with some structural differences, and with large insertions required to align structures accurately (e.g., haemocyanin compared to superoxide dismutase; or the barrels from aldolase and rubisco). Since functional similarity is difficult to define, the divisions between the types are not discrete, though the three categories provide a convenient means for classifying an observed structural similarity. The frequently used terms "homologous" and "analogous" probably define types and respectively, with falling somewhere in between. When comparing protein sequences or 3D structures, generally, type and some type similarities are detectable by sequence comparison methods (see Argos, 1991 for a review), though many type similarities are undetectable unless one considers 3D structure or functional information for one member of the family (e.g., Barton & Sternberg, 1990; Bowie et al., 1991; Jones et al., 1992). 
Structural similarities of type are usually only detectable when both 3D structures are considered [Russell & Barton, 1992][Sali & Blundell, 1990][Taylor, 1989][Mitchell et al., 1989], with some notable exceptions [Godzik et al., 1993][Jones et al., 1992]. Protein structural families frequently contain similarities spanning types through . Figure 1 shows an example for the family of greek key barrel structures. The figure shows three similar pairs: a) two Ig light chain variable domains (), which share functional and sequence similarity; b) an Ig light chain variable domain and the N-terminal domain of CD4 (), which are both immune system recognition proteins; and c) an Ig light chain variable domain and poplar plastocyanin (), which are similar only in that they have a similar arrangement of seven strands.
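The three-state classification above can be written as a small decision rule. The following is only an illustrative sketch: the enum member names and the function `classify_similarity` are hypothetical labels introduced here (they are not the type names used in the text), and the rule treats sequence and functional similarity as yes/no properties even though, as noted above, the divisions between the types are not discrete.

```python
from enum import Enum


class SimilarityType(Enum):
    # Descriptive placeholder names (assumptions), one per category in the text.
    SEQUENCE_STRUCTURE_FUNCTION = "sequence, structural and (usually) functional similarity"
    STRUCTURE_FUNCTION = "structural and functional similarity, little sequence similarity"
    STRUCTURE_ONLY = "3D structural similarity only"


def classify_similarity(sequence_similar: bool, function_similar: bool) -> SimilarityType:
    """Classify a pair of proteins that is already known to share a similar 3D structure."""
    if sequence_similar:
        # Detectable sequence similarity places the pair at the first extreme.
        return SimilarityType.SEQUENCE_STRUCTURE_FUNCTION
    if function_similar:
        # No detectable sequence similarity, but a shared function: the middle category.
        return SimilarityType.STRUCTURE_FUNCTION
    # Only the fold is shared.
    return SimilarityType.STRUCTURE_ONLY


# The three pairs described for Figure 1, encoded under the assumptions above.
pairs = {
    "Ig VL / Ig VL": classify_similarity(sequence_similar=True, function_similar=True),
    "Ig VL / CD4 N-terminal domain": classify_similarity(sequence_similar=False, function_similar=True),
    "Ig VL / plastocyanin": classify_similarity(sequence_similar=False, function_similar=False),
}
for name, kind in pairs.items():
    print(f"{name}: {kind.value}")
```

In practice the two inputs would themselves come from thresholds on a sequence comparison score and from a judgement about function, which is where the boundaries between categories blur.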

Despite dozens of examples of similarities of types and , little is understood as to why different sequences can adopt similar 3D structures. Most studies to date have dealt with specific families of proteins having functional similarity (i.e., types and ), such as the globins [Bordo & Argos, 1991][Bordo & Argos, 1990][Pastore et al., 1988][Bashford et al., 1987][Lesk & Chothia, 1980], the Ig domains [Chothia & Lesk, 1982], blue copper (plastocyanin-like) photosynthetic proteins [Adman, 1984][Lesk & Chothia, 1982], nucleotide binding folds [Otto et al., 1980][Rossmann & Argos, 1976][Rossmann et al., 1974], oligonucleotide/oligosaccharide binding folds [Murzin, 1993][Sixma et al., 1993], proteinases [Craik et al., 1983][Blundell et al., 1979] or hydrolases [Ollis et al., 1992]. However, some studies have considered more distantly related protein 3D structures (i.e., type ), such as greek key barrels [Hutchinson & Thornton, 1992][Hazes & Hol, 1992], globin/phycocyanin/colicin A [Holm & Sander, 1993b][Pastore & Lesk, 1990], barrels [Farber, 1993][Farber & Petsko, 1990], trefoils [Swindells & Thornton, 1993][Murzin et al., 1992], toxin-agglutinin folds [Drenth et al., 1980] or jelly-roll folds [Chelvanayagam et al., 1992]. Though the details of such studies differ, they generally suggest functional and packing features common to a particular family, but provide few generalisations that might be applied to other protein structural families. Similarities of type (and some of type ) have common features in the protein cores and around common binding or active sites. Similarities of type (and some of type ) often have few common features. For example, even the most distantly related oxygen carrying globin folds (type and similarities) share haem binding residues as well as several key hydrophobic core residues [Pastore et al., 1988][Bashford et al., 1987]. 
However, when one adds to the family the structurally similar, but functionally different, phycocyanin and colicin A structures, few common residues can be found [Holm & Sander, 1993b][Pastore & Lesk, 1990].

There have been some investigations into the general features of structurally similar proteins. Chothia & Lesk (1986) considered 32 pairs of structures and found that distantly related proteins could have as little as of their structures in a common core. They also found a logarithmic relationship between sequence identity and RMS deviation on core main-chain atoms; RMS deviation increased exponentially with decreasing sequence identity. Pascarella and Argos (1992) considered families of protein 3D structures and established general rules for the occurrence of insertions and deletions (e.g., that they tend to be between 1 and 5 residues long, and rarely occur within helices or strands). Flores et al. (1993) examined how RMS deviation, number of to contacts, solvent accessibility angle and secondary structure behaved as a function of sequence identity for 90 pairs of structurally similar proteins. They found an approximately inverse linear relationship between the variation of all of these properties and sequence identity. For pairs of structures having a similar sequence identity, they found little difference between "homologous" (i.e., type and some type similarities) and "analogous" (i.e., some type and type similarities) proteins.
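The exponential dependence of core RMS deviation on sequence divergence can be made concrete with a short calculation. The sketch below assumes the form in which the Chothia & Lesk (1986) result is commonly quoted, RMSD = a exp(b H), with H the fraction of non-identical residues. The default coefficients a = 0.40 Å and b = 1.87 are not given in this text; they are an assumption taken from the usual statement of that result and should be checked against the original paper. The function name `expected_core_rmsd` is hypothetical.

```python
import math


def expected_core_rmsd(percent_identity: float, a: float = 0.40, b: float = 1.87) -> float:
    """Approximate RMS deviation (in Angstroms) of core main-chain atoms.

    Uses RMSD = a * exp(b * H), where H is the fraction of residues that
    differ between the two aligned sequences. The defaults for a and b are
    assumed values (see the accompanying text), not taken from this paper.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must lie between 0 and 100")
    fraction_different = 1.0 - percent_identity / 100.0
    return a * math.exp(b * fraction_different)


# RMSD grows as identity falls; taking logs of both sides gives a straight line
# in H, which is the "logarithmic relationship" referred to above.
for identity in (100, 75, 50, 25):
    print(f"{identity:3d}% identity -> {expected_core_rmsd(identity):.2f} A")
```

The relationship describes only the common core, so it says nothing about the regions outside the core, which are a large part of the structure in distantly related pairs.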

Detection of type and similarities prior to 3D structural determination is of great interest, since detection and alignment can provide tertiary structure information via homology modelling, and can suggest experiments to determine biological function. In an attempt to detect more type and similarities than is possible by sequence comparison, many methods for providing the best fit of a sequence to a structure have been described (Sippl, 1990; Bowie et al., 1991; Luthy et al., 1991; Overington et al., 1992; Jones et al., 1992; Johnson et al., 1993; Bryant & Lawrence, 1993; Godzik et al., 1993; Wilmanns & Eisenberg, 1993; see Bowie & Eisenberg, 1993 or Wodak & Rooman, 1993 for reviews).

These methods have been inspired by the earlier work of Novotný et al. (1984, 1988), which showed that purposefully misfolded proteins gave rise to favourable energies using CHARMm parameters [Brooks et al., 1983]. Novotný et al. found that the misfolded proteins had more hydrophobic residues exposed to solvent and more buried ionisable side-chains. Though the details differ, most methods for fitting sequence to 3D structure provide a measure of the quality of the fit based on one or more of: a) accessibility preferences; b) loop solvation potentials; c) secondary structure preferences; and d) amino acid pair preferences (discussed below). Sippl (1990) first suggested the use of amino acid pair preferences (derived from analysis of known protein 3D structures) for measuring sequence and structure compatibility. Pair preferences provide a measure of how likely each type of amino acid is to interact with every other type, and can be used to assess the quality of the fit of a sequence to a 3D structure if one threads a sequence onto the known structure. Optimal sequence threading involves getting the best alignment of sequence and 3D structure by a consideration of such pair preferences. The use of pair preferences means that threading, unlike most methods of protein sequence alignment, is a 3D problem, since moving residues along the sequence in one region of the structure can affect residues separated by a long length of sequence. Several threading algorithms for protein fold detection have been described [Bryant & Lawrence, 1993][Sippl & Weitckus, 1992][Godzik et al., 1993][Jones et al., 1992].
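To illustrate why pair preferences make threading a 3D problem, the sketch below scores one fixed placement of a sequence on a structure by summing a pair preference over every structural contact. It is a deliberately simplified illustration and not the method of any of the cited papers: the function `threading_score`, the contact list, the toy preference values and the convention that lower scores are more favourable are all assumptions made for the example. Real pair preferences are derived from known structures and are typically distance dependent.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Pair preference table keyed by the unordered pair of residue types (assumed layout).
PairTable = Dict[FrozenSet[str], float]


def threading_score(
    sequence: str,
    alignment: List[Optional[int]],
    contacts: List[Tuple[int, int]],
    pair_pref: PairTable,
    default: float = 0.0,
) -> float:
    """Sum pair preferences over all contacts of the structure.

    alignment[i] is the index in `sequence` placed at structure position i,
    or None if that position is left empty (a gap in the sequence).
    """
    total = 0.0
    for i, j in contacts:
        si, sj = alignment[i], alignment[j]
        if si is None or sj is None:
            continue  # a contact with an unoccupied position contributes nothing
        key = frozenset((sequence[si], sequence[sj]))
        total += pair_pref.get(key, default)
    return total


# Toy data: a five-position structure with two long-range contacts.
contacts = [(0, 3), (1, 4)]
pair_pref: PairTable = {
    frozenset(("L", "L")): -1.0,  # hydrophobic pair, favourable (made-up value)
    frozenset(("K", "E")): -0.8,  # opposite charges, favourable (made-up value)
    frozenset(("K", "L")): 0.5,   # unfavourable (made-up value)
}
sequence = "LKDLE"

in_register = [0, 1, 2, 3, 4]     # residue k sits at position k
shifted = [1, 2, 3, 4, None]      # whole sequence moved by one position

print(threading_score(sequence, in_register, contacts, pair_pref))
print(threading_score(sequence, shifted, contacts, pair_pref))
```

Shifting the alignment by one position changes which residues occupy both ends of every contact, so the score of a position depends on choices made far away in the sequence. This coupling is what prevents a straightforward use of the dynamic programming applied in ordinary sequence alignment.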

Methods of protein fold detection have met with some success, being able to detect similarities (and provide accurate sequence alignments) between proteins having little sequence similarity, but which are known to adopt a similar 3D structure. However, most of the success appears to be associated with aligning structures of similarity types and . Many type and similarities remain difficult to detect or align accurately, particularly when pair preferences are not used. For example, the 3D-1D method of Bowie et al. (1991) is apparently unable to detect the similarities between hexokinase and actin [Thornton et al., 1991][Bowie et al., 1991], between enterotoxin and verotoxin [Sixma et al., 1993], or between various barrels [Pickett et al., 1992]. However, the use of pair preferences can enable detection and alignment of several type and similarities. For example, the method of Jones et al. (1992) accurately found myohemerythrin to be a plausible fold for cytochrome B562 by threading the B562 sequence onto each of a database of representative folds, despite the lack of sequence or functional similarity between these proteins. The method of Godzik et al. (1993) detected the similarity between the plastocyanin and immunoglobulin structures by using a template derived from the plastocyanin structure to search a sequence database.

An assumption common to fold detection methods is that certain structural features (such as those described by Novotný et al. and Sippl) are conserved or shared across proteins having similar 3D structures, even in the absence of sequence similarity. In order for these methods to be successful, secondary structure, accessibility and/or particular side-chain to side-chain interactions must be conserved across similar protein 3D structures. To date, there has been no investigation of the conservation of particular side-chain properties within distantly related proteins. Studies have concentrated on closely related protein 3D structures (i.e., type or some type similarities), and these have been used to derive environment specific parameters for side-chain substitutions [Johnson et al., 1993][Overington et al., 1992][Bowie et al., 1991][Luthy et al., 1991][Bowie et al., 1990][Overington et al., 1990]. The high degree of similarity in the proteins used to derive the parameters need not necessarily apply to more distantly related protein pairs, which is, perhaps, why these methods appear to make only a marginal improvement over methods which do not make use of 3D structural data [Henikoff & Henikoff, 1993][Barton & Sternberg, 1990][Lipman & Pearson, 1985][Gribskov et al., 1987][Taylor, 1986b].

In this paper, protein 3D structural alignments are used to investigate the conservation of side-chain accessibility, secondary structure and side-chain to side-chain interactions within protein 3D structure pairs having a range of similarities (i.e., types - - ). The importance of the results for protein fold detection methods and protein evolution is discussed.


gjb@
Thu Feb 9 18:06:48 GMT 1995