Proteins having no detectable sequence similarity can adopt similar
3D structures. Similarities are often observed for proteins having no
functional similarity, or from different kingdoms or tissues (e.g., Holm and Sander, 1993a;
Swindells et al., 1993; Russell and Barton, 1993a). Despite the
possibility for almost infinite variation at the level of the
gene, Nature is apparently restricted to a limited number of protein
folds.
Currently there are approximately proteins of known 3D structure, which
can be classified further into approximately
unique fold families [Orengo et al., 1993]. A fold family is a collection of proteins
having similar 3D structures, but not necessarily any sequence or functional
similarity. Many families contain members with no common features
across their sequences (for example, the $\alpha/\beta$-barrel, greek key
$\beta$-barrel and jelly-roll folds).
Here, to simplify discussion, we introduce a three-state classification of protein 3D structural
similarities. At one extreme (type A) are pairs of proteins sharing sequence, structural and
(usually) functional similarity. Type A similarities include the globins, mammalian serine proteinases,
Ig variable domains and cytochromes $c$. In the middle (type B) are those proteins having structural and
functional similarity, but little sequence similarity, such as the mammalian and bacterial serine proteinases,
azurin/plastocyanin, the Rossmann fold dehydrogenases (e.g., lactate, alcohol, etc.), Ig domains and CD4, aspartic
proteinase lobes, rhodanese domains, and the heat shock protein/actin fold. Finally, at the other extreme
(type C), are proteins with only 3D structural similarity, such as the
Rossmann fold domains (e.g., lactate DH and glycogen phosphorylase), $\alpha/\beta$ barrels, and greek key
$\beta$ barrels (e.g., Ig domains, azurin, superoxide dismutase, etc.).
Families of types B and C often contain members with some structural differences, and with large insertions
required to align structures accurately (e.g., haemocyanin compared to superoxide dismutase; or the
$\alpha/\beta$ barrels from aldolase and rubisco). Since functional similarity is difficult to define, the divisions between each type
are not discrete, though the three categories provide a convenient means for classifying an observed structural
similarity. The frequently used
terms ``homologous'' and ``analogous'' probably define types A and C
respectively, with type B falling somewhere in between. In general, type A and some type B
similarities are detectable by sequence comparison methods
(see Argos, 1991 for a review),
though many type B similarities are undetectable unless one considers
3D structure or functional information for one member of the family
(e.g., Barton \& Sternberg, 1990; Bowie et al., 1991; Jones et al., 1992).
Structural similarities of type C are usually only detectable when both
3D structures are considered [Russell \& Barton, 1992][Sali \& Blundell, 1990][Taylor, 1989][Mitchell et al., 1989], with some notable exceptions
[Godzik et al., 1993][Jones et al., 1992].
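The three-state classification can be summarised as a small decision rule. The following is only a sketch: the three types are labelled A, B and C in the order in which they are introduced above, the names `SimilarityType` and `classify_pair` are hypothetical, and the two yes/no inputs are a simplification, since the divisions between types are not discrete in practice.

```python
from enum import Enum


class SimilarityType(Enum):
    """Three-state classification of pairs that share a similar 3D structure."""
    A = "sequence, structural and (usually) functional similarity"
    B = "structural and functional similarity, little sequence similarity"
    C = "structural similarity only"


def classify_pair(sequence_similar: bool, function_similar: bool) -> SimilarityType:
    """Classify a pair of proteins already known to adopt similar 3D structures.

    Both flags are assumed to be decided elsewhere (e.g., by a sequence
    comparison method and by annotation); this function only encodes the
    order of the tests.
    """
    if sequence_similar:
        return SimilarityType.A
    if function_similar:
        return SimilarityType.B
    return SimilarityType.C


if __name__ == "__main__":
    # Ig light chain variable domain versus poplar plastocyanin: fold only.
    print(classify_pair(sequence_similar=False, function_similar=False).name)
```

Testing sequence similarity first reflects the definition of type A, in which functional similarity is usual but not required.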
Protein structural families frequently contain similarities spanning types A
through C. Figure 1 shows an example for the family of greek key
$\beta$ barrel structures. The figure
shows three similar pairs: a) two Ig light chain variable domains (type A), which share functional
and sequence similarity; b) an Ig light chain variable domain and the N-terminal domain of
CD4 (type B), which are both immune system recognition proteins;
and c) an Ig light chain variable domain and poplar plastocyanin (type C), which are similar only in that they
have a similar arrangement of seven $\beta$ strands.
Despite dozens of examples of similarities of types B and C, little
is understood as to why different sequences can adopt similar 3D structures.
Most studies to date have dealt with specific families of proteins having functional
similarity (i.e., types A and B), such as the globins [Bordo \& Argos, 1991][Bordo \& Argos, 1990][Pastore et al., 1988][Bashford et al., 1987][Lesk \& Chothia, 1980],
the Ig domains [Chothia \& Lesk, 1982], blue copper (plastocyanin-like) photosynthetic proteins
[Adman, 1984][Lesk \& Chothia, 1982], nucleotide binding folds [Otto et al., 1980][Rossmann \& Argos, 1976][Rossmann et al., 1974],
oligonucleotide/oligosaccharide binding folds [Murzin, 1993][Sixma et al., 1993],
proteinases [Craik et al., 1983][Blundell et al., 1979] or
hydrolases [Ollis et al., 1992]. However,
some studies have considered more distantly related protein 3D structures (i.e., type C), such as
greek key $\beta$ barrels [Hutchinson \& Thornton, 1992][Hazes \& Hol, 1992], globin/phycocyanin/colicin A
[Holm \& Sander, 1993b][Pastore \& Lesk, 1990],
$\alpha/\beta$ barrels [Farber, 1993][Farber \& Petsko, 1990],
$\beta$ trefoils [Swindells \& Thornton, 1993][Murzin et al., 1992],
toxin-agglutinin folds [Drenth et al., 1980] or jelly-roll folds [Chelvanayagam et al., 1992].
Although the details of such studies differ, they generally suggest functional and packing
features common to a particular family, but provide few generalisations
that might be applied to other protein structural families.
Similarities of type A (and some of type B) have common features in the protein cores and around
common binding or active sites. Similarities of type C (and some of type B) often have few
common features. For example, even the most distantly related oxygen carrying globin folds (type A and B
similarities) share haem binding residues as well as several key hydrophobic core residues
[Pastore et al., 1988][Bashford et al., 1987]. However, when one adds to the family the structurally similar, but
functionally different, phycocyanin and colicin A structures, few common residues
can be found [Holm \& Sander, 1993b][Pastore \& Lesk, 1990].
There have been some investigations into the general features of structurally similar proteins.
Chothia \& Lesk (1986) considered 32 pairs of structures and found that
distantly related proteins could have as little as of their structures in a common core.
They also found a logarithmic relationship between sequence identity and RMS deviation on core
main-chain atoms; RMS deviation increased exponentially with decreasing sequence identity.
Pascarella and Argos (1992) considered families of protein 3D
structures and established general rules for the occurrence of insertions and deletions (e.g., that
they are usually 1-5 residues long, and rarely occur within helices or strands). Flores et al. (1993)
examined how RMS deviation, number of
to
contacts, solvent accessibility
angle and secondary structure behaved as a function of sequence identity for 90 pairs
of structurally similar proteins. They found an approximately inverse linear relationship
between the variation of all of these properties and sequence identity.
For pairs of structures having a similar sequence identity, they found little difference
between ``homologous'' (i.e., type A and some type B
similarities) and ``analogous'' (i.e., some
type B and type C similarities) proteins.
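The exponential dependence of core RMS deviation on sequence divergence reported by Chothia \& Lesk (1986) can be illustrated numerically. The sketch below assumes the commonly quoted form of their fit, $\Delta = 0.40\,e^{1.87H}$, where $H$ is the fraction of non-identical residues and $\Delta$ is in Angstroms; the constants and the function name `expected_core_rmsd` are assumptions made for illustration and should be checked against the original paper before use.

```python
import math


def expected_core_rmsd(sequence_identity: float,
                       prefactor: float = 0.40,
                       exponent: float = 1.87) -> float:
    """Expected RMS deviation (Angstroms) of core main-chain atoms.

    sequence_identity is a fraction in [0, 1]. The default constants are the
    commonly quoted values for the Chothia & Lesk (1986) fit and are treated
    here as assumptions.
    """
    if not 0.0 <= sequence_identity <= 1.0:
        raise ValueError("sequence_identity must be a fraction between 0 and 1")
    divergence = 1.0 - sequence_identity  # H: fraction of mutated residues
    return prefactor * math.exp(exponent * divergence)


if __name__ == "__main__":
    for identity in (1.0, 0.8, 0.5, 0.2):
        print(f"{identity:.0%} identity: {expected_core_rmsd(identity):.2f} A")
```

Because the relationship is exponential in $H$, the logarithm of the RMS deviation is linear in sequence identity, which is the sense in which the relationship is described as logarithmic.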
Detection of type B and C similarities prior to 3D structure determination is of great interest,
since detection and alignment can provide tertiary structure information via homology modelling, and
can suggest experiments to determine biological function.
In an attempt to detect more type B and C similarities than is possible by sequence comparison,
many methods for providing the best fit of a sequence to a structure have been described
(Sippl, 1990; Bowie et al., 1991; Luthy et al., 1991; Overington et al., 1992; Jones et al.,
1992; Johnson et al., 1993; Bryant \& Lawrence, 1993; Godzik et al., 1993; Wilmanns \& Eisenberg,
1993; see Bowie \& Eisenberg, 1993 or Wodak \& Rooman, 1993 for reviews).
These methods have been inspired by the earlier work of Novotný et al. (1984, 1988), which showed that purposefully misfolded proteins gave rise to favourable energies using CHARMm parameters [Brooks et al., 1983]. Novotný et al. found that the misfolded proteins had more hydrophobic residues exposed to solvent and more buried ionisable side-chains. Though the details differ, most methods for fitting sequence to 3D structure provide a measure of the quality of the fit based on one or more of: a) accessibility preferences; b) loop solvation potentials; c) secondary structure preferences; and d) amino acid pair preferences (discussed below). Sippl (1990) first suggested the use of amino acid pair preferences (derived from analysis of known protein 3D structures) for measuring sequence and structure compatibility. Pair preferences provide a measure of how likely each type of amino acid is to interact with every other type, and can be used to assess the quality of the fit of a sequence to a 3D structure if one threads a sequence onto the known structure. Optimal sequence threading involves getting the best alignment of sequence and 3D structure by a consideration of such pair preferences. The use of pair preferences means that threading, unlike most methods of protein sequence alignment, is a 3D problem, since moving residues along the sequence in one region of the structure can affect residues separated by a long length of sequence. Several threading algorithms for protein fold detection have been described [Bryant \& Lawrence, 1993][Sippl \& Weitckus, 1992][Godzik et al., 1993][Jones et al., 1992].
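The idea of scoring a threaded sequence with pair preferences can be sketched as follows. This is a deliberately simplified illustration and not any of the published algorithms: the pair-preference values in `TOY_PAIR_PREFERENCES` are invented, the function names are hypothetical, contacts are given as index pairs over the structure positions, and only gapless threadings (a single offset of the sequence along the structure) are tried, whereas real threading methods also allow insertions and deletions.

```python
from typing import Dict, FrozenSet, List, Tuple

PairTable = Dict[FrozenSet[str], float]

# Invented values for illustration only; negative means a favourable contact.
TOY_PAIR_PREFERENCES: PairTable = {
    frozenset("LV"): -1.0,  # hydrophobic pair
    frozenset("KE"): -0.5,  # oppositely charged pair
    frozenset("KL"): 0.6,   # charged residue against a hydrophobic one
    frozenset("VE"): 0.4,
}


def threading_score(sequence: str, contacts: List[Tuple[int, int]],
                    offset: int, table: PairTable) -> float:
    """Sum pair preferences over all contacts of the structure.

    Structure position i is occupied by sequence[offset + i]; residue pairs
    missing from the table contribute zero.
    """
    total = 0.0
    for i, j in contacts:
        pair = frozenset((sequence[offset + i], sequence[offset + j]))
        total += table.get(pair, 0.0)
    return total


def best_gapless_threading(sequence: str, n_positions: int,
                           contacts: List[Tuple[int, int]],
                           table: PairTable) -> Tuple[int, float]:
    """Try every gapless placement and return (offset, score) of the lowest."""
    if len(sequence) < n_positions:
        raise ValueError("sequence is shorter than the structure")
    offsets = range(len(sequence) - n_positions + 1)
    scored = [(threading_score(sequence, contacts, k, table), k) for k in offsets]
    best_score, best_offset = min(scored)
    return best_offset, best_score


if __name__ == "__main__":
    # Four structure positions; positions 0 and 3 touch, as do 1 and 2.
    contacts = [(0, 3), (1, 2)]
    print(best_gapless_threading("GKLVEG", 4, contacts, TOY_PAIR_PREFERENCES))
```

The contact between positions 0 and 3 shows why the problem is not local: shifting the sequence by one residue changes which residues meet across that contact, even though they are far apart in sequence.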
Methods of protein fold detection
have met with some success, being able to detect similarities (and provide accurate sequence alignments) between
proteins having little sequence similarity, but which are known to adopt a similar 3D structure.
However, most of the success appears to be associated with aligning structures of similarity types A and B.
Many type B and C similarities remain difficult to detect or align accurately, particularly when
pair preferences are not used. For example, the 3D-1D method of Bowie et al. (1991) is apparently
unable to detect the similarities between hexokinase and actin [Thornton et al., 1991][Bowie et al., 1991], between enterotoxin
and verotoxin [Sixma et al., 1993], or between various
$\alpha/\beta$ barrels [Pickett et al., 1992].
However, the use of pair preferences can enable detection and alignment of several type B and C
similarities. For example, the method of Jones et al. (1992) correctly found
myohemerythrin to be a plausible fold for cytochrome B562 by threading the B562 sequence onto each of a
database of representative folds,
despite the lack of sequence or functional similarity between these proteins. The method of
Godzik et al. (1993) detected the similarity between the plastocyanin and immunoglobulin structures by
using a template derived from the plastocyanin structure to search a sequence database.
An assumption common to fold detection methods is that certain structural features (such as those
described by Novotný et al. and Sippl) are conserved or shared across proteins having similar
3D structures, even in the absence of sequence similarity.
In order for these methods to be successful, secondary structure, accessibility and/or
particular side-chain to side-chain
interactions must be conserved across similar protein 3D structures.
To date, there has been no investigation into the conservation of particular side-chain
properties within distantly related proteins. Studies have concentrated on closely
related protein 3D structures (i.e., type A or some type B similarities), and these have been used
to derive environment specific parameters for side-chain substitutions
[Johnson et al., 1993][Overington et al., 1992][Bowie et al., 1991][Luthy et al., 1991][Bowie et al., 1990][Overington et al., 1990]. The high degree of similarity
in the proteins used to derive the parameters need not necessarily apply to more distantly related
protein pairs, which is, perhaps, why these methods appear to make only a marginal improvement over
methods which do not make use of 3D structural data [Henikoff \& Henikoff, 1993][Barton \& Sternberg, 1990][Lipman \& Pearson, 1985][Gribskov et al., 1987][Taylor, 1986b].
In this paper, protein 3D structural alignments are used to investigate the
conservation of side-chain accessibility, secondary structure and side-chain to side-chain
interactions within protein 3D structure pairs having a range of similarities (i.e., types
A, B and C). The importance of the results for protein fold detection methods and
protein evolution is discussed.
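As a rough illustration of what measuring conservation across a structural alignment could look like, the sketch below computes, for the structurally equivalent positions of one protein pair, the fraction with the same secondary structure state and the fraction with the same buried/exposed state. It is not the procedure used in this paper: the `AlignedPosition` record, the function `conservation_summary` and the 25\% relative accessibility cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class AlignedPosition:
    """One structurally equivalent position in a pair of proteins."""
    ss_a: str      # secondary structure state in protein a, e.g. "H", "E", "C"
    ss_b: str      # secondary structure state in protein b
    acc_a: float   # relative side-chain accessibility (%) in protein a
    acc_b: float   # relative side-chain accessibility (%) in protein b


def conservation_summary(positions: Sequence[AlignedPosition],
                         buried_cutoff: float = 25.0) -> Dict[str, float]:
    """Fraction of aligned positions with conserved state.

    A residue is called buried when its relative accessibility is below
    buried_cutoff; the default cutoff is an illustrative assumption.
    """
    if not positions:
        raise ValueError("at least one aligned position is required")
    n = len(positions)
    same_ss = sum(p.ss_a == p.ss_b for p in positions)
    same_burial = sum((p.acc_a < buried_cutoff) == (p.acc_b < buried_cutoff)
                      for p in positions)
    return {"secondary_structure": same_ss / n, "burial": same_burial / n}


if __name__ == "__main__":
    pair = [
        AlignedPosition("H", "H", 10.0, 5.0),
        AlignedPosition("H", "C", 40.0, 10.0),
        AlignedPosition("E", "E", 60.0, 70.0),
        AlignedPosition("C", "C", 20.0, 30.0),
    ]
    print(conservation_summary(pair))
```

Computing such fractions separately for pairs of each similarity type is one simple way to compare how well these properties are preserved as relationships become more distant.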