The flood of new protein sequences demands techniques to infer protein
3D structure from sequence alone. For about
30% of protein sequences, conventional alignment techniques
(e.g. [Smith \& Waterman, 1981][Altschul et al., 1990][Lipman \& Pearson, 1985]) or profile and pattern
methods (e.g. [Barton \& Sternberg, 1990][Gribskov et al., 1987]) find similarities to a protein
of known 3D structure [Chothia, 1992]. The remaining
70% of protein sequences may adopt previously unseen protein folds.
Alternatively, they may have topologies (folds) similar to known protein
structures but share no detectable sequence similarity [Russell \& Barton, 1994]. Such
fold similarities will normally not be found until both protein 3D
structures have been determined experimentally
[Holm \& Sander, 1994a][Orengo, 1994]. In an attempt to find fold similarities of this type
in advance of 3D structure determination, several fold recognition
techniques have been developed (see [Jones \& Thornton, 1993][Wodak \& Rooman, 1993][Bowie \& Eisenberg, 1993]
and references therein). These techniques may locate some fold similarities that
are undetectable by sequence comparison. However, the methods
are often computationally intensive, and many similarities still go
undetected [Lemer et al., 1996][Pickett et al., 1992].
In parallel with the development of fold detection methods, the
accuracy of secondary structure prediction has improved from %to
%on average. Though this is only a small
percentage increase, recent predictions are more useful because the use
of multiple sequence alignments improves the
identification of the number, type and location of core secondary
structure elements. Prediction from sequence alignments can also
accurately identify the position of loops, and residues likely to be
buried in the protein core
[Russell \& Sternberg, 1995][Barton, 1995][Benner et al., 1994]. Given a good secondary structure
prediction, the next question to ask is how the secondary structures might
be arranged into a tertiary fold. Ab initio methods for folding
secondary into tertiary structure
search for possible arrangements of secondary structures
that obey general packing rules
[Sun et al., 1995][Smith-Brown et al., 1993][Cohen et al., 1982][Cohen et al., 1980][Cohen \& Sternberg, 1980].
These methods have been applied in numerous blind predictions
[Huang et al., 1994][Jin et al., 1994][Curtis et al., 1991][Cohen et al., 1986][Hurle et al., 1987] with varied results.
A limitation is the number of packing combinations that must be
considered. This can become unmanageable for
secondary structures
[Cohen et al., 1982], though approaches to reduce the number of combinations
have been described [Clark et al., 1991][Taylor, 1991].
The most successful predictions of protein tertiary structure in the
absence of clear sequence similarity to a protein of known 3D
structure have been those in which secondary structure predictions
and/or experimental information were combined to suggest resemblance
to an already known fold. Correct folds have been predicted in this
way for the subunit of tryptophan synthase
[Crawford et al., 1987], a family of cytokines [Bazan, 1990], and
recently, for the von Willebrand factor type A domain
[Edwards \& Perkins, 1995], and the synaptotagmin C2 domain [Gerloff et al., 1995].
Although the details of these studies differed, all used predicted
secondary structures from multiple alignment, combined with the
careful application of protein structural principles (often together
with experimental data) to suggest a protein fold. Two automated
methods for comparing predicted and experimental secondary structures
have been described previously [Rost, 1995][Sheridan et al., 1985] with promising
though limited preliminary results.
In this paper we show how secondary structure and accessibility
prediction together with basic rules of protein structure may be used
to find the correct fold within a database of protein structural
domains. The method first generates all possible matches (referred to
as `maps') between query and database secondary structure patterns,
allowing for insertions and deletions of whole secondary structure
elements. Maps are filtered by a series of structural criteria to
arrive at a collection of sensible template structures. The sequence
of the query protein is then aligned to the template structures by
matching predicted and observed patterns of residue accessibility.
Finally, alignments are ranked by a score that combines accessibility
matching with a penalty for differences in secondary structure length.
The method is designed to cope with incorrect secondary structure
assignments, insertions/deletions of whole secondary structure
elements, and differences in the lengths and orientations of secondary
structures.
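
The mapping and ranking steps outlined above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. All names (`enumerate_maps`, `map_score`, `max_skipped_query`, `max_skipped_template`, `length_weight`) are hypothetical, secondary structure patterns are reduced to strings of `H` (helix) and `E` (strand), and the scoring form (an accessibility term minus a weighted sum of length differences) is an assumed simplification. The sketch also requires mapped elements to share the same type, which is stricter than a method that tolerates incorrect secondary structure assignments.

```python
from itertools import combinations


def enumerate_maps(query, template, max_skipped_query=1, max_skipped_template=2):
    """Yield maps between two secondary structure patterns.

    query, template: strings of 'H'/'E' in sequence order.
    A map is a list of (query_index, template_index) pairs that keeps
    sequence order. Unmapped elements represent insertions/deletions of
    whole secondary structure elements (limits are assumed parameters).
    """
    nq, nt = len(query), len(template)
    min_matched = max(1, nq - max_skipped_query)
    for k in range(min_matched, min(nq, nt) + 1):
        if nt - k > max_skipped_template:
            continue  # too many template elements would be left unmapped
        for q_idx in combinations(range(nq), k):
            for t_idx in combinations(range(nt), k):
                # Assumption: mapped elements must be of the same type.
                if all(query[q] == template[t] for q, t in zip(q_idx, t_idx)):
                    yield list(zip(q_idx, t_idx))


def map_score(mapping, query_lengths, template_lengths,
              accessibility_score, length_weight=1.0):
    """Combine an accessibility term with a length-difference penalty.

    accessibility_score is taken as given here; computing it would
    require aligning predicted and observed accessibility patterns.
    """
    length_diff = sum(abs(query_lengths[q] - template_lengths[t])
                      for q, t in mapping)
    return accessibility_score - length_weight * length_diff


if __name__ == "__main__":
    # Query strand-helix-strand against template strand-helix-helix-strand.
    maps = list(enumerate_maps("EHE", "EHHE",
                               max_skipped_query=0, max_skipped_template=1))
    for m in maps:
        print(m, map_score(m, [5, 10, 6], [5, 8, 12, 7], 10.0))
```

Exhaustive enumeration of this kind grows combinatorially with the number of elements, which is one reason a filtering stage on structural criteria is needed before the more expensive accessibility alignment.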