
Introduction

The flood of new protein sequences demands techniques to infer protein 3D structure from sequence alone. For about 30% of protein sequences, conventional alignment techniques (e.g. [Smith & Waterman, 1981][Altschul et al., 1990][Lipman & Pearson, 1985]) or profile and pattern methods (e.g. [Barton & Sternberg, 1990][Gribskov et al., 1987]) find similarities to a protein of known 3D structure [Chothia, 1992]. The remaining 70% of protein sequences may adopt previously unseen protein folds. Alternatively, they may have topologies (folds) similar to known protein structures but share no detectable sequence similarity [Russell & Barton, 1994]. Such fold similarities will normally not be found until both protein 3D structures have been determined experimentally [Holm & Sander, 1994a][Orengo, 1994]. In an attempt to find fold similarities of this type in advance of 3D structure determination, several fold recognition techniques have been developed (see [Jones & Thornton, 1993][Wodak & Rooman, 1993][Bowie & Eisenberg, 1993] and references therein). These techniques may locate some fold similarities that are undetectable by sequence comparison. However, the methods are often computationally intensive and many similarities still go undetected [Lemer et al., 1996][Pickett et al., 1992].

In parallel with the development of fold detection methods, the accuracy of secondary structure prediction has improved from % to % on average. Though this is only a small percentage increase, recent predictions are more useful, since the application of multiple sequence alignments improves the identification of the number, type and location of core secondary structure elements. Prediction from sequence alignments can also accurately identify the position of loops, and residues likely to be buried in the protein core [Russell & Sternberg, 1995][Barton, 1995][Benner et al., 1994]. Given a good secondary structure prediction, the next question is how the secondary structures might be arranged into a tertiary fold. Ab initio methods for folding secondary into tertiary structure search for possible arrangements of secondary structures that obey general packing rules [Sun et al., 1995][Smith-Brown et al., 1993][Cohen et al., 1982][Cohen et al., 1980][Cohen & Sternberg, 1980]. These methods have been applied in numerous blind predictions [Huang et al., 1994][Jin et al., 1994][Curtis et al., 1991][Cohen et al., 1986][Hurle et al., 1987] with varied results. A limitation is the number of packing combinations that must be considered. This can become unmanageable for secondary structures [Cohen et al., 1982], though approaches to reduce the number of combinations have been described [Clark et al., 1991][Taylor, 1991].

The most successful predictions of protein tertiary structure in the absence of clear sequence similarity to a protein of known 3D structure have been those where secondary structure predictions and/or experimental information were combined to suggest resemblance to an already known fold. Correct folds have been predicted in this way for the α subunit of tryptophan synthase [Crawford et al., 1987], a family of cytokines [Bazan, 1990], and recently, for the von Willebrand factor type A domain [Edwards & Perkins, 1995] and the synaptotagmin C2 domain [Gerloff et al., 1995]. Although the details of these studies differed, all used secondary structures predicted from multiple alignments, combined with the careful application of protein structural principles (often together with experimental data), to suggest a protein fold. Two automated methods for comparing predicted and experimental secondary structures have been described previously [Rost, 1995][Sheridan et al., 1985], with promising though limited preliminary results.

In this paper we show how secondary structure and accessibility prediction together with basic rules of protein structure may be used to find the correct fold within a database of protein structural domains. The method first generates all possible matches (referred to as 'maps') between query and database secondary structure patterns, allowing for insertions and deletions of whole secondary structure elements. Maps are filtered by a series of structural criteria to arrive at a collection of sensible template structures. The sequence of the query protein is then aligned to the template structures by matching predicted and observed patterns of residue accessibility. Finally, alignments are ranked by a score that combines accessibility matching with a penalty for differences in secondary structure length. The method is designed to cope with incorrect secondary structure assignments, insertions/deletions of whole secondary structure elements, and differences in the lengths and orientations of secondary structures.
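The map-generation step described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function name `enumerate_maps`, the one-letter element codes (`H` for helix, `E` for strand), and the `max_skipped` limit are all assumptions made for illustration. The sketch pairs same-type secondary structure elements of a query and a template in order, and allows a bounded number of whole elements to be left unmatched on either side. The filtering, accessibility alignment and scoring steps are not shown.

```python
from typing import List, Tuple

# A map is a tuple of (query element index, template element index) pairs.
Map = Tuple[Tuple[int, int], ...]


def enumerate_maps(query: str, template: str, max_skipped: int = 1) -> List[Map]:
    """Return every order-preserving pairing of same-type elements.

    `query` and `template` are strings of element codes (hypothetical
    convention: 'H' = helix, 'E' = strand). At most `max_skipped` whole
    elements, counted over both strings, may be left unmatched.
    """
    found = set()  # a set removes maps reached by different skip orders

    def extend(i: int, j: int, skipped: int, pairs: Map) -> None:
        if skipped > max_skipped:
            return
        if i == len(query) or j == len(template):
            # Any elements left over on either side count as skipped.
            leftover = (len(query) - i) + (len(template) - j)
            if skipped + leftover <= max_skipped:
                found.add(pairs)
            return
        if query[i] == template[j]:
            # Pair the two elements when their types agree.
            extend(i + 1, j + 1, skipped, pairs + ((i, j),))
        # Treat the query element as an insertion (unmatched).
        extend(i + 1, j, skipped + 1, pairs)
        # Treat the template element as a deletion (unmatched).
        extend(i, j + 1, skipped + 1, pairs)

    extend(0, 0, 0, ())
    return sorted(found)


if __name__ == "__main__":
    # A predicted strand-helix-strand query against a strand-helix-helix-strand template.
    for m in enumerate_maps("EHE", "EHHE", max_skipped=1):
        print(m)
```

For the query `EHE` and template `EHHE` with one skipped element allowed, the sketch returns two maps, which differ in which template helix is left unmatched. The number of maps grows quickly with the number of elements and the skip limit, which is why a filtering step of the kind described above is needed before any residue-level alignment.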


gjb@bioch.ox.ac.uk