The most successful techniques for predicting the three-dimensional structure of a protein rely on aligning the sequence of a protein of unknown structure to a homologue of known structure (e.g. see Sali for review). Such methods fail if there is no homologue in the structural database, or if the technique for searching the structural database is unable to identify homologues that are present. While the absence of a homologue can only be remedied by further X-ray or NMR structures, up to 4/5 of known homologues may be missed even by the best conventional pairwise sequence comparison methods.
Techniques that exploit evolutionary information from protein families[3,4,5,6,7,8,9] or use empirical pair-potentials[10,11] can normally detect more homologues than pairwise sequence comparison methods. An even greater challenge is to detect proteins that share similar folds but are not clearly derived from a common ancestor (e.g. the Rossmann fold domains of lactate dehydrogenase and glycogen phosphorylase, and SH2-BirA).
Techniques for the prediction of protein secondary structure provide information that is useful both in ab initio structure prediction and as an additional constraint for fold-recognition algorithms[13,14,15]. Knowledge of secondary structure alone can help in the design of site-directed or deletion mutants that will not destroy the native protein structure. However, for all these applications it is essential that the secondary structure prediction be accurate, or at least that the reliability of the prediction for each residue can be assessed.
The majority of secondary structure prediction algorithms derive parameters or rules from an analysis of proteins of known three-dimensional structure. The algorithm then applies these parameters to the sequence of unknown structure. Such approaches rely on having sufficient data both to obtain reliable parameters and to avoid over-training on a specific data set.
Early algorithms to predict protein secondary structure[16,17,18] claimed high prediction accuracy, but on small datasets that were also used in training the methods. For example, Lim (1974) quoted 70% Q3 accuracy on a dataset of 25 proteins, Garnier et al. (1978) achieved 63% accuracy for a different set of 26 proteins, and Chou & Fasman (1974) quoted 77% for yet another set of 19 proteins.
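The Q3 measure quoted above is the percentage of residues assigned to the correct one of three states. The following is a minimal sketch, assuming predictions and observed assignments are given as equal-length strings over the letters H (helix), E (strand) and C (coil); the function name and the example strings are illustrative only and are not taken from any of the cited methods.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Assumes one character per residue (H, E or C) in both strings.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 17-residue example: 14 residues agree, 3 do not.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEECCC"
score = q3_accuracy(predicted, observed)
```

On a dataset of several proteins, the per-residue counts may either be pooled or averaged per protein; the two conventions can give different values, so the choice should be stated alongside any reported figure.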
The use of different datasets in training and testing each algorithm makes it difficult to compare methods objectively. For this reason, Kabsch & Sander (1983) tested the prediction methods by applying the algorithms to proteins that were not used in their development. In this independent test, the accuracy of the GOR method of Garnier et al. fell by 7 percentage points to 56%, that of Lim by 14 percentage points to 56%, and that of Chou-Fasman by 27 percentage points to 50%. Cross-validation techniques, in which test proteins are removed from the training set, have since allowed a more realistic evaluation of prediction accuracy.
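The idea of removing test proteins from the training set can be sketched as a k-fold split. This is an illustration only, assuming each protein is represented by an identifier and that the list is already non-redundant; the function name `k_fold_splits` and the identifiers are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(
    protein_ids: Sequence[str], k: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training, test) lists so each protein is tested exactly once."""
    if not 2 <= k <= len(protein_ids):
        raise ValueError("k must be between 2 and the number of proteins")
    for fold in range(k):
        # Proteins whose index falls in this fold are held out for testing.
        test = [p for i, p in enumerate(protein_ids) if i % k == fold]
        # All remaining proteins are available for deriving parameters.
        train = [p for i, p in enumerate(protein_ids) if i % k != fold]
        yield train, test


# Hypothetical identifiers, used only to show the call.
proteins = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
splits = list(k_fold_splits(proteins, k=3))
```

Setting k equal to the number of proteins gives a leave-one-out (jack-knife) test. The split is only meaningful if no test protein has a close homologue in the training list, which this sketch does not check.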
Prediction from a multiple alignment of protein sequences rather than a single sequence has long been recognised as a way to improve prediction accuracy. During evolution, residues with similar physico-chemical properties are conserved if they are important to the fold or function of the protein. This makes patterns of hydrophobic residues characteristic of particular secondary structures easier to identify. Analysis of conservation in protein families has been effective in many secondary structure predictions performed before the protein structure was known[22,23,24,25]. Zvelebil et al. (1987) developed an automatic procedure that showed a 9% improvement in prediction accuracy on a small set of protein families when multiple sequence data were included. Most current secondary structure prediction algorithms exploit similar principles to gain higher accuracy than is possible from a single sequence[27,28,29,30]. The recent CASP series of experiments, in which predictions are made blind, has shown that recently claimed accuracies for secondary structure prediction algorithms are within reasonable limits.
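As an illustration of how an alignment exposes hydrophobic patterns, the sketch below computes the fraction of hydrophobic residues in each alignment column. It is not the procedure of any cited method. The set of residues treated as hydrophobic and the gap characters are assumptions made for the example.

```python
from typing import List, Sequence

# Assumed hydrophobic set for illustration; other classifications exist.
HYDROPHOBIC = set("AVLIMFWYC")
GAP_CHARS = set("-.")


def hydrophobic_fraction(alignment: Sequence[str]) -> List[float]:
    """Return, per column, the fraction of non-gap residues that are hydrophobic."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    fractions = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in GAP_CHARS]
        if not residues:
            # A column consisting only of gaps carries no information.
            fractions.append(0.0)
        else:
            hydrophobic = sum(r in HYDROPHOBIC for r in residues)
            fractions.append(hydrophobic / len(residues))
    return fractions


# Hypothetical three-sequence alignment of four columns.
profile = hydrophobic_fraction(["LKAL", "IEGL", "VK-M"])
```

A column that is hydrophobic in every sequence stands out in such a profile even when the residue itself varies, which is the signal a single sequence cannot provide.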
Prediction accuracy has also been improved by combining more than one algorithm on a single sequence[33,34,35,36,37]. For example, Zhang et al. (1992) obtained 66.4% accuracy on a set of 107 proteins, an improvement of 2% over the best individual method they considered.
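One simple way to combine several predictions is a per-residue majority vote. The sketch below illustrates that general idea and is not the combination scheme of any cited work. It assumes every method returns a string of equal length, and it breaks ties in favour of the earliest method in the list, which is an arbitrary choice made for the example.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Combine per-residue state strings by taking the most common state."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        # Tie-break (assumption): the first listed method among the leaders wins.
        winner = next(state for state in column if counts[state] == best)
        consensus.append(winner)
    return "".join(consensus)


# Hypothetical outputs from four methods for an 8-residue segment.
combined = majority_vote(["HHHCCEEE", "HHCCCEEC", "HHHHCEEE", "CHHCCEEE"])
```

Because ties are resolved by list order, placing the method with the highest individual accuracy first makes it the deciding vote whenever the others disagree evenly.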
In this paper we describe datasets and procedures for the evaluation of current techniques for secondary structure prediction. We discuss the effects of homology within the training and test datasets and describe new non-redundant datasets appropriate for developing secondary structure prediction algorithms. We evaluate the accuracy of four recently published algorithms that exploit multiple sequence data (NNSSP, PHD, DSC and PREDATOR) and two older methods, ZPRED and MULPRED (Barton, unpublished). We develop an algorithm that combines the predictions of PHD, DSC, PREDATOR and NNSSP and show that it gives a 1% improvement in average accuracy over the best single method. Finally, we investigate the effect of the quality of the multiple sequence alignment used in prediction, the effect of the secondary structure assignment algorithm (DSSP, DEFINE and STRIDE) and the influence of redundancy in the multiple alignments.