
Summary and Conclusions

In this study we have developed a new, non-redundant test set of 396 protein domains (CB396). The set does not include any of the 126 proteins with which many current methods have been trained, nor does it contain homologues of those 126 proteins as measured by a stringent test of sequence similarity. We have shown that combining four secondary structure prediction methods, DSC[29], PHD[27], PREDATOR[28] and NNSSP[30], by a simple majority-wins method improves the average three-state Q3 prediction accuracy on the CB396 set by 1%, from 71.9% (PHD) to 72.9%. A fair comparison of the accuracy of the constituent methods is only possible for PHD[27] and DSC[29], as all other algorithms included some of our test proteins in their training sets. Despite this, PHD[27] still gave the highest accuracy on the new test set (71.9%) of any of the methods considered.
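
The majority-wins combination and the Q3 measure described above can be illustrated with a minimal Python sketch. The function names, the toy prediction strings and the tie-breaking rule (falling back to one nominated method when the top two states receive equal votes) are assumptions made for this illustration; the text does not state how ties were resolved.

```python
from collections import Counter


def majority_consensus(predictions, tie_breaker):
    """Combine per-residue three-state predictions by simple majority vote.

    predictions: dict mapping method name -> string over 'H', 'E', 'C'
    tie_breaker: name of the method whose call is used when the top two
                 states tie (an assumption of this sketch)
    """
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all predictions must have the same length")

    consensus = []
    for i in range(lengths.pop()):
        votes = Counter(p[i] for p in predictions.values())
        ranked = votes.most_common()
        # A tie exists when the two leading states have equal vote counts.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append(predictions[tie_breaker][i])
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


def q3(predicted, observed):
    """Percentage of residues whose three-state call matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy data, invented for illustration only.
predictions = {
    "DSC":      "HHHHCCEEEC",
    "PHD":      "HHHCCCEEEC",
    "PREDATOR": "HHHHCCCEEC",
    "NNSSP":    "CHHHHCEECC",
}
observed = "HHHHCCEEEC"

consensus = majority_consensus(predictions, tie_breaker="PHD")
print(consensus, q3(consensus, observed))
```

In this toy case each method makes errors at different positions, so the vote recovers the observed string even though no single method does. Real gains depend on how correlated the methods' errors are.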

An automatic procedure for database searching to build a multiple sequence alignment has been developed. Alignments from this procedure give a 1.9% increase in the average accuracy of prediction compared to previously published results for the PHD algorithm on the 126 protein set[64]. The increase may be attributed to better alignments and the increased size of the current sequence databases.

In the literature there are different standards for reducing DSSP[38] 8-state (H,C,B,E,T,S,G,I) assignments to 3 states (H,C,E). We found that changing the reduction method can alter the apparent prediction accuracy by over 3% on average. Although we were unable to train the methods using different 8-to-3-state reductions, testing all methods with different reduction methods showed that Method B[58] consistently gave higher accuracy. This may be attributed to Method B assigning more of the protein to Coil (C).
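
The effect of the reduction scheme on apparent accuracy can be seen in a small Python sketch. The two mappings below (`BROAD` and `STRICT`) are hypothetical examples chosen for illustration and are not the definitions of the reduction methods compared in this study; the sequences are likewise invented.

```python
# Two illustrative 8-to-3 state reductions. These mappings are assumptions
# made for demonstration, not the reduction methods referred to in the text.
BROAD = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
STRICT = {"H": "H", "E": "E"}


def reduce_states(dssp, mapping):
    """Map each 8-state symbol to H or E via `mapping`; all others become C."""
    return "".join(mapping.get(state, "C") for state in dssp)


def q3(predicted, observed):
    """Percentage of residues whose three-state call matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy 8-state assignment and a fixed three-state prediction (both invented).
dssp_states = "CHHHHGGGTSEEEEBC"
predicted = "CHHHHCCCCCEEEECC"

for name, mapping in (("broad", BROAD), ("strict", STRICT)):
    observed = reduce_states(dssp_states, mapping)
    print(name, observed, q3(predicted, observed))
```

The prediction is identical in both cases; only the reference assignment changes. Here the stricter mapping, which sends more residues to coil, scores the same prediction higher, which is in line with the explanation suggested above for the behaviour of Method B.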

The secondary structure definition methods DSSP[38], DEFINE[39] and STRIDE[40] were compared. All three agree at only 75% of positions. This is mainly due to differences between DEFINE and DSSP/STRIDE. DSSP and STRIDE agree at 95% of positions, though DSSP defines many more 4-residue helices than STRIDE.
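
Agreement figures of this kind can be computed as the fraction of positions at which the assignments coincide. The sketch below assumes that each definition method has already been reduced to the same three-state alphabet and aligned residue by residue; the function name and the toy strings are invented for illustration and do not reproduce the values reported above.

```python
from itertools import combinations


def percent_agreement(*assignments):
    """Percentage of positions at which all given assignments are identical."""
    lengths = {len(a) for a in assignments}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("assignments must be non-empty and of equal length")
    # zip(*assignments) yields one column of states per residue position.
    identical = sum(len(set(column)) == 1 for column in zip(*assignments))
    return 100.0 * identical / lengths.pop()


# Toy three-state assignments for one chain (invented for illustration).
definitions = {
    "DSSP":   "CHHHHCCEEEEC",
    "STRIDE": "CHHHHHCEEEEC",
    "DEFINE": "CCHHHHCCEEEC",
}

for a, b in combinations(definitions, 2):
    print(a, b, round(percent_agreement(definitions[a], definitions[b]), 1))
print("all three", percent_agreement(*definitions.values()))
```

Three-way agreement can never exceed the lowest pairwise agreement, so a single outlying method is enough to pull the overall figure down.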

In summary, with the alignment method presented here, the method with the highest average accuracy on the new non-redundant test set of 396 proteins was PHD[64] with 71.9%. The new combination of NNSSP[30], PHD[27], DSC[29] and PREDATOR[28] presented here improves upon this figure by 1% to 72.9%.

The non-redundant datasets constructed during this analysis will facilitate the future development and testing of secondary structure prediction methods. The datasets, alignments and definitions are available via http://barton.ebi.ac.uk.


james@ebi.ac.uk