The Barton Group

A consensus method for protein secondary structure prediction


Jpred distribution material

513_distribute.tar.gz

513 non-redundant sequences that can be used to test new secondary structure prediction methods. 396 sequences are derived from the 3Dee database of protein domains, and 117 proteins come from the Rost and Sander set of 126 non-redundant proteins. All sequences in this set have been compared pairwise and are non-redundant to a 5SD cut-off.

The file contains secondary structure definitions from the DSSP, DEFINE and STRIDE methods. No 8-to-3 state reduction is carried out on the definition data. Each protein also has a multiple sequence alignment associated with the target sequence. This alignment was built using the automatic alignment method within Jpred.

The format is a simple comma-separated file, e.g.:

DSSP:-,-,-,G,G,G,-,-,-,E,E,E,E,E,-,-,-,H,H,H,H,H,-,

No predictions from any method are included in this file, only definitions.
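
As a rough illustration of how a line in this format might be read, here is a minimal Python sketch. It assumes that every definition line has the shape shown in the example above (a method label, a colon, then one state per residue separated by commas, with a trailing comma). The function name `parse_definition_line` is hypothetical and is not part of the Jpred distribution.

```python
def parse_definition_line(line):
    """Split a line such as 'DSSP:-,-,G,' into (label, list_of_states).

    Assumption: the label is everything before the first colon and the
    states are single comma-separated fields, as in the example line.
    """
    label, _, payload = line.strip().partition(":")
    states = payload.split(",")
    # The example line ends with a comma, which leaves one empty final field.
    if states and states[-1] == "":
        states.pop()
    return label, states


example = "DSSP:-,-,-,G,G,G,-,-,-,E,E,E,E,E,-,-,-,H,H,H,H,H,-,"
label, states = parse_definition_line(example)
print(label, len(states), "".join(states))
```

Keeping the states as a list rather than a joined string means the position of each entry still corresponds to a residue position, even if a file were to use multi-character fields.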

396_predictions_distribute.tar.gz

This set contains the 396 predictions used in the Cuff J. A. and Barton G. J., Proteins (1999) paper. The Q3 accuracies were generated by taking G and B as Helix and Strand respectively. This file is also in 'concise' comma-separated format.

406_distribute.tar.gz

This set of 406 protein chains was used to validate the Jnet program in the 2000 Proteins paper.

CASP predictions

Predictions for all the CASP targets that were not docking targets. These predictions were made during the CASP3 assessment, so all CASP3 targets are valid predictions, whereas predictions for targets from the other CASPs may be contaminated, as the prediction methods may now have those structures in their databases.

CASP profile predictions tar.gz file

BLOCK format file containing all the CASP predictions.
Jpred predictions

These are HTML-rendered predictions from the methods currently implemented in the Jpred server. The prediction accuracies for the 126 protein set will be artificially high, as all the methods had this data in their training sets. Only PHD was run in cross-validated mode for this set.

The 396 domain proteins that are of the form 1edmc-1-XXXX were used to obtain the results for Jpred. This subset has no sequence homology to the 126 protein set, and achieved a Q3 accuracy of 72.9% for the consensus and 71.9% for PHD, the next best method. The 513 set shown here is composed of the 396 non-redundant set plus the 117 proteins of the 126 set whose similarity scores to the proteins in the 396 set were lower than 5SD.
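
For readers who want to reproduce a Q3 figure from the definition files, the sketch below shows one way the calculation might be set up. It is not the code used for the published results. It assumes that Q3 is the percentage of residues whose predicted three-state class matches the observed class, that G is counted as helix and B as strand (as stated for the 396 prediction set), and that every other definition symbol is treated as coil and written as `-`. The names `REDUCTION`, `reduce_to_three_states` and `q3` are hypothetical.

```python
# Assumed 8-to-3 state reduction: H and G become helix (H), E and B become
# strand (E); any other symbol is treated as coil, written here as "-".
REDUCTION = {"H": "H", "G": "H", "E": "E", "B": "E"}


def reduce_to_three_states(states):
    """Map a sequence of definition symbols onto H, E or '-' (coil)."""
    return [REDUCTION.get(s, "-") for s in states]


def q3(predicted, observed):
    """Percentage of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot compute Q3 for an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy data, invented purely for illustration (not taken from the data sets).
observed = reduce_to_three_states(["-", "G", "G", "G", "E", "B", "H", "H", "T", "-"])
predicted = ["-", "H", "H", "-", "E", "E", "H", "H", "H", "-"]
print(q3(predicted, observed))  # 8 of 10 positions agree -> 80.0
```

Because the distributed definitions are not reduced, the choice of reduction directly affects the reported accuracy, so any comparison between methods should state which mapping was applied.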