Training and testing protein sets

For development of the methods, 513 proteins from a previous study [34] were screened to remove proteins shorter than 30 residues, and those from families that contained only 2 sequences and therefore did not generate valid PSIBLAST alignment profiles. This left 480 proteins for cross-validated training of the new methods. Removing the sequence orphans may inflate the overall average accuracy of any prediction method. However, all the prediction methods studied here were tested on the same multiple sequence alignments, none of which were used in training the methods. As a consequence, unlike in earlier work [34], a direct comparison of performance between methods was possible.
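The screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this work: the `Protein` record, its field names, and the reading of the family criterion as "at least 3 sequences in the family" are all hypothetical.

```python
from dataclasses import dataclass

MIN_LENGTH = 30        # proteins shorter than 30 residues are removed
MIN_FAMILY_SIZE = 3    # assumption: families with only 2 sequences (or fewer) are removed


@dataclass(frozen=True)
class Protein:
    """Hypothetical record for one protein in the starting set."""
    name: str
    sequence: str
    family_size: int   # number of sequences in the protein's family


def screen(proteins):
    """Return the proteins that pass both the length and the family-size filter."""
    return [
        p for p in proteins
        if len(p.sequence) >= MIN_LENGTH and p.family_size >= MIN_FAMILY_SIZE
    ]
```

Applied to the 513 starting proteins, a filter of this kind would be expected to leave the 480 proteins described above, provided the family sizes are counted in the same way as in the original screening.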

The 480 training proteins were selected by a stringent definition of sequence similarity [34]. These proteins can therefore be split into training and testing sets for prediction, with minimal concern that the two sets will be contaminated with proteins of similar sequence. In this work, the data were split randomly into 7 sets to perform cross-validation tests.
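A random 7-way split of this kind can be sketched as follows. The function names, the fixed seed, and the use of Python's `random` module are assumptions made for illustration; the original procedure used to randomise the sets is not specified here.

```python
import random


def split_into_folds(items, n_folds=7, seed=0):
    """Shuffle the items and deal them into n_folds nearly equal sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed only for reproducibility
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validation_pairs(folds):
    """Yield (training, testing) pairs, holding out each fold in turn."""
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With 480 proteins and 7 sets, each set holds 68 or 69 proteins, so each round trains on 411 or 412 proteins and tests on the remainder. Every protein appears in exactly one test set.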

James Cuff
2001-06-29