If a protein sequence shows clear similarity to a protein of known three-dimensional structure, then the most accurate way to predict its secondary structure is to align the two sequences by standard dynamic programming algorithms and infer the secondary structure from the known structure, since homology modeling is much more accurate than secondary structure prediction at high levels of sequence identity. Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable. Accordingly, it is important that there is no detectable sequence similarity between the sequences used to train and those used to test secondary structure prediction methods.
Most secondary structure prediction methods include a set of parameters that must be estimated. Values for the parameters are obtained by statistical analysis of, or learning from, a set of proteins for which the tertiary structure is known. This is the training set of proteins. Testing predictive accuracy on the training set leads to unrealistically high accuracies. An objective test of a secondary structure prediction method will predict the structures of a test set of proteins that are not in the training set and show no detectable sequence similarity with the training set. If the test is to be balanced, then both training and test sets should have a similar distribution of secondary structure classes and types.
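One simple way to check the balance described above is to compare the fraction of residues in each secondary structure state in the training and test sets. The sketch below is only illustrative: it assumes that each protein's structure is given as a string over a three-state alphabet (H, E, C), and the function name `state_fractions` is hypothetical rather than part of any published method.

```python
from collections import Counter
from typing import Dict, Iterable


def state_fractions(structures: Iterable[str], states: str = "HEC") -> Dict[str, float]:
    """Return the fraction of residues in each state, pooled over all proteins.

    Assumption: each structure is a string with one state letter per residue.
    Letters outside `states` are ignored.
    """
    counts: Counter = Counter()
    for structure in structures:
        counts.update(structure)
    total = sum(counts[s] for s in states)
    if total == 0:
        raise ValueError("no residues in the requested states")
    return {s: counts[s] / total for s in states}


# Toy data, invented for illustration only.
train_structures = ["HHHHEECC", "CCEEHHHH"]
test_structures = ["HHEC", "HHEC"]

train_fractions = state_fractions(train_structures)
test_fractions = state_fractions(test_structures)
```

Large differences between the two dictionaries would suggest that the split is not balanced in the sense used here. This checks residue-level composition only, not the distribution of protein structural classes.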
Since the number of proteins of known structure is limited, it is normal to develop secondary structure prediction methods by cross-validation, or jack-knife, techniques. In a full jack-knife test of N proteins, one protein is removed from the set, the parameters are developed on the remaining N-1 proteins, then the structure of the removed protein is predicted and its accuracy measured. This process is repeated N times, removing each protein in turn. Since some training techniques are very time consuming, a more limited cross-validation is often performed. The set of proteins is split into M equally balanced subsets, where M is smaller than N. Parameters are developed on (M-1)N/M proteins, then tested on the remaining N/M proteins. This process is repeated M times, once for each subset. The jack-knife process as described may also be referred to as a leave-one-out technique; the two terms have become somewhat synonymous.
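The two schemes described above can be sketched as generators of training and test index lists. This is a minimal illustration, not the procedure used in any particular method: the function names are hypothetical, and the round-robin assignment balances only subset size, not secondary structure composition.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def jackknife_splits(n: int) -> Iterator[Split]:
    """Full jack-knife: hold out each of the n proteins in turn."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, [held_out]


def m_fold_splits(n: int, m: int) -> Iterator[Split]:
    """Limited cross-validation: split n proteins into m subsets."""
    if not 1 < m <= n:
        raise ValueError("m must satisfy 1 < m <= n")
    # Deal indices round-robin so subset sizes differ by at most one.
    folds = [list(range(k, n, m)) for k in range(m)]
    for k in range(m):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


# With N = 12 and M = 4, each round trains on (M-1)N/M = 9 proteins
# and tests on the remaining N/M = 3.
splits = list(m_fold_splits(12, 4))
```

Setting `m` equal to `n` reproduces the full jack-knife, which is why the limited scheme is attractive only when training is expensive.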
Cross-validation appears to remove the problem of a limited data set for training and testing. However, artificially high accuracies can be obtained for some methods if the proteins used in the cross-validation show sequence similarity to each other. Accordingly, cross-validation sets must be pruned stringently to remove internal sequence similarities or, if this is not possible, a completely independent test set must be used.
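Pruning a set to remove internal similarities can be sketched as a greedy filter: a protein is kept only if it is not similar to any protein already kept. The code below is an illustration under stated assumptions. The similarity test is passed in as a function, and `toy_identity_similar` is a deliberately crude placeholder (ungapped identity with an arbitrary cutoff), not a recommended measure of detectable similarity.

```python
from typing import Callable, List, Sequence


def prune_similar(
    sequences: Sequence[str],
    is_similar: Callable[[str, str], bool],
) -> List[int]:
    """Return indices of a subset in which no pair is flagged as similar.

    Greedy: the result depends on input order and is not guaranteed
    to be the largest possible non-redundant subset.
    """
    kept: List[int] = []
    for i, seq in enumerate(sequences):
        if not any(is_similar(seq, sequences[j]) for j in kept):
            kept.append(i)
    return kept


def toy_identity_similar(a: str, b: str, cutoff: float = 0.5) -> bool:
    """Placeholder criterion: ungapped identity over the shorter length.

    Assumption for illustration only; a real test would use an
    alignment-based measure of similarity.
    """
    n = min(len(a), len(b))
    if n == 0:
        return False
    matches = sum(x == y for x, y in zip(a, b))
    return matches / n >= cutoff


# Toy sequences, invented for illustration only.
toy_sequences = ["ACDEFGHIK", "ACDEFGHIR", "WWWWWWWWW"]
kept_indices = prune_similar(toy_sequences, toy_identity_similar)
```

Here the second sequence is dropped because it differs from the first at a single position, while the third is retained. Each candidate is compared against every kept protein, so the cost grows quadratically with the size of the set.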
Selection of suitable test and training sets rests on the definition of 'undetectable' sequence similarity. Appropriate measures of sequence similarity are discussed in the following section.
There are now 500 sequence-dissimilar proteins of known three-dimensional structure available that are suitable for developing and testing secondary structure prediction techniques. However, many of the current generation of secondary structure prediction methods were developed on a set of 126 protein chains proposed by Rost & Sander (referred to here as RS126). In this paper we develop a new, non-redundant set of 396 protein domains (the CB396 set) that does not include proteins from the RS126 set.