Training and test sets of protein structures


Rost & Sander (1993) selected 126 proteins with which to train and test secondary structure prediction algorithms[27]. They defined non-redundancy to mean that no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues. Unfortunately, as shown below, the RS126 set contains pairs of proteins that are clearly sequence similar when compared by methods more sophisticated than percentage identity.
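The Rost & Sander criterion can be written as a simple pairwise test. The sketch below is illustrative only: the function names are hypothetical, the input is assumed to be a pair of already aligned sequences with `-` as the gap character, and percentage identity is assumed to be computed over ungapped alignment columns, which the text does not specify.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percentage identity, number of ungapped columns).

    Assumption: identity is measured over columns where neither
    sequence has a gap. Other denominators are in common use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0, 0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns), len(columns)


def is_redundant_rs(aligned_a, aligned_b, max_identity=25.0, min_length=80):
    """True if the pair exceeds 25% identity over more than 80 residues."""
    pid, length = percent_identity(aligned_a, aligned_b)
    return pid > max_identity and length > min_length
```

Because the test looks only at identities, two unrelated sequences of similar unusual composition can pass or fail it for reasons unrelated to homology.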

Percentage identity has long been known to be a poor measure of sequence similarity, particularly for values below 30%. Percentage identity is dependent upon both the length of the alignment[42] and the composition of the sequences. Thus, two sequences of similar unusual amino acid composition may give high values of percentage identity, even when unrelated.

Brenner et al. (1998) recently quantified the deficiencies of percentage identity as a score for protein sequence database searches[2]. Even with the length correction suggested by Sander & Schneider (1991)[42], percentage identity performed significantly worse than measures that consider conservative substitutions as well as identities, and that attempt corrections for length and composition.

Fortunately, techniques exist that overcome the deficiencies of percentage identity and other simple measures of sequence similarity. A long-established method[43,6] to measure the similarity between two protein sequences A and B is first to align the proteins by a standard dynamic programming algorithm (e.g. Needleman & Wunsch (1970)[44]) and obtain the alignment score V. The order of amino acids in each protein sequence is then randomised and a dynamic programming alignment of the randomised sequences is performed. This process is repeated, typically 100 or more times, and the mean $\bar{x}$ and standard deviation $\sigma$ of the scores for comparison of the randomised sequences are calculated. The SD score, or Z score, for comparison of the native sequences is given by $\frac{V-\bar{x}}{\sigma}$. Unlike percentage identity, the SD score corrects for bias due to the length and composition of the sequences. Accordingly, we use SD scores to derive our non-redundant test set of protein sequences.
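The shuffling procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the AMPS implementation: the identity-based scoring (match +1, mismatch -1, linear gap -2), the function names, the fixed random seed and the use of the sample standard deviation are all assumptions made for the example.

```python
import random
import statistics


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def shuffled(seq, rng):
    """Return seq with its residue order randomised (composition preserved)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def sd_score(a, b, n_shuffles=100, seed=0):
    """SD (Z) score: (V - mean) / sigma over shuffled-sequence alignments."""
    rng = random.Random(seed)
    v = nw_score(a, b)
    scores = [nw_score(shuffled(a, rng), shuffled(b, rng))
              for _ in range(n_shuffles)]
    mean = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (v - mean) / sigma
```

Shuffling preserves both the length and the composition of each sequence, which is why the resulting score is corrected for those two sources of bias.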

PHD[27], NNSSP[30], DSC[29], and PREDATOR[28] have been trained on the Rost & Sander set of 126 proteins. The release versions of PREDATOR and NNSSP available for this analysis were trained on larger sets that included the 126 proteins. In principle, this should give PREDATOR and NNSSP an advantage over PHD.

The sequences in the test set developed here came from the 3Dee[45] database of structural domain definitions. In 3Dee, a non-redundant sequence set was created by the use of a sensitive sequence comparison algorithm and cluster analysis, rather than a simple percentage identity cutoff. This provided a set of 1233 domains in which no pair shared obvious sequence similarity. The new test set was derived from these domains by first removing multi-segment domains, which reduced the set from 1233 to 988 sequences. The sequences were then filtered to retain only X-ray crystal structures with a resolution of $\le$ 2.5 Angstroms. This left a representative set of 554 domain sequences, referred to as CB554.
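The two filters applied to the 3Dee domains amount to a predicate over per-domain metadata. The sketch below assumes a hypothetical `Domain` record; the field names and the `"X-ray"` label are illustrative and do not reflect the 3Dee schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Domain:
    name: str                    # hypothetical identifier
    n_segments: int              # number of chain segments in the domain
    method: str                  # e.g. "X-ray" or "NMR" (assumed labels)
    resolution: Optional[float]  # Angstroms; None when not reported


def keep_for_test_set(d, max_resolution=2.5):
    """Keep single-segment X-ray domains with resolution <= 2.5 Angstroms."""
    return (d.n_segments == 1
            and d.method == "X-ray"
            and d.resolution is not None
            and d.resolution <= max_resolution)


def filter_domains(domains):
    """Apply the filter to a list of Domain records, preserving order."""
    return [d for d in domains if keep_for_test_set(d)]
```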

To ensure that the CB554 domain set had no sequence similarity to the RS126 set, the two sets were combined and all pairs of sequences were compared by AMPS[6] with the BLOSUM62 matrix and a gap penalty of 10. Alignments with an SD score of $\ge$ 5 were regarded as sequence similar[6,46]. According to this stringent definition of similarity, there were 11 sequence-similar pairs within the RS126 protein set, 119 pairs between the CB554 domain set and RS126, and 21 pairs within CB554. Thus, there were 140 sequences in CB554 that matched either a sequence in CB554 or one in the RS126 protein set. Of the 140, 3 sequences matched more than once, leaving 137 unique sequences. The 137 sequences were removed from CB554, leaving 417 sequences that were not sequence similar either to any other sequence within the set of 417 or to any sequence in the RS126 set. Of the 417 domain sequences remaining, 21 that did not have 'full DSSP definitions' (i.e. those with more than 9 consecutive residues with incomplete backbones, for which DSSP[38] does not define a state) were also removed, leaving a test set of 396 proteins (CB396).
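The 'full DSSP definition' test and the bookkeeping that leads from CB554 to CB396 can be checked with a short sketch. The `?` symbol used here for an undefined state is a placeholder chosen for the example, not DSSP's own notation, and the function names are hypothetical; the counts are those quoted in the text.

```python
def longest_run(states, symbol):
    """Length of the longest run of consecutive `symbol` in `states`."""
    best = run = 0
    for s in states:
        run = run + 1 if s == symbol else 0
        best = max(best, run)
    return best


def has_full_dssp_definition(states, undefined="?", max_run=9):
    """False if more than `max_run` consecutive residues lack a state."""
    return longest_run(states, undefined) <= max_run


# Bookkeeping with the counts quoted in the text.
cb554 = 554
matched = 119 + 21               # CB554-vs-RS126 pairs plus within-CB554 pairs
unique_removed = matched - 3     # 3 sequences matched more than once
after_similarity = cb554 - unique_removed
cb396 = after_similarity - 21    # 21 lacked full DSSP definitions
```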

The process of deriving CB396 revealed homologies in the RS126 set, with 11 pairs of proteins showing sequence similarity within the set. These pairs are summarised in Table 1, which shows each pair to have the same fold according to SCOP[47]. For example, 4cms[48] and 5er2e[49] are both present in the RS126 set, yet have an SD score of 15.9. Both proteins are acid proteases with an all-$\beta$, closed-barrel structure.

Although not applied in this paper, three further non-redundant datasets suitable for cross-validated training and testing of secondary structure prediction methods were generated. The CB396 and RS126 sequence sets were combined. One member of each of the 11 pairs that had an SD score of $\ge$ 5 was removed from the RS126 set. Since 2pcy and 1lhb matched more than one protein in this subset, this left 9 unique homologues (1mcpl[50], 1tgsi[51], 2lhb[52], 2pcy[53], 3ebx[54], 4cms[48], 4cpv[55], 5hvpa[56], 8abp[57]) that were removed from RS126. The remaining RS126 sequences, added to CB396, gave CB513. Protein chains of $\le$ 30 residues often do not have well-defined secondary structure, so the CB497 set was constructed by removing from CB513 the 16 domains of $\le$ 30 residues.
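The sizes of the combined sets follow directly from the counts given above. In this worked check, the notation $|\cdot|$ for the number of sequences in a set is introduced only for the example:

```latex
\begin{align*}
|\mathrm{RS126}| - 9 &= 126 - 9 = 117 \\
|\mathrm{CB513}| &= |\mathrm{CB396}| + 117 = 396 + 117 = 513 \\
|\mathrm{CB497}| &= |\mathrm{CB513}| - 16 = 513 - 16 = 497
\end{align*}
```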

The 5 SD cutoff used to derive the sets CB396, CB497 and CB513 is more stringent than the scores used in previous studies of secondary structure prediction. However, although the SD score is a good measure of pairwise sequence similarity, it still will not identify all known homologues within the data set. In the SCOP[47] classification of protein structure, superfamilies are defined from careful analysis of structure, evolution and function. The SCOP superfamilies contain protein domains that have the same fold and are likely to have evolved from a common ancestor. Accordingly, we derived a further dataset from an analysis of all domains in SCOP_1.37. We took a representative domain from each superfamily and screened out multi-segment domains, NMR structures and those with a resolution $\ge$ 2.5 Angstroms to give the CB251 dataset.

All datasets, including secondary structure definitions and automatically generated multiple sequence alignments, will be distributed via http://barton.ebi.ac.uk/.


james@ebi.ac.uk