Rost & Sander (1993) selected 126 proteins with which to train and test secondary structure prediction algorithms[27]. They defined non-redundancy to mean that no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues. Unfortunately, as shown below, the RS126 set contains pairs of proteins that are clearly sequence similar when compared by more sophisticated methods than percentage identity.
Percentage identity has long been known to be a poor measure of sequence similarity, particularly for values below 30%. Percentage identity is dependent upon both the length of the alignment[42] and the composition of the sequences. Thus, two sequences of similar unusual amino acid composition may give high values of percentage identity, even when unrelated.
Brenner et al. (1998) recently quantified the deficiencies of percentage identity as a score for protein sequence database searches[2]. Even with the length correction suggested by Sander & Schneider (1991)[42], percentage identity performed significantly worse than measures that consider conservative substitutions as well as identities and that attempt corrections for length and composition.
Fortunately, techniques exist that overcome the deficiencies of
percentage identity or other simple measures of sequence similarity.
A long established method[43,6]
to measure the similarity between two protein sequences A and B is
first to align the proteins by a standard dynamic programming
algorithm (e.g. Needleman & Wunsch (1970)[44]) and obtain
the score for the alignment V. The order of amino acids in each
protein sequence is then randomised and a dynamic programming
alignment of the randomised sequences performed. This process is
repeated typically 100 or more times and the mean $\bar{x}$ and
standard deviation $\sigma$ of the scores for comparison of the
randomised sequences is calculated. The SD score, or Z score for
comparison of the native sequences is given by:
$$\mathrm{SD} = \frac{V - \bar{x}}{\sigma}$$
Unlike percentage identity, the SD score
corrects for bias due to the length and composition of the
sequences. Accordingly, we use SD scores to derive our non-redundant
test set of protein sequences.
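The shuffle-and-realign procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the AMPS implementation: the function names, the match/mismatch/linear-gap scoring scheme, the fixed random seed and the use of the sample standard deviation are all assumptions made for the example, whereas a real analysis would use a substitution matrix such as BLOSUM62.

```python
import random
import statistics


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not those of AMPS.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def shuffled(seq, rng):
    """Return seq with its residue order randomised (composition preserved)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def sd_score(a, b, n_shuffles=100, seed=0):
    """SD (Z) score: (native score - mean random score) / SD of random scores."""
    rng = random.Random(seed)
    native = global_alignment_score(a, b)
    random_scores = [
        global_alignment_score(shuffled(a, rng), shuffled(b, rng))
        for _ in range(n_shuffles)
    ]
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample SD; an assumption
    if sd == 0:
        raise ValueError("randomised scores have zero variance")
    return (native - mean) / sd
```

Because shuffling preserves both the length and the composition of each sequence, the randomised scores provide a background against which the native score is judged, which is the source of the correction for length and composition.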
PHD[27], NNSSP[30], DSC[29], and PREDATOR[28] have been trained on the Rost & Sander set of 126 proteins. The release versions of PREDATOR and NNSSP available for this analysis were trained on larger sets that included the 126 proteins. In principle, this should give PREDATOR and NNSSP an advantage over PHD.
The sequences in the test set developed here came from the
3Dee[45] database of structural domain definitions. In 3Dee,
a non-redundant sequence set was created by the use of a sensitive
sequence comparison algorithm and cluster analysis, rather than a
simple percentage identity cutoff. This provided a set of 1233
domains where no pair shared obvious sequence similarity. The new test set
was derived from these domains by first removing multi-segment
domains, to reduce the set size from 1233 to 988 sequences. The
sequences were then filtered to retain only X-ray crystal structures
with resolutions of $\leq$ 2.5 Angstroms. This left a representative
set of 554 domain sequences, referred to as CB554.
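The two filtering steps applied to the 3Dee domains might be expressed as follows. The `Domain` record, its field names and the `filter_domains` function are hypothetical and exist only for this sketch; the resolution cutoff is assumed to be inclusive (2.5 Angstroms or better).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    """Hypothetical record for one structural domain."""
    name: str
    n_segments: int               # number of chain segments in the domain
    method: str                   # e.g. "X-ray" or "NMR"
    resolution: Optional[float]   # Angstroms; None if not applicable


def filter_domains(domains, max_resolution=2.5):
    """Drop multi-segment domains, then keep well-resolved X-ray structures."""
    single_segment = [d for d in domains if d.n_segments == 1]
    return [
        d for d in single_segment
        if d.method == "X-ray"
        and d.resolution is not None
        and d.resolution <= max_resolution  # inclusive cutoff is an assumption
    ]
```

In the text, the first step reduces 1233 domains to 988 and the second leaves the 554 domains of CB554.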
To ensure that the CB554 domain set had no sequence similarity to the
RS126 set, the two sets were combined and all pairs of sequences
compared by AMPS[6] with the BLOSUM62 matrix and a gap
penalty of 10. Alignments with an SD score of $\geq$ 5 were regarded
as sequence similar[6,46]. According to this
stringent definition of similarity, there were 11 sequence-similar
pairs within the RS126 protein set, 119 pairs between CB554 domain set
and RS126, and 21 pairs within CB554. Thus, there were 140 sequences
in CB554 that matched either a sequence in CB554, or in the RS126
protein set. Of the 140, 3 sequences matched more than once, leaving
137 unique sequences. The 137 sequences were removed from
CB554, leaving 417 sequences that were not sequence similar either
to any other sequence within the set of 417 or to any sequence in the
RS126 set. Of the 417 domain sequences remaining, 21 that did not have
'full DSSP definitions' (i.e. those with more than 9 consecutive
residues with incomplete backbones, for which DSSP[38] does not define
a state) were also removed, leaving a test set of 396 proteins (CB396).
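The removal of sequence-similar entries can be illustrated with a short sketch. It assumes that pairwise SD scores have already been computed and that a pair counts as similar when its score is at least 5. The function name, the input format and the rule for choosing which member of a within-set pair to discard (the second one listed) are assumptions made for illustration, since the text does not specify them.

```python
def remove_similar(candidates, reference, scored_pairs, cutoff=5.0):
    """Return the candidates left after removing sequence-similar entries.

    candidates, reference: disjoint collections of sequence identifiers.
    scored_pairs: iterable of (id_a, id_b, sd_score) tuples.
    """
    candidates = set(candidates)
    reference = set(reference)
    hits = [(a, b) for a, b, score in scored_pairs if score >= cutoff]

    removed = set()
    # Any candidate similar to a reference sequence is removed.
    for a, b in hits:
        if a in candidates and b in reference:
            removed.add(a)
        elif b in candidates and a in reference:
            removed.add(b)
    # For pairs within the candidate set, removing one member is enough.
    for a, b in hits:
        if a in candidates and b in candidates:
            if a not in removed and b not in removed:
                removed.add(b)  # which member to drop is an assumption
    return candidates - removed
```

The counts reported in the text follow the same bookkeeping: 119 + 21 = 140 matches, 140 - 3 = 137 unique sequences, 554 - 137 = 417 remaining, and 417 - 21 = 396 after the DSSP check.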
The process of deriving CB396 revealed homologies in the RS126 set,
with 11 proteins showing sequence similarity to at least one other
within the RS126 set.
These pairs are summarised in Table 1, which shows each
pair to have the same fold according to SCOP[47].
For example, 4cms[48] and 5er2e[49] are present in the
RS126 set, yet have an SD score of 15.9. Both proteins are acid
proteases with an all-$\beta$,
closed barrel structure.
Although not applied in this paper, three further non-redundant
datasets suitable for cross-validated training and testing of
secondary structure prediction methods were generated. The CB396 and
RS126 sequence sets were combined. One member of each of the 11 pairs
that had an SD score of $\geq$ 5 was removed from the RS126 set. Since
2pcy and 1lhb matched more than one protein in this subset, this left
9 unique homologues (1mcpl[50], 1tgsi[51],
2lhb[52], 2pcy[53], 3ebx[54], 4cms[48],
4cpv[55], 5hvpa[56], 8abp[57]) that were removed
from RS126. The reduced RS126 set added to CB396 gave CB513. Protein
chains of $<$ 30 residues often do not have well defined secondary
structure. The CB497 set was constructed by removing the 16
domains of $<$ 30 residues from CB513.
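As a check on the bookkeeping, the set sizes quoted in this paragraph are mutually consistent. The intermediate figure of 117 is not stated in the text; it is simply the 126 RS126 proteins less the 9 removed homologues.

```latex
\begin{align*}
126 - 9 &= 117 && \text{RS126 proteins retained} \\
396 + 117 &= 513 && \text{CB513} \\
513 - 16 &= 497 && \text{CB497}
\end{align*}
```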
The 5SD cutoff used to derive the sets CB396, CB497 and CB513 is more
stringent than scores used in previous studies of secondary structure
prediction. However, although the SD score is a good measure of
pairwise sequence similarity, it still will not identify all known
homologues within the data set. In the SCOP[47] classification of
protein structure, superfamilies are defined from careful
analysis of structure, evolution and function. The SCOP
superfamilies contain protein domains that have the same fold and are
likely to have evolved from a common ancestor. Accordingly, we
derived a further dataset from an analysis of all domains in SCOP_1.37.
We took a representative domain from each superfamily and
screened out multi-segment domains, NMR structures and those with a
resolution $>$ 2.5 Angstroms to give the CB251 dataset.
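Selecting one domain per superfamily amounts to a simple grouping step, sketched below. The text does not say how the representative was chosen, so taking the first domain encountered is purely an assumption, as are the function name and the input format; the subsequent screening by segment count, experimental method and resolution is not shown.

```python
def one_per_superfamily(domains):
    """Pick one representative domain per superfamily.

    domains: iterable of (domain_id, superfamily_id) tuples.
    The first domain seen for each superfamily is kept (an assumption).
    """
    representatives = {}
    for domain_id, superfamily_id in domains:
        representatives.setdefault(superfamily_id, domain_id)
    return list(representatives.values())
```

Because superfamily membership in SCOP is assigned from structure, evolution and function rather than from a pairwise score, this grouping can separate homologues that an SD score cutoff alone would miss.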
All datasets, including secondary structure definitions and automatically generated multiple sequence alignments, will be distributed via http://barton.ebi.ac.uk/.