Analysis of the test and training alignments

Table 4 summarises an analysis of the automatic multiple sequence alignments that were generated for the RS126 and CB396 sets. The two sets have a similar average sequence length and a similar average percentage identity within the set. However, the average number of sequences per alignment differs significantly between the two sets, even though both sets of alignments were generated using the same method. On average, each alignment in the older RS126 protein set contains significantly (1.6 times) more sequence-similar proteins. The distribution of the number of sequences in the RS126 protein set was not biased by one or two large families.
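The per-set quantities discussed above (average sequence length, average percentage identity and average number of sequences per alignment) can be illustrated with a minimal Python sketch. This is not the procedure used to produce Table 4. The function names, the toy alignments and the definition of percentage identity (identical residues over columns in which neither sequence has a gap) are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean

GAP = "-"  # assumed gap character


def pairwise_identity(a, b):
    """Percentage identity over columns where neither sequence has a gap.

    This definition is an assumption; other normalisations are in use.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    paired = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not paired:
        return 0.0
    matches = sum(1 for x, y in paired if x == y)
    return 100.0 * matches / len(paired)


def summarise_alignment(rows):
    """Summary statistics for one alignment, given as equal-length strings."""
    lengths = [len(r.replace(GAP, "")) for r in rows]  # ungapped lengths
    idents = [pairwise_identity(a, b) for a, b in combinations(rows, 2)]
    return {
        "n_sequences": len(rows),
        "mean_length": mean(lengths),
        # A single-sequence alignment has no pairs; treat it as 100% by convention.
        "mean_identity": mean(idents) if idents else 100.0,
    }


def summarise_set(alignments):
    """Average each per-alignment statistic over a whole set of alignments."""
    stats = [summarise_alignment(a) for a in alignments]
    return {key: mean(s[key] for s in stats) for key in stats[0]}


# Hypothetical toy data, not taken from RS126 or CB396.
toy_set = [
    ["ACDE", "ACDF", "AC-E"],
    ["GHIK", "GHIR"],
]
print(summarise_set(toy_set))
```

Running `summarise_set` on each set and comparing the `n_sequences` entries gives the kind of ratio quoted in the text. To check whether one or two large families dominate, the per-alignment `n_sequences` values would have to be inspected rather than only their mean.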

A comparison of the CB396 and RS126 sets showed the same distribution in both. The difference therefore arises because the sequence families in the RS126 set are, on average, larger than those in the CB396 set. This observation may simply reflect the fact that RS126 was derived from protein families whose first known members were characterised longer ago.

To verify that there was no bias towards a particular structural class, the SCOP [47] classifications of the proteins in the two sets were examined, as shown in Table 5. There is a higher proportion of small proteins in the RS126 protein set (14% against 7%), while the CB396 protein set has a higher proportion of $\alpha+\beta$ proteins (26% against 13%). However, the overall composition within each of the two sets is balanced.
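A class-composition comparison of this kind can be sketched as follows. The sketch assumes that each protein has already been assigned a single class label. The function names and the labels in the example are hypothetical and do not reproduce the data in Table 5.

```python
from collections import Counter


def class_proportions(labels):
    """Percentage of proteins in each structural class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}


def largest_differences(set_a, set_b):
    """Classes ordered by the absolute difference in percentage (a minus b)."""
    pa, pb = class_proportions(set_a), class_proportions(set_b)
    classes = set(pa) | set(pb)
    diffs = {c: pa.get(c, 0.0) - pb.get(c, 0.0) for c in classes}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical labels for illustration only.
example_a = ["alpha", "alpha", "beta", "small"]
example_b = ["alpha", "beta", "beta", "beta"]
print(largest_differences(example_a, example_b))
```

Sorting by absolute difference brings the classes that differ most between the two sets to the top of the list, which is where differences such as those quoted for small and $\alpha+\beta$ proteins would appear.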

james@ebi.ac.uk