Generating the multiple sequence alignments

With the exception of PREDATOR[58,28], all methods considered here required a multiple sequence alignment as input, whereas PREDATOR required only the multiple sequences in an unaligned format. To simplify the generation of multiple sequence alignments for large numbers of proteins, we developed an automatic procedure for this study.

We first perform a BLAST[59] search of the OWL v29.4 database, which contains 198,742 entries[60]. The BLAST output is then screened by SCANPS, an implementation of the Smith-Waterman dynamic programming algorithm[61,62] with length-dependent statistics. Sequences are rejected if their SCANPS probability score is higher than $1 \times 10^{-4}$. A length cutoff of 1.5 is also applied: for example, if the query sequence is 90 residues long, a sequence must be between 60 and 135 residues long to be included. Sequences that exceed the upper limit are truncated by removing end residues until their length satisfies the cutoff, and sequences that fall short of the lower limit are discarded. The value of 1.5 was chosen by visual inspection of a number of multiple sequence alignments produced with different cutoff values. The method removes sequences that are unrelated or too short and trims those that are excessively long, so related sequences that are longer than the query can still be included after truncation. The sequence-similar proteins selected by this method are then aligned by CLUSTALW (version 1.7)[63] with default parameters.
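The screening rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure's actual code: the `Hit` record, its `probability` field and the function names are hypothetical, and because the text does not say which end residues are removed, the sketch assumes truncation from the C-terminal end only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Thresholds taken from the text; everything else is illustrative.
MAX_PROBABILITY = 1e-4  # hits with a higher SCANPS probability are rejected
LENGTH_CUTOFF = 1.5     # permitted length ratio relative to the query


@dataclass
class Hit:
    """Hypothetical record for one database hit."""
    name: str
    sequence: str
    probability: float  # assumed to hold the SCANPS probability score


def length_bounds(query_length: int,
                  cutoff: float = LENGTH_CUTOFF) -> Tuple[float, float]:
    """Return the (shortest, longest) permitted hit length for a query."""
    return query_length / cutoff, query_length * cutoff


def filter_hit(hit: Hit, query_length: int) -> Optional[Hit]:
    """Return None if the hit is rejected, else the (possibly truncated) hit."""
    if hit.probability > MAX_PROBABILITY:
        return None  # not significantly related to the query
    shortest, longest = length_bounds(query_length)
    if len(hit.sequence) < shortest:
        return None  # too short: discarded
    if len(hit.sequence) > longest:
        # Assumption: residues are trimmed from the C-terminal end only.
        return Hit(hit.name, hit.sequence[: int(longest)], hit.probability)
    return hit


def filter_hits(hits: List[Hit], query_length: int) -> List[Hit]:
    """Apply filter_hit to every hit and keep the survivors, in order."""
    kept = (filter_hit(hit, query_length) for hit in hits)
    return [hit for hit in kept if hit is not None]
```

For a 90-residue query, `length_bounds(90)` gives the 60 to 135 residue window quoted in the text; a related 200-residue hit would be kept but cut back to 135 residues.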

The multiple sequence alignments are modified so that they contain no gaps in the first, or 'query', sequence, since with the current algorithms gaps in the first sequence tend to reduce the accuracy of the prediction or cause the program to fail to execute (NNSSP[30]). A slightly different method is used for PHD[64], whereby only gaps at the end of the target sequence are removed. Without this modification, the conversion of MSF to HSSP file format fails, as a correct insertion table is not constructed.
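One simple way to obtain an alignment with no gaps in the query is to drop every column in which the query has a gap. The sketch below assumes that this is what the modification amounts to in the general case; the function name and the choice of `-` and `.` as gap symbols are assumptions, and the PHD variant is not covered.

```python
from typing import List

GAP_CHARS = "-."  # assumption: symbols that denote a gap in the alignment


def remove_query_gap_columns(alignment: List[str]) -> List[str]:
    """Drop every column in which the first (query) sequence has a gap.

    `alignment` is a list of equal-length aligned strings, query first.
    """
    if not alignment:
        return []
    query = alignment[0]
    if any(len(row) != len(query) for row in alignment):
        raise ValueError("aligned sequences must all have the same length")
    # Indices of the columns where the query has a residue.
    keep = [i for i, ch in enumerate(query) if ch not in GAP_CHARS]
    return ["".join(row[i] for i in keep) for row in alignment]


if __name__ == "__main__":
    for row in remove_query_gap_columns(["AC-DE", "ACGDE", "A-GD-"]):
        print(row)
```

After this step the query row is identical to the ungapped query sequence, so each remaining column corresponds to exactly one query residue. Gaps in the other sequences are left in place.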

The reference secondary structure for each domain was defined by DSSP[38], STRIDE[40] and DEFINE[39]. All definitions were reduced to three-state models, as follows:

1. DSSP: H and G to H, E and B to E, all other states to C
2. STRIDE: H and G to H, E and b to E, all other states to C
3. DEFINE: H and G to H, E to E, all other states to C

Where H is $\alpha$-helix, G is ${\rm 3_{10}}$-helix, B and b are isolated $\beta$-bridge and E is $\beta$-strand.
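The three reductions listed above can be written as lookup tables. In this minimal sketch the table and function names are hypothetical, and each definition is assumed to be a string with one state character per residue; any character not listed for a method maps to C, following the rules as stated.

```python
# Three-state reduction tables, transcribed from the list in the text.
REDUCTIONS = {
    "DSSP":   {"H": "H", "G": "H", "E": "E", "B": "E"},
    "STRIDE": {"H": "H", "G": "H", "E": "E", "b": "E"},
    "DEFINE": {"H": "H", "G": "H", "E": "E"},
}


def reduce_states(definition: str, method: str) -> str:
    """Reduce a per-residue state string to the three states H, E and C."""
    table = REDUCTIONS[method]
    # Any state without an explicit rule falls through to coil (C).
    return "".join(table.get(state, "C") for state in definition)


if __name__ == "__main__":
    print(reduce_states("HHGGTTEEBS", "DSSP"))
```

Keeping the rules in a table rather than in branching code makes it straightforward to swap in an alternative reduction scheme for comparison.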

The effect of alternative reduction methods for the DSSP algorithm is discussed in the results section.

james@ebi.ac.uk