Alignments

A multiple sequence alignment was constructed for each of the 480 training-set sequences. For comparison, both BLAST and PSIBLAST were used to search the SWALL [43] non-redundant protein sequence database, with a p-value cutoff of 0.0001. For PSIBLAST, 3 iterations of the database search were run. The sequences found were then used to generate multiple sequence alignments with the method described previously [34]. To compare the effect of different multiple sequence alignment methods, both AMPS [42] and CLUSTALW [44] were used. CLUSTALW was executed with default parameters, while AMPS was run with the BLOSUM62 matrix and a gap penalty of 10.

The alignments were represented as profiles for input to the neural network and the profiles were scored in three ways:

1. As frequency counts for each amino acid down a column in the alignment, expressed as a percentage of the total for that column. This is the same approach as used by the PHD algorithm [17].
2. As BLOSUM62 scores: each residue in an alignment column was scored by its corresponding BLOSUM62 matrix score, and the scores were then averaged over the number of sequences in that column, as in (1). Residues therefore no longer carry equal weight; each is weighted by its mutation score.
3. As a position-specific profile generated by the HMMER2 [45] package. The multiple sequence alignment is represented as a profile HMM [46,47], with position-specific scores for the amino acids in the alignment.
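
The first two scoring schemes can be illustrated with a minimal Python sketch. It is not the code used in this work. The function names, the decision to ignore gaps and non-standard residues when counting, and the reading of scheme (2) as "for each amino acid, the mean matrix score against the residues in the column" are all assumptions. The substitution matrix is passed in as a dictionary; the toy matrix below (+4 for identity, -1 otherwise) is a hypothetical stand-in, not BLOSUM62.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def columns(alignment):
    """Yield each alignment column as a list of standard residues.

    Gaps and non-standard characters are skipped (an assumption).
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    for i in range(length):
        column = (row[i].upper() for row in alignment)
        yield [c for c in column if c in AMINO_ACIDS]


def frequency_profile(alignment):
    """Scheme 1: percentage of each amino acid in each column."""
    profile = []
    for residues in columns(alignment):
        counts = Counter(residues)
        total = len(residues)
        profile.append(
            {aa: (100.0 * counts[aa] / total if total else 0.0)
             for aa in AMINO_ACIDS}
        )
    return profile


def matrix_profile(alignment, matrix):
    """Scheme 2 (as interpreted here): mean substitution score of each
    amino acid against the residues observed in the column."""
    profile = []
    for residues in columns(alignment):
        total = len(residues)
        profile.append(
            {aa: (sum(matrix[(aa, r)] for r in residues) / total
                  if total else 0.0)
             for aa in AMINO_ACIDS}
        )
    return profile


# Hypothetical toy matrix, used only to keep the example self-contained.
toy_matrix = {(a, b): (4 if a == b else -1)
              for a in AMINO_ACIDS for b in AMINO_ACIDS}

example = ["ACD", "AC-", "GCD"]
freq = frequency_profile(example)
scored = matrix_profile(example, toy_matrix)
print(round(freq[0]["A"], 2), round(scored[0]["A"], 2))
```

In the first column of the example (A, A, G), alanine accounts for two of three residues under scheme 1, while under scheme 2 its score is the mean of the matrix scores A-A, A-A and A-G. With a real BLOSUM62 matrix, similar residues would contribute intermediate scores, not a flat mismatch penalty.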

Figure 2 summarises an attempt to improve the alignments obtained from PSIBLAST by post-processing the result of the PSIBLAST search. As shown in Figure 2, full-length sequences were taken from the PSIBLAST search. The alignment was then constructed by making successive global alignments to the profile, adding sequences in the order determined by the p-value scores from the initial PSIBLAST sequence search. At each iteration the ends of the alignment were trimmed, to force the global alignment method to represent the query sequence.
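
Two of the bookkeeping steps in this procedure, ordering the hits and trimming the alignment ends, can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used here: the function names are hypothetical, hits are assumed to arrive as a mapping from identifier to p-value, and "trimming" is assumed to mean discarding columns that lie before the first or after the last residue of the query. The global profile alignment step itself is not shown.

```python
GAP = "-"


def order_by_pvalue(hits):
    """Return hit identifiers sorted by ascending p-value, so the most
    significant hits are added to the alignment first."""
    return [name for name, _ in sorted(hits.items(), key=lambda kv: kv[1])]


def trim_to_query(alignment, query_index=0):
    """Remove leading and trailing columns that fall outside the query.

    `alignment` is a list of equal-length aligned strings; the row at
    `query_index` is the query sequence.
    """
    query = alignment[query_index]
    positions = [i for i, c in enumerate(query) if c != GAP]
    if not positions:
        raise ValueError("query row contains only gaps")
    start, end = positions[0], positions[-1] + 1
    return [row[start:end] for row in alignment]


# Hypothetical hits and a small alignment with overhanging ends.
hits = {"seqA": 1e-5, "seqB": 1e-30, "seqC": 1e-12}
print(order_by_pvalue(hits))

aligned = ["--ACDE-", "MKACDEF", "-KAC-EF"]
print(trim_to_query(aligned))
```

Internal gaps in the query are left untouched; only the overhanging ends contributed by longer database sequences are removed, so every remaining column lies within the span of the query.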

In addition to the method summarised in Figure 2, each of the PSIBLAST alignments was also represented by the profiles in the PSIBLAST report file. Two profiles were extracted: the simple frequency counts (denoted in the PSIBLAST report as position characters, multiplied by 10 and rounded), and the profile denoted as the position-based scoring matrix.

James Cuff
2001-06-29