Improving the consensus prediction

In order to establish the upper limit of accuracy possible by combining the prediction methods, we took the most accurate prediction for each residue in the RS126 data set by PHD, DSC, NNSSP or PREDATOR. This gave the theoretical best accuracy for a combination of these methods of Q₃ = 78%.
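The upper bound can be illustrated with a minimal sketch. It assumes that "taking the most accurate prediction for each residue" means a residue counts as correct whenever at least one method predicts its observed state. The function names and the toy sequences are hypothetical and are not taken from the RS126 data.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def oracle_q3(predictions, observed):
    """Upper bound: a residue is correct if any single method got it right."""
    correct = sum(
        any(states[i] == state for states in predictions.values())
        for i, state in enumerate(observed)
    )
    return 100.0 * correct / len(observed)


# Toy data (invented for illustration): H = helix, E = strand, C = coil.
observed = "HHHHCCEEEC"
toy_predictions = {
    "PHD":      "HHHCCEEEEC",
    "DSC":      "HHHHCECCEC",
    "NNSSP":    "HHCCCEEEEE",
    "PREDATOR": "CHHHCEEECC",
}

per_method = {name: q3(states, observed) for name, states in toy_predictions.items()}
upper_bound = oracle_q3(toy_predictions, observed)
```

In this toy case every method misses the sixth residue, so the bound stays below 100% even though the methods cover each other's errors elsewhere.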

We investigated a variety of techniques for combining the prediction methods in an attempt to raise the average Q₃ on RS126 from 74.8% towards 78%. All possible combinations of methods were tried when calculating the consensus, but none improved upon the average Q₃ of the consensus of DSC, PREDATOR, NNSSP and PHD, with the PHD prediction taken if there was a tie. However, the next highest combination was only 0.3% worse at 74.2%; it used the NNSSP, PREDATOR and DSC predictions, relying on PREDATOR's prediction if there was no consensus. Experiments with filtering single-residue helix predictions and other unlikely secondary structures did not improve the overall Q₃.
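A per-residue majority vote with a designated tie-breaking method might look like the sketch below. The function name `consensus` and the toy predictions are hypothetical; only the rule itself (majority wins, one named method decides ties) comes from the text.

```python
from collections import Counter


def consensus(predictions, tie_breaker):
    """Majority vote per residue; on a tie, use the tie_breaker's prediction."""
    length = len(predictions[tie_breaker])
    if any(len(states) != length for states in predictions.values()):
        raise ValueError("all predictions must have the same length")
    result = []
    for i in range(length):
        votes = Counter(states[i] for states in predictions.values())
        ranked = votes.most_common()
        winner, top_count = ranked[0]
        # A tie exists when the runner-up has as many votes as the leader.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            winner = predictions[tie_breaker][i]
        result.append(winner)
    return "".join(result)


# Toy data (invented): residues 2 and 4 are split two votes against two.
vote_predictions = {
    "PHD":      "HHEEC",
    "DSC":      "HHECC",
    "NNSSP":    "HEEEC",
    "PREDATOR": "CECCC",
}
phd_tiebreak = consensus(vote_predictions, tie_breaker="PHD")
predator_tiebreak = consensus(vote_predictions, tie_breaker="PREDATOR")
```

Changing the tie-breaker only affects the residues where the vote is split, which is why the choice of fallback method can shift the average Q₃ by a fraction of a percent.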

The reliability information from the PHD and PREDATOR predictions was also investigated. When a method predicted a residue with a reliability greater than 7, that prediction was taken. No further increase in average Q₃ accuracy could be achieved using this approach.
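The reliability rule could be sketched as an override applied on top of an existing consensus string. Several details here are assumptions, since the text does not specify them: that the two methods report scores on a comparable integer scale, that the more confident method wins when both exceed the threshold, and that the earlier-listed method wins an exact tie. The function name and toy values are hypothetical.

```python
def apply_reliability_override(consensus_states, scored_methods, threshold=7):
    """Replace consensus states where a method's reliability exceeds threshold.

    scored_methods is a list of (states, reliabilities) pairs. If several
    methods exceed the threshold at a residue, the highest score wins
    (assumption); an exact tie keeps the earlier method in the list.
    """
    result = list(consensus_states)
    for i in range(len(result)):
        best = None  # (score, state) of the most confident qualifying method
        for states, reliabilities in scored_methods:
            score = reliabilities[i]
            if score > threshold and (best is None or score > best[0]):
                best = (score, states[i])
        if best is not None:
            result[i] = best[1]
    return "".join(result)


# Toy data (invented for illustration).
base_consensus = "HHEEC"
phd_scored = ("HHCEC", [9, 5, 8, 3, 9])
predator_scored = ("HEEEE", [2, 9, 9, 4, 7])
overridden = apply_reliability_override(base_consensus, [phd_scored, predator_scored])
```

Note that a score of exactly 7 does not trigger the override, matching the "greater than 7" wording.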

The predictions for each method were weighted by adding constants. All combinations of all values from 1 to 10 were applied to the predictions of each method. The consensus was then calculated in the same manner as before, but using the weighted predictions. The optimal weighting scheme was 2,1,2,2, in which PREDATOR was down-weighted by one point. The Q₃ accuracy for this approach was no higher than that of the unweighted majority-wins method.
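A weighted vote and an exhaustive search over weight combinations can be sketched as follows. This is an illustration under assumptions, not the procedure actually used: the weights are taken to be per-method vote strengths, and a tie that the tie-breaking method cannot resolve falls back to the first tied state in the order H, E, C. All names and the toy data are hypothetical.

```python
from itertools import product

STATES = "HEC"


def weighted_consensus(predictions, weights, tie_breaker):
    """Per-residue vote in which each method contributes its weight."""
    result = []
    for i in range(len(predictions[tie_breaker])):
        totals = dict.fromkeys(STATES, 0)
        for method, states in predictions.items():
            totals[states[i]] += weights[method]
        best = max(totals.values())
        winners = [s for s in STATES if totals[s] == best]
        fallback = predictions[tie_breaker][i]
        # Use the tie-breaker's state if it is among the top-scoring states,
        # otherwise the first top-scoring state (assumption).
        result.append(fallback if fallback in winners else winners[0])
    return "".join(result)


def grid_search(predictions, observed, tie_breaker, values=range(1, 11)):
    """Try every combination of weights and keep the first best-scoring one."""
    methods = list(predictions)
    best_weights, best_score = None, -1.0
    for combo in product(values, repeat=len(methods)):
        weights = dict(zip(methods, combo))
        predicted = weighted_consensus(predictions, weights, tie_breaker)
        score = sum(p == o for p, o in zip(predicted, observed)) / len(observed)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Toy data (invented for illustration).
weighted_predictions = {
    "PHD":      "HHEEC",
    "DSC":      "HHECC",
    "NNSSP":    "HEEEC",
    "PREDATOR": "CECCC",
}
down_weighted = weighted_consensus(
    weighted_predictions, {"PHD": 2, "DSC": 2, "NNSSP": 2, "PREDATOR": 1}, "PHD"
)
up_weighted = weighted_consensus(
    weighted_predictions, {"PHD": 1, "DSC": 1, "NNSSP": 1, "PREDATOR": 3}, "PHD"
)
found_weights, found_score = grid_search(
    weighted_predictions, "HHEEC", "PHD", values=range(1, 4)
)
```

With four methods and ten values each, the full search covers 10,000 weight combinations, which is small enough to enumerate directly.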

An artificial neural network with 9 hidden nodes was trained on the output from the NNSSP, PHD, DSC and PREDATOR methods. A 17-residue window was used. The inputs were coded as binary, with 001, 010 and 100 representing the helix, strand and coil states respectively. Seven-fold cross-validation was performed. This yielded an accuracy of 73.2% for the 126-protein set, which was still lower than that of the simple consensus approach. No further improvement in accuracy was seen on changing the free parameters of the network, for example the number of hidden nodes or the number of training epochs. The target sequence was also included in the input layer, but this too proved unsuccessful. We suggest that better accuracies may be achieved if propensities for the different states are used rather than the binary input, and this idea forms the basis of future work.
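The binary input coding for one window can be sketched as below. The three-bit codes are those given in the text. The ordering of features within the window, the all-zero padding for positions beyond the sequence ends, and the function name `encode_window` are assumptions made for illustration.

```python
# Three-bit codes from the text: helix 001, strand 010, coil 100.
CODES = {"H": (0, 0, 1), "E": (0, 1, 0), "C": (1, 0, 0)}
PAD = (0, 0, 0)  # assumption: positions outside the sequence are all zeros


def encode_window(predictions, centre, window=17):
    """Flatten the predictions of every method over a window around centre.

    predictions is a list of equal-length state strings, one per method.
    The result has window * len(predictions) * 3 binary values.
    """
    half = window // 2
    length = len(predictions[0])
    features = []
    for pos in range(centre - half, centre + half + 1):
        for states in predictions:
            if 0 <= pos < length:
                features.extend(CODES[states[pos]])
            else:
                features.extend(PAD)
    return features


# Toy data (invented): four methods, twenty residues each.
network_inputs = ["H" * 20, "E" * 20, "C" * 20, "H" * 20]
edge_vector = encode_window(network_inputs, centre=0)
middle_vector = encode_window(network_inputs, centre=10)
```

With four methods and a 17-residue window this coding gives 17 x 4 x 3 = 204 binary inputs per residue. Replacing each three-bit code with per-state propensities, as the text suggests, would keep the same input size but carry graded rather than all-or-nothing information.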

When the lower-accuracy predictions from ZPRED and MULPRED were included, the overall accuracy of the consensus method was reduced. SIMPA, SOPM and GORIV were not included at any stage in the consensus method. Further work aims to discover whether the single sequence prediction methods can be incorporated into a more accurate consensus method.

james@ebi.ac.uk