[Dundee University]
OB-Score and ParCrys XANNpred
THE BARTON GROUP

Introduction

This page provides more information about the SSPF Crystallisation Predictors (XANNpred, ParCrys & OB-Score). We envisage that prediction of crystallisation propensity would be particularly useful for selecting protein targets for structural studies by X-ray crystallography. This page was first written for people who use XANNpred, ParCrys and/or the OB-Score. Last updated 18/08/2009.

XANNpred Predictions

XANNpred is a pair of Artificial Neural Networks (XANNpred-PDB, XANNpred-SG) based on 428 features, including 20 amino acid and 400 dipeptide frequencies, sequence length, predicted secondary structure, transmembrane regions, protein disorder, isoelectric point, hydrophobicity and molecular weight. On the data examined, XANNpred-PDB and XANNpred-SG each outperform the other publicly available algorithms (XtalPred, PXS, ParCrys and OB-Score). Proteins with XANNpred-PDB or XANNpred-SG scores above 0.517 or 0.418, respectively, are predicted to be "likely to produce diffraction-quality crystals by current structural biology techniques". We suggest that the XANNpred-SG algorithm may be most applicable to "high-throughput" efforts (e.g. structural genomics consortia), while the XANNpred-PDB algorithm may be more relevant to the structural biology community as a whole. This is because XANNpred-PDB predictions are based on PDB data, while XANNpred-SG predictions are based on structural genomics data. The XANNpred sliding window plots show the XANNpred score against the centre position of a 61-residue sliding window over the input sequence. We suggest that these sliding window plots may be helpful for construct design. If you use XANNpred, please cite "Overton, van Niekerk & Barton (2009), XANNpred: Neural Nets to Predict Diffraction-quality Crystals (in preparation)". XANNpred predictions are accessible from here.
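As a rough illustration of the thresholds and the 61-residue sliding window described above, the following Python sketch enumerates window centre positions and applies the two stated cut-offs. It is not the XANNpred implementation: the neural networks are not reproduced, `score_fn` is a hypothetical stand-in for a trained scorer, and the 1-based centre numbering and the skipping of sequences shorter than the window are assumptions.

```python
from typing import Callable, List, Tuple

WINDOW = 61  # window length stated in the text
# Score cut-offs stated in the text for the two networks.
THRESHOLDS = {"XANNpred-PDB": 0.517, "XANNpred-SG": 0.418}


def sliding_windows(sequence: str, window: int = WINDOW) -> List[Tuple[int, str]]:
    """Return (centre_position, subsequence) pairs.

    Centre positions are 1-based (an assumption). Sequences shorter than
    the window yield an empty list.
    """
    half = window // 2
    return [
        (start + half + 1, sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


def is_predicted_crystallisable(score: float, network: str) -> bool:
    """Apply the stated cut-off: scores above the threshold are positive."""
    return score > THRESHOLDS[network]


def window_profile(
    sequence: str, score_fn: Callable[[str], float]
) -> List[Tuple[int, float]]:
    """Score every window with a caller-supplied (hypothetical) scorer."""
    return [(centre, score_fn(sub)) for centre, sub in sliding_windows(sequence)]
```

A profile produced this way could be plotted as score against centre position to look for high-scoring regions when choosing construct boundaries.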

ParCrys Predictions

ParCrys is a Parzen Window approach based on calculated isoelectric point, hydrophobicity and the frequencies of S, C, G, F, Y and M residues; see also Overton, Padovani, Girolami & Barton (2008). Bioinformatics 24:901-907. The result "Recalcitrant" indicates that the input sequence is predicted to be "Recalcitrant to Crystallisation"; the result "Amenable" indicates that the input sequence is predicted to be "Amenable to Crystallisation"; the result "High-scoring" indicates that the input sequence falls into the "High-scoring Crystallisation Propensity Prediction" class. The threshold for "High-scoring" predictions was derived by maximising the Matthews correlation coefficient over "real-world" data distributions (with an approximate 1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). The "Amenable" threshold was derived by optimising accuracy over balanced data distributions (i.e. a 1:1 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). If you use ParCrys, please cite the above reference. We find that ParCrys performs well over data from structural genomics consortia (e.g. TargetDB); see below.

OB-Score Predictions

The OB-Score is a Z-score scale based on calculated isoelectric point and hydrophobicity; also see Overton & Barton (2006). FEBS Lett. 580, 4005-4009

Positive OB-Score values indicate that the input is more similar to crystallised (PDB) sequences than the average (UniRef50) protein sequence in terms of calculated isoelectric point and hydrophobicity. Conversely, negative OB-Scores indicate that the input is less similar to known crystallised sequences than average in terms of isoelectric point and hydrophobicity. We found that an OB-Score threshold value of 0.809 maximised predictive accuracy over a balanced dataset from TargetDB. An OB-Score threshold to define a "High-scoring" class could be a value around 5 or 6 (see histogram below). However, the optimal Matthews correlation coefficient on the "real-world" data (with a 1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals") is found for an OB-Score threshold of 1.5. If you use OB-Score, please cite the above reference. OB-Score predictions are rapidly calculated and therefore easily produced over large datasets. The OB-Score algorithm and associated data are available for download from here.
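Because OB-Scores are cheap to compute over large datasets, a typical use is to rank candidate targets and flag those above a chosen threshold. The sketch below assumes the OB-Scores have already been calculated elsewhere. The function name `rank_targets` and the strict "greater than" comparison are illustrative assumptions; only the two threshold values come from the text.

```python
from typing import Dict, List, Tuple

BALANCED_THRESHOLD = 0.809   # maximised accuracy on balanced TargetDB data (from the text)
REAL_WORLD_THRESHOLD = 1.5   # optimal MCC on the approx. 1:8 "real-world" data (from the text)


def rank_targets(
    ob_scores: Dict[str, float], threshold: float = BALANCED_THRESHOLD
) -> List[Tuple[str, float, bool]]:
    """Sort targets by OB-Score, highest first, and flag those above threshold.

    `ob_scores` maps a target identifier to a precomputed OB-Score.
    """
    ranked = sorted(ob_scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score, score > threshold) for name, score in ranked]
```

Passing `threshold=REAL_WORLD_THRESHOLD` gives the stricter selection suited to the imbalanced distribution.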

Sequence Statistics (OB-Score and ParCrys)

The sequence statistics reported along with OB-Score and ParCrys results are calculated as follows: GRAVY (GRand AVerage of hydrophobicitY) is derived with the GES scale, pI (isoelectric point) is derived using BioPerl with EMBOSS-defined pKa values, and sequence length reflects the length of the cleaned input sequence. Sequence cleaning is done prior to all calculations and involves the removal of non-amino acid characters; the letters "B", "Z", "J", "O" and "U" are transformed to "X".
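The cleaning step and the GRAVY average can be sketched in Python as follows. This is an illustration rather than the server's code (which uses BioPerl): it assumes that "non-amino acid characters" means anything that is not a letter, that input is upper-cased, and that residues missing from the hydrophobicity scale (such as "X") are skipped when averaging. The GES values are not reproduced here; the caller supplies them as the `scale` argument.

```python
import re
from typing import Dict

# B, Z, J, O and U are mapped to X, as described in the text.
_TO_X = str.maketrans("BZJOU", "XXXXX")


def clean_sequence(raw: str) -> str:
    """Drop non-letter characters, upper-case, and map B/Z/J/O/U to X."""
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    return letters.translate(_TO_X)


def gravy(sequence: str, scale: Dict[str, float]) -> float:
    """Mean hydrophobicity over residues present in `scale`.

    Residues absent from the scale are ignored (an assumption).
    """
    values = [scale[aa] for aa in sequence if aa in scale]
    if not values:
        raise ValueError("no residues with a hydrophobicity value")
    return sum(values) / len(values)
```

The reported sequence length would then be `len(clean_sequence(raw))`.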

Evaluation of ParCrys and OB-Score (TargetDB Non-redundant Blind Test Data)

The graph below compares ParCrys and the OB-Score with other publicly available methods, SECRET and CRYSTALP. The graph shows a Receiver Operating Characteristic (ROC) curve for each method. ParCrys was found to outperform the other methods on the data examined. See below for additional explanation and further results. Over the TEST-RL dataset, ParCrys was found to have an accuracy of 79.1%, compared with the OB-Score (69.8%), SECRET (58.1%) and CRYSTALP (46.5%). TEST-RL comprises 86 non-redundant blind test sequences with length 46-200. The length restriction was necessary for comparison with SECRET and CRYSTALP, because these methods only accept input sequences of length 46-200.
The TEST dataset was not restricted by length, and comprised 144 non-redundant blind test sequences (with length range 42-1169). The ParCrys and OB-Score accuracy values over TEST were 71.5% and 64.6%, respectively. Stringent filtering criteria were applied in generating the independent test datasets to remove overlap with data used in method development and to remove redundancy within the test datasets. However, we note that these test data were taken from TargetDB, which is biased due to the selection criteria applied by structural genomics consortia.
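For readers who want to run the same kind of comparison on their own data, the sketch below computes accuracy at a fixed threshold and the area under the ROC curve from a list of scores and true labels. It uses the standard pairwise (Mann-Whitney) formulation of the AUC, which is not necessarily how the figures reported here were produced. The function names are illustrative.

```python
from typing import Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of sequences whose prediction (score > threshold) matches the label."""
    correct = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in dataset size, which is acceptable for test sets of a few hundred sequences such as those described here.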

[ROC Evaluation Of Test Data]

The independent test sets TEST and TEST-RL both contained equal numbers of positive and negative examples. However, from analyses of TargetDB (and PepcDB) we have found a ratio of approximately 8 "work stopped before crystals" (negative) sequences to 1 "diffraction-quality crystals" (positive) sequence. Therefore a new threshold for ParCrys was determined by optimising the Matthews correlation coefficient with 728 positive (TDB_DIF) and 6025 negative (TDB_WS) sequences; this threshold is used to define "High-scoring" proteins. The non-redundant blind test data comprised 72 positive (T_POS72) and 610 negative (T_NEG610) sequences. Using T_POS72 and T_NEG610 and the threshold defined over TDB_DIF and TDB_WS we found an accuracy of 74.0%, with associated area under the receiver operator characteristic curve of 0.738. The datasets used in the development and evaluation of ParCrys can be found here
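The threshold selection described above (choosing the cut-off that maximises the Matthews correlation coefficient on imbalanced data) can be sketched as follows. This is a generic illustration, not the procedure's original code: the candidate thresholds are simply the observed scores, a score strictly above the threshold counts as a positive prediction, and the MCC is defined as 0 when its denominator is zero. All three are assumptions.

```python
import math
from typing import Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_mcc_threshold(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (threshold, mcc) for the observed score that maximises MCC."""
    best_t, best_m = None, -2.0  # MCC is always >= -1
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y)
        tn = sum(1 for s, y in zip(scores, labels) if s <= t and not y)
        m = mcc(tp, tn, fp, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m
```

Unlike accuracy, the MCC accounts for all four confusion-matrix cells, which is why it is the more informative criterion when negatives outnumber positives by roughly 8 to 1.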

Histogram for OB-Scores (TargetDB Data)

The following histogram gives the OB-Score distribution for TargetDB sequences associated with diffraction-quality crystals (TDB_DIF), and the distribution for sequences where work was stopped before crystals were obtained (TDB_WS). The TDB_DIF set is significantly enriched in high OB-Scores compared with the TDB_WS set. Likewise, the TDB_WS set is significantly enriched in low OB-Scores compared with the TDB_DIF set.

[OB-Score Distribution]

For more details see Overton, Padovani, Girolami & Barton (2008) "ParCrys: A Parzen Window Density Estimation Approach to Protein Crystallisation Propensity Prediction." Bioinformatics 24:901-907, and/or Overton & Barton (2006) "A normalised scale for structural genomics target ranking: The OB-Score." FEBS Lett. 580, 4005-4009.

Please contact geoff "at" compbio "dot" dundee "dot" ac "dot" uk

This work was funded by BBSRC

[BBSRC]