Introduction

This page describes the SSPF Crystallisation Propensity Predictors (XANNpred, ParCrys and OB-Score). We envisage that these algorithms will be particularly useful for selecting protein targets for structural studies by X-ray crystallography. This page was originally written for users of these predictors.

Last updated 2/5/2011.

XANNpred Predictions

XANNpred is a pair of Artificial Neural Networks (XANNpred-PDB, XANNpred-SG). Proteins with XANNpred-PDB or XANNpred-SG scores above 0.517 or 0.418, respectively, are predicted to be "likely to produce diffraction-quality crystals by current structural biology techniques". We suggest that the XANNpred-SG algorithm may be most applicable to "high-throughput" efforts (e.g. structural genomics consortia), while the XANNpred-PDB algorithm may be more relevant to the structural biology community as a whole. This is because XANNpred-PDB predictions are based on PDB data, while XANNpred-SG predictions are based on structural genomics data. XANNpred uses 428 features, including 20 amino acid and 400 dipeptide frequencies, sequence length, predicted secondary structure, transmembrane regions, protein disorder, isoelectric point, hydrophobicity and molecular weight.
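As an illustration of the composition features and cutoffs described above, the following Python sketch computes the 20 amino acid and 400 dipeptide frequencies for a sequence and compares a score with a cutoff. It is not the XANNpred implementation: the function names are hypothetical, and the normalisation (counts divided by sequence length, or by the number of adjacent residue pairs) is an assumption, since this page does not state how the frequencies are scaled.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Cutoffs quoted in the text above.
XANNPRED_PDB_CUTOFF = 0.517
XANNPRED_SG_CUTOFF = 0.418


def composition_features(sequence):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies.

    Normalisation is an assumption made for this sketch.
    """
    seq = sequence.upper()
    n_res = max(len(seq), 1)
    n_pairs = max(len(seq) - 1, 1)
    mono = [seq.count(a) / n_res for a in AMINO_ACIDS]
    pair_counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in pair_counts:  # pairs containing non-standard residues are skipped
            pair_counts[pair] += 1
    di = [pair_counts[a + b] / n_pairs for a, b in product(AMINO_ACIDS, repeat=2)]
    return mono + di


def is_predicted_crystallisable(score, cutoff):
    """True when a network score lies above its cutoff."""
    return score > cutoff
```

For example, a score of 0.45 would fall below the XANNpred-PDB cutoff but above the XANNpred-SG cutoff.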
On the data examined, XANNpred-PDB and XANNpred-SG each outperform the other publicly available algorithms (XtalPred, PXS, ParCrys and OB-Score). XANNpred results include sliding window plots that show the XANNpred score against the centre position of a 61-residue sliding window over the input sequence. We suggest that these sliding window plots may be helpful for construct design. If you use XANNpred, please cite: Overton, I.M., van Niekerk, C.A.J., and Barton, G.J. (2011), XANNpred: Neural nets that predict the propensity of a protein to yield diffraction-quality crystals. Proteins 79, 1027-1033. XANNpred predictions are accessible from here.
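The sliding window plots can be pictured with a short Python sketch that lists the centre position of every full 61-residue window. The function names and the score_fn argument are hypothetical stand-ins for the predictor, and the treatment of sequence ends (only full windows are used here) is an assumption rather than a description of the server.

```python
WINDOW = 61  # window length quoted in the text above


def sliding_windows(sequence, window=WINDOW):
    """Yield (centre_position, subsequence) for every full window.

    Centre positions are 1-based residue numbers.
    """
    half = window // 2
    for start in range(len(sequence) - window + 1):
        yield start + half + 1, sequence[start:start + window]


def window_profile(sequence, score_fn, window=WINDOW):
    """Score each full window with score_fn (a placeholder for the predictor)."""
    return [(centre, score_fn(sub)) for centre, sub in sliding_windows(sequence, window)]
```

Under these assumptions, a sequence shorter than 61 residues yields no windows, and the first centre position of a longer sequence is residue 31.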

ParCrys Predictions

ParCrys is a Parzen Window approach based on calculated isoelectric point, hydrophobicity and the frequencies of S, C, G, F, Y and M residues; also see Overton, Padovani, Girolami & Barton (2008). Bioinformatics 24:901-907. The result "Recalcitrant" indicates that the input sequence is predicted to be "Recalcitrant to Crystallisation"; the result "Amenable" indicates that the input sequence is predicted to be "Amenable to Crystallisation"; the result "High-scoring" indicates that the input sequence falls into the "High-scoring Crystallisation Propensity Prediction" class. The threshold for "High-scoring" predictions was derived by maximising the Matthews correlation coefficient over "real-world" data distributions (with an approximate 1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). The "Amenable" threshold was derived by optimising accuracy over balanced data distributions (i.e. a 1:1 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). If you use ParCrys, please cite the above reference. We find that ParCrys performs well over data from structural genomics consortia (e.g. TargetDB); see below.
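For readers unfamiliar with Parzen windows, the sketch below shows a generic Gaussian Parzen window density estimate in Python. It is not the ParCrys implementation: the kernel, the bandwidth, the feature scaling and the way densities are turned into the three result classes are described in the cited paper and are not reproduced here. All names are hypothetical.

```python
import math


def parzen_density(x, training_points, bandwidth):
    """Gaussian Parzen window estimate of the density at point x.

    x and each training point are equal-length sequences of floats
    (for example pI, hydrophobicity and residue frequencies).
    """
    d = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for point in training_points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, point))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(training_points) * norm)


def more_like_positives(x, positives, negatives, bandwidth):
    """Compare the estimated density of x under two training sets."""
    return parzen_density(x, positives, bandwidth) > parzen_density(x, negatives, bandwidth)
```

The idea is that a query sequence whose features lie close to many crystallised examples receives a high density under the positive set.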

OB-Score Predictions

The OB-Score is a Z-score scale based on calculated isoelectric point and hydrophobicity; also see Overton & Barton (2006). FEBS Lett. 580, 4005-4009.

Positive OB-Score values indicate that the input is more similar to crystallised (PDB) sequences than average in terms of calculated isoelectric point and hydrophobicity. Conversely, negative values indicate that the input is less similar to PDB sequences than average. We found that an OB-Score of 0.809 maximised predictive accuracy over a balanced dataset from TargetDB. Values to define a "High-scoring" class could be 5 or 6 (see histogram below); however, an optimal Matthews correlation coefficient on the "real-world" data (1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals") corresponds to an OB-Score of 1.5. If you use OB-Score, please cite the above reference. OB-Score predictions are rapidly calculated and therefore easily produced over large datasets. The algorithm and associated data are available for download from here.
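The cutoffs discussed above can be applied to a list of precomputed OB-Scores with a few lines of Python. This sketch does not calculate the OB-Score itself. The label strings are hypothetical, and treating a score equal to a cutoff as passing it is an assumption.

```python
ACCURACY_CUTOFF = 0.809  # maximised accuracy on balanced TargetDB data
MCC_CUTOFF = 1.5         # optimal MCC on the approximately 1:8 data


def describe_ob_score(score):
    """Return a hypothetical label placing a score relative to the cutoffs."""
    if score >= MCC_CUTOFF:
        return "above MCC-optimal cutoff"
    if score >= ACCURACY_CUTOFF:
        return "above accuracy-optimal cutoff"
    if score > 0:
        return "positive, below both cutoffs"
    return "zero or negative"
```

Because the scores are cheap to compute, the same function could be mapped over a whole genome's worth of values to rank targets.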

Sequence Statistics (OB-Score and ParCrys)

The sequence statistics reported along with OB-Score and ParCrys results are calculated as follows. GRAVY (GRand AVerage of hydrophobicitY) is derived with the GES scale. pI (isoelectric point) is derived using BioPerl with EMBOSS-defined pKa values. Sequence length is the length of the cleaned input sequence: non-amino acid characters are removed prior to all calculations, and "B", "Z", "J", "O" and "U" are transformed to "X".
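The cleaning step and the GRAVY average can be sketched in Python as follows. Several points are assumptions: "non-amino acid characters" is taken to mean anything other than the letters A to Z, lower-case input is upper-cased, and residues absent from the supplied scale (such as "X") are left out of the average. The hydrophobicity scale is passed in by the caller; the GES values are not reproduced here, and the function names are hypothetical.

```python
AMBIGUOUS = set("BZJOU")  # codes transformed to "X", as listed above


def clean_sequence(raw):
    """Drop non-letter characters and map B, Z, J, O and U to X."""
    cleaned = []
    for ch in raw.upper():
        if not ("A" <= ch <= "Z"):
            continue  # digits, whitespace, gaps, '*' and so on
        cleaned.append("X" if ch in AMBIGUOUS else ch)
    return "".join(cleaned)


def gravy(sequence, scale):
    """Mean hydrophobicity over residues present in scale; None if there are none."""
    values = [scale[res] for res in sequence if res in scale]
    if not values:
        return None
    return sum(values) / len(values)
```

The reported sequence length would then correspond to len(clean_sequence(raw)) under these assumptions.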

Evaluation of ParCrys and OB-Score (TargetDB Non-redundant Blind Test Data)

The graph below shows Receiver Operator Characteristic (ROC) curves for comparison of ParCrys and the OB-Score with other publicly available methods, SECRET and CRYSTALP. ParCrys was found to perform best on the data examined. See below for additional explanation and further results. Over the TEST-RL dataset, ParCrys had an accuracy of 79.1%, outperforming OB-Score (69.8%), SECRET (58.1%) and CRYSTALP (46.5%). TEST-RL comprises 86 non-redundant blind test sequences with length 46-200. The length restriction was necessary for comparison to SECRET and CRYSTALP, because these methods only accept input sequences of length 46-200.
The TEST dataset was not restricted by length, and comprised 144 non-redundant blind test sequences (with length range 42-1169). The ParCrys and OB-Score accuracy values over TEST were 71.5% and 64.6%, respectively. Stringent filtering criteria were applied in generating the independent test datasets to remove overlap with data used in method development and to remove redundancy within the test datasets. However, these test data were taken from TargetDB, which is biased due to the selection criteria applied by structural genomics consortia.
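The two evaluation measures used in this section, accuracy at a threshold and the area under the ROC curve, can be computed from scores and labels as in the Python sketch below. This is a generic illustration rather than the evaluation code used for the published results. The function names are hypothetical, and counting a score equal to the threshold as a positive prediction is an assumption.

```python
def accuracy(scores, labels, threshold):
    """Fraction of sequences whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count one half.

    Equivalent to the probability that a randomly chosen positive
    scores higher than a randomly chosen negative.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, the area under the curve does not depend on any single threshold, which is why both are reported.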

The independent test sets TEST and TEST-RL both contained equal numbers of positive and negative examples. However, from analyses of TargetDB (and PepcDB) we have found a ratio of approximately 8 "work stopped before crystals" (negative) sequences to 1 "diffraction-quality crystals" (positive) sequence. Therefore a new threshold for ParCrys was determined by optimising the Matthews correlation coefficient with 728 positive (TDB_DIF) and 6025 negative (TDB_WS) sequences; this threshold is used to define "High-scoring" proteins. The non-redundant blind test data comprised 72 positive (T_POS72) and 610 negative (T_NEG610) sequences. Using T_POS72 and T_NEG610 and the threshold defined over TDB_DIF and TDB_WS, we found an accuracy of 74.0%, with an associated area under the receiver operator characteristic curve of 0.738. The datasets used in the development and evaluation of ParCrys can be found here.
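The Matthews correlation coefficient is preferred on imbalanced data because a predictor that labels every sequence negative would reach roughly 8/9 accuracy on a 1:8 dataset while having an MCC of zero. The Python sketch below shows the standard MCC formula and a simple scan for the score threshold that maximises it. The scan over observed score values is an assumption made for illustration and is not necessarily the optimisation procedure used for ParCrys; the function names are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def best_mcc_threshold(scores, labels):
    """Return (threshold, mcc) maximising MCC over the observed score values."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and not y)
        value = mcc(tp, tn, fp, fn)
        if value > best[1]:
            best = (t, value)
    return best
```

A threshold chosen this way on the development data would then be applied unchanged to the blind test sequences.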

Histogram for OB-Scores (TargetDB Data)

The following histogram gives the OB-Score distribution for TargetDB sequences associated with diffraction-quality crystals (TDB_DIF), and the distribution for sequences where work was stopped before crystals were obtained (TDB_WS). The set of sequences associated with diffraction-quality crystals (TDB_DIF) is significantly enriched in high OB-Scores compared to the sequences where work was stopped before crystals were obtained (TDB_WS). Likewise, the set of "work stopped" (TDB_WS) sequences is significantly enriched in low OB-Scores compared to the "diffraction-quality crystals" (TDB_DIF) sequences.

For more details see Overton, Padovani, Girolami & Barton (2008) "ParCrys: A Parzen Window Density Estimation Approach to Protein Crystallisation Propensity Prediction." Bioinformatics 24:901-907, and/or Overton & Barton (2006) "A normalised scale for structural genomics target ranking: The OB-Score." FEBS Lett. 580, 4005-4009.

Please contact gjbarton "at" dundee "dot" ac "dot" uk

This work was funded by BBSRC