This page provides more information about the SSPF Crystallisation Predictors (XANNpred, ParCrys & OB-Score). We envisage that prediction of crystallisation propensity would be particularly useful for selecting protein targets for structural studies by X-ray crystallography. This page was first written for people who use XANNpred, ParCrys and/or the OB-Score. Last updated 18/08/2009.
XANNpred is a pair of Artificial Neural Networks (XANNpred-PDB, XANNpred-SG) based
on 428 features, including 20 amino acid and 400 dipeptide
frequencies, sequence length, predicted secondary structure, transmembrane regions, protein disorder, isoelectric point, hydrophobicity
and molecular weight. On the data
examined, XANNpred-PDB and XANNpred-SG
each outperform the other publicly available algorithms (XtalPred,
PXS, ParCrys and OB-Score). Proteins with XANNpred-PDB or XANNpred-SG
scores above 0.517 or 0.418, respectively, are predicted to be "likely to produce diffraction-quality crystals
by current structural biology techniques". We suggest
that the XANNpred-SG algorithm may be most applicable
to "high-throughput" efforts (e.g. structural genomics consortia), while the XANNpred-PDB algorithm may be more relevant to the
structural biology community as a whole. This is because XANNpred-PDB predictions are based on PDB data, while
XANNpred-SG predictions are based on structural
genomics data. The XANNpred sliding window plots show
the XANNpred score against the centre position of a
61-residue sliding window over the input sequence. We suggest that these sliding
window plots may be helpful for construct design. If you use XANNpred, please cite "Overton, van Niekerk
& Barton (2009), XANNpred: Neural
Nets to Predict Diffraction-quality |
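As a rough illustration of two ideas from the XANNpred description above, the sketch below builds the 20 amino acid and 400 dipeptide frequency features and scans a 61-residue window along a sequence. It is not the XANNpred implementation: `toy_score` is a hypothetical stand-in for the trained network, all function names are invented for this example, and both the frequency normalisation and the handling of sequences shorter than the window (here, no windows are returned) are assumptions.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 61  # window length stated in the text
THRESHOLDS = {"XANNpred-PDB": 0.517, "XANNpred-SG": 0.418}  # stated in the text


def composition_features(seq: str) -> Dict[str, float]:
    """Return 20 amino acid + 400 dipeptide frequencies (420 values).

    Assumption: residue counts are divided by the sequence length and
    dipeptide counts by the number of overlapping pairs.
    """
    seq = seq.upper()
    feats = {aa: 0.0 for aa in AMINO_ACIDS}
    feats.update({a + b: 0.0 for a, b in product(AMINO_ACIDS, repeat=2)})
    for aa in seq:
        if aa in feats:  # non-standard residues are ignored
            feats[aa] += 1.0 / len(seq)
    pairs = max(len(seq) - 1, 1)
    for i in range(len(seq) - 1):
        dipeptide = seq[i:i + 2]
        if dipeptide in feats:
            feats[dipeptide] += 1.0 / pairs
    return feats


def toy_score(segment: str) -> float:
    """Hypothetical placeholder score; NOT the XANNpred neural network."""
    return composition_features(segment)["G"]


def sliding_window_scores(
    seq: str, score_fn: Callable[[str], float], window: int = WINDOW
) -> List[Tuple[int, float]]:
    """Return (centre position, score) pairs; positions are 1-based."""
    half = window // 2
    return [
        (start + half + 1, score_fn(seq[start:start + window]))
        for start in range(len(seq) - window + 1)
    ]


def predicted_crystallisable(score: float, network: str) -> bool:
    """Apply the published threshold for the named network."""
    return score > THRESHOLDS[network]
```

Plotting the second element of each pair against the first would give a plot in the style described above. For a 70-residue input this sketch yields ten windows, centred on residues 31 to 40.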
ParCrys is a Parzen Window approach based on calculated isoelectric point, hydrophobicity and the frequencies of S, C, G, F, Y and M residues; also see Overton, Padovani, Girolami & Barton (2008). Bioinformatics 24:901-907. ParCrys reports one of three results: "Recalcitrant" indicates that the input sequence is predicted to be "Recalcitrant to Crystallisation"; "Amenable" indicates that it is predicted to be "Amenable to Crystallisation"; and "High-scoring" indicates that it falls into the "High-scoring Crystallisation Propensity Prediction" class. The threshold for "High-scoring" predictions was derived by maximising the Matthews correlation coefficient over "real-world" data distributions (with an approximate 1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). The "Amenable" threshold was derived by optimising accuracy over balanced data distributions (i.e. a 1:1 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). If you use ParCrys, please cite the above reference. We find that ParCrys performs well over data from structural genomics consortia (e.g. TargetDB); see below.
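The idea of choosing a threshold by maximising the Matthews correlation coefficient (MCC) can be sketched as follows. This is a minimal illustration, not the ParCrys code: the function names, the convention that a sequence is called positive when its score is at or above the threshold, and the choice of candidate thresholds (every observed score) are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when the MCC is undefined
    return (tp * tn - fp * fn) / denom


def best_mcc_threshold(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, MCC) for the candidate threshold with highest MCC.

    pos_scores: scores of "diffraction-quality crystals" sequences.
    neg_scores: scores of "work stopped before crystals" sequences.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    best_t, best_m = 0.0, -2.0  # MCC is always >= -1
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= t for s in neg_scores)
        tn = len(neg_scores) - fp
        m = mcc(tp, fp, tn, fn)
        if m > best_m:  # ties keep the lowest threshold
            best_t, best_m = t, m
    return best_t, best_m
```

Supplying roughly eight negative scores for every positive score would mimic the "real-world" distribution described above. Replacing `mcc` with plain accuracy and supplying equal numbers of each class would correspond to the balanced case.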
The OB-Score is a Z-score scale based on calculated isoelectric point and hydrophobicity; also see Overton & Barton (2006). FEBS Lett. 580, 4005-4009.
Positive OB-Score values indicate that the input is more similar to crystallised (PDB) sequences than the average (UniRef50) protein sequence in terms of calculated isoelectric point and hydrophobicity. Conversely, negative OB-Scores indicate that the input is less similar to known crystallised sequences than the average protein sequence in terms of isoelectric point and hydrophobicity. We found that an OB-Score threshold value of 0.809 maximised predictive accuracy over a balanced dataset from TargetDB. An OB-Score threshold to define a "High-scoring" class could be a value around 5 or 6 (see histogram below); however, an optimal Matthews correlation coefficient on the "real-world" data (having a 1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals") is found for an OB-Score threshold of 1.5. If you use OB-Score, please cite the above reference. OB-Score predictions are rapidly calculated and therefore easily produced over large datasets. The OB-Score algorithm and associated data are available for download.
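The thresholds quoted above can be applied to precomputed OB-Scores as in the short sketch below. It does not calculate the OB-Score itself; the function name, the dictionary keys and the use of "at or above" for the two thresholds are assumptions made for illustration.

```python
from typing import Dict

BALANCED_ACCURACY_THRESHOLD = 0.809  # from the text (balanced TargetDB data)
REAL_WORLD_MCC_THRESHOLD = 1.5       # from the text (1:8 "real-world" data)


def interpret_ob_score(ob_score: float) -> Dict[str, bool]:
    """Summarise how a precomputed OB-Score compares with the quoted values."""
    return {
        # Positive: more like crystallised (PDB) sequences than average
        "more_pdb_like_than_average": ob_score > 0,
        "passes_balanced_threshold": ob_score >= BALANCED_ACCURACY_THRESHOLD,
        "passes_real_world_threshold": ob_score >= REAL_WORLD_MCC_THRESHOLD,
    }
```

For a large dataset, the same function could be mapped over a table of scores, for example `{name: interpret_ob_score(s) for name, s in scores.items()}` where `scores` is a hypothetical dictionary of sequence names to OB-Scores.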
The sequence statistics reported
along with OB-Score and ParCrys results are
calculated as follows: GRAVY (GRand AVerage of hydrophobicitY) is derived with the |
The graph below compares ParCrys
and the OB-Score with other publicly available methods, SECRET and CRYSTALP.
The graph shows a Receiver Operating Characteristic (ROC) curve for the methods.
ParCrys was found to outperform the other methods on
the data examined. See below for additional explanation and further results.
Over the TEST-RL dataset, ParCrys achieved
an accuracy of 79.1%, compared with the OB-Score (69.8%), SECRET
(58.1%) and CRYSTALP (46.5%). TEST-RL comprises 86 non-redundant blind test
sequences of length 46-200. The length restriction was necessary for
comparison with SECRET and CRYSTALP, because these methods only accept
input sequences of length 46-200.
![[ROC Evaluation Of Test Data]](Guide_files/image005.gif)
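For readers unfamiliar with ROC curves, the sketch below shows one way to compute such a curve, and the area under it, from per-sequence scores. The function names and the convention that a score at or above the threshold counts as a positive prediction are assumptions; this is not the evaluation code used to produce the graph.

```python
from typing import List, Sequence, Tuple


def roc_points(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    Thresholds are swept from the highest observed score to the lowest,
    so the curve runs from (0, 0) to (1, 1).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    points = [(0.0, 0.0)]
    for t in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A method that ranks every positive sequence above every negative one gives an area of 1.0, while random ranking gives an area near 0.5.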
The independent test sets TEST and TEST-RL both contained equal numbers of
positive and negative examples. However, from analyses of TargetDB
(and PepcDB) we have found a ratio of approximately 8
"work stopped before crystals" (negative) sequences to 1 "diffraction-quality
crystals" (positive) sequence. Therefore a new threshold for ParCrys was determined by optimising the Matthews
correlation coefficient with 728 positive (TDB_DIF) and 6025 negative (TDB_WS)
sequences; this threshold is used to define "High-scoring" proteins. The non-redundant
blind test data comprised 72 positive (T_ |
The following histogram gives the OB-Score distribution for TargetDB sequences associated with diffraction-quality crystals (TDB_DIF), and the distribution for sequences where work was stopped before crystals were obtained (TDB_WS). The TDB_DIF set is significantly enriched in high OB-Scores compared to the TDB_WS set. Conversely, the TDB_WS set is significantly enriched in low OB-Scores compared to the TDB_DIF set.
![[OB-Score Distribution]](Guide_files/image006.jpg)
Please contact geoff "at" compbio "dot" dundee "dot" ac "dot" uk
This work was funded by BBSRC