This page gives information about the SSPF Crystallisation Propensity Predictors (XANNpred, ParCrys & OB-Score). We envisage that these algorithms would be particularly useful in selecting protein targets for structural studies by X-ray crystallography. This page was first written for people who use these predictors. Last updated 2/5/2011.
XANNpred is a pair of Artificial Neural Networks (XANNpred-PDB and
XANNpred-SG). Proteins with a XANNpred-PDB score above 0.517, or a
XANNpred-SG score above 0.418, are predicted to be "likely
to produce diffraction-quality crystals by current structural
biology techniques". We suggest that the
XANNpred-SG algorithm may be most applicable to "high-throughput"
efforts (e.g. structural genomics consortia), while the
XANNpred-PDB algorithm may be more relevant to the structural
biology community as a whole. This is because XANNpred-PDB
predictions are based on PDB data, while XANNpred-SG predictions
are based on structural genomics data. XANNpred utilizes 428
features, including 20 amino acid and 400 dipeptide frequencies,
sequence length, predicted secondary structure, transmembrane
regions, protein disorder, isoelectric point, hydrophobicity and
molecular weight.
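To make the 20 + 400 composition features concrete, the following is a minimal Python sketch of how amino acid and dipeptide frequencies could be computed, together with the score thresholds quoted above. The function names, the feature ordering and the normalisation (counts divided by the number of residues or residue pairs) are assumptions made for illustration; this is not the XANNpred code, and the remaining features (predicted secondary structure, disorder and so on) are not covered.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

# Score thresholds quoted on this page for the two networks.
THRESHOLDS = {"XANNpred-PDB": 0.517, "XANNpred-SG": 0.418}


def composition_features(seq):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies.

    Normalisation by sequence length is an assumption for illustration.
    """
    seq = seq.upper()
    aa_counts = Counter(seq)
    dp_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_aa = max(len(seq), 1)        # avoid division by zero on empty input
    n_dp = max(len(seq) - 1, 1)
    aa_freqs = [aa_counts[a] / n_aa for a in AMINO_ACIDS]
    dp_freqs = [dp_counts[d] / n_dp for d in DIPEPTIDES]
    return aa_freqs + dp_freqs


def predicted_crystallisable(score, network):
    """True if a network score is above the threshold quoted for that network."""
    return score > THRESHOLDS[network]
```

For example, a score of 0.45 would be above the XANNpred-SG threshold but below the XANNpred-PDB threshold.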
ParCrys is a Parzen Window approach based on calculated isoelectric point, hydrophobicity and the frequencies of S, C, G, F, Y and M residues; also see Overton, Padovani, Girolami & Barton (2008). Bioinformatics 24:901-907. The result "Recalcitrant" indicates that the input sequence is predicted to be "Recalcitrant to Crystallisation"; the result "Amenable" indicates that the input sequence is predicted to be "Amenable to Crystallisation"; the result "High-scoring" indicates that the input sequence falls into the "High-scoring Crystallisation Propensity Prediction" class. The threshold for "High-scoring" predictions was derived by maximising the Matthews correlation coefficient over "real-world" data distributions (with an approximate 1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). The "Amenable" threshold was derived by optimising accuracy over balanced data distributions (i.e. a 1:1 ratio of "diffraction-quality crystals":"work_stopped_before_crystals"). If you use ParCrys, please cite the above reference. We find that ParCrys performs well over data from structural genomics consortia (e.g. TargetDB) - see below.
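As a rough illustration of the Parzen window idea, the sketch below estimates a class density by averaging Gaussian kernels centred on training points, then compares the positive and negative densities at a query point. The Gaussian kernel, the single bandwidth `h`, the ratio used as a score and all function names are assumptions for illustration only; the kernel, bandwidth and scoring actually used by ParCrys are described in the reference above.

```python
import math


def parzen_density(x, samples, h):
    """Gaussian Parzen-window density estimate at point x.

    x is a list of floats; samples is a list of points of the same dimension.
    """
    d = len(x)
    norm = (2 * math.pi) ** (d / 2) * h ** d   # Gaussian normalising constant
    total = 0.0
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-sq_dist / (2 * h * h))
    return total / (len(samples) * norm)


def parzen_score(x, positives, negatives, h=1.0):
    """Fraction of the combined density at x contributed by the positive class."""
    p = parzen_density(x, positives, h)
    n = parzen_density(x, negatives, h)
    return p / (p + n) if p + n > 0 else 0.5
```

In a ParCrys-like setting, each point would hold the calculated isoelectric point, hydrophobicity and the six residue frequencies, and the resulting score would then be compared against the class thresholds.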
The OB-Score is a Z-score scale based on calculated isoelectric point and hydrophobicity; also see Overton & Barton (2006). FEBS Lett. 580:4005-4009.
Positive OB-Score values indicate that the input is more similar to crystallised (PDB) sequences than average in terms of calculated isoelectric point and hydrophobicity. Conversely, negative values indicate that the input is less similar to PDB sequences than average. We found that an OB-Score of 0.809 maximised predictive accuracy over a balanced dataset from TargetDB. Values to define a "High-scoring" class could be 5 or 6 (see histogram below); however, an optimal Matthews correlation coefficient on the "real-world" data (1:8 ratio of "diffraction-quality crystals":"work_stopped_before_crystals") corresponds to an OB-Score of 1.5. If you use OB-Score, please cite the above reference. OB-Score predictions are rapidly calculated and therefore easily produced over large datasets. The algorithm and associated data are available for download from here.
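The cut-offs mentioned above can be applied to a list of OB-Scores with a few lines of Python. This is only a sketch: the function name and the returned keys are hypothetical, and treating a score exactly equal to a cut-off as passing it is an assumption.

```python
BALANCED_ACCURACY_CUTOFF = 0.809  # maximised accuracy on balanced TargetDB data
REAL_WORLD_MCC_CUTOFF = 1.5       # optimal MCC at the approximate 1:8 ratio


def interpret_ob_score(ob_score):
    """Report which of the cut-offs quoted on this page a given OB-Score reaches."""
    return {
        "more_pdb_like_than_average": ob_score > 0,
        "reaches_balanced_accuracy_cutoff": ob_score >= BALANCED_ACCURACY_CUTOFF,
        "reaches_real_world_mcc_cutoff": ob_score >= REAL_WORLD_MCC_CUTOFF,
    }


# Example: rank a small, made-up set of scores from most to least promising.
scores = {"target_a": 2.1, "target_b": -0.4, "target_c": 1.0}
ranked = sorted(scores, key=scores.get, reverse=True)
```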
The sequence statistics reported along with OB-Score and ParCrys results are calculated as follows: GRAVY (GRand AVerage of hydrophobicitY) is derived with the GES scale, and pI (isoelectric point) is derived using BioPerl with EMBOSS-defined pKa values. Sequence length reflects the length of the cleaned input sequence: non-amino acid characters are removed prior to all calculations, and "B", "Z", "J", "O" and "U" are transformed to "X".
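The cleaning step and the GRAVY average can be sketched in Python as follows. Several details here are assumptions rather than the server's documented behaviour: input is upper-cased, "non-amino acid characters" is taken to mean anything that is not an ASCII letter, and residues missing from the hydrophobicity scale (such as "X") are skipped when averaging. The scale is passed in as a parameter; the two-residue scale in the usage lines is a toy example, not the GES scale.

```python
# Ambiguous or non-standard residue codes are mapped to "X".
TO_X = str.maketrans("BZJOU", "XXXXX")


def clean_sequence(raw):
    """Upper-case the input, drop non-letter characters, map B/Z/J/O/U to X."""
    letters = "".join(ch for ch in raw.upper() if "A" <= ch <= "Z")
    return letters.translate(TO_X)


def gravy(seq, scale):
    """Mean hydrophobicity over the residues of seq that appear in scale."""
    values = [scale[r] for r in seq if r in scale]
    return sum(values) / len(values) if values else 0.0


# Usage with a toy scale (illustrative values only, not the GES scale).
toy_scale = {"A": 1.0, "K": -1.0}
cleaned = clean_sequence(">x 1\nAAk-B*")
length = len(cleaned)
score = gravy(cleaned, toy_scale)
```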
The graph below shows Receiver Operator Characteristic (ROC)
curves comparing ParCrys and the
OB-Score with other publicly available methods, SECRET and
CRYSTALP. ParCrys was found to perform best on the data examined. See below for additional explanation
and further results. Over the TEST-RL dataset, ParCrys had an accuracy of 79.1%, outperforming
OB-Score (69.8%), SECRET (58.1%) and CRYSTALP (46.5%). TEST-RL
comprises 86 non-redundant blind test sequences with length
46-200. The length restriction was necessary for comparison to
SECRET and CRYSTALP, because these methods only accept as input
sequences of length 46-200.
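For readers unfamiliar with ROC curves, the sketch below shows one common way to build the curve from prediction scores and true labels, and to compute the area under it with the trapezoidal rule. It is a generic illustration with hypothetical function names, not the code used to produce the graph; it assumes higher scores mean "more likely to crystallise" and that both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    labels are 1 (positive) or 0 (negative). The threshold is swept from the
    highest score downwards; tied scores are handled as a single step.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A perfect ranking gives an area of 1.0, while a predictor that cannot separate the classes gives about 0.5.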
The independent test sets TEST and TEST-RL both contained equal numbers of positive and negative examples. However, from analyses of TargetDB (and PepcDB) we have found a ratio of approximately 8 "work stopped before crystals" (negative) sequences to 1 "diffraction-quality crystals" (positive) sequence. Therefore, a new threshold for ParCrys was determined by optimising the Matthews correlation coefficient over 728 positive (TDB_DIF) and 6025 negative (TDB_WS) sequences; this threshold is used to define "High-scoring" proteins. The non-redundant blind test data comprised 72 positive (T_POS72) and 610 negative (T_NEG610) sequences. Using T_POS72 and T_NEG610 with the threshold defined over TDB_DIF and TDB_WS, we found an accuracy of 74.0%, with an associated area under the receiver operator characteristic curve of 0.738. The datasets used in the development and evaluation of ParCrys can be found here.
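Choosing a threshold by optimising the Matthews correlation coefficient (MCC) can be sketched as a simple search over candidate cut-offs. The code below is a generic illustration with hypothetical function names, not the procedure used for ParCrys; it assumes that a score at or above the threshold is a positive prediction, and it tries every distinct score as a candidate.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_mcc_threshold(scores, labels):
    """Return (threshold, mcc) for the candidate threshold with the highest MCC."""
    best_value, best_threshold = float("-inf"), None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and not y)
        value = mcc(tp, tn, fp, fn)
        if value > best_value:
            best_value, best_threshold = value, t
    return best_threshold, best_value
```

Unlike accuracy, MCC remains informative when one class is much larger than the other, which is why it suits the approximate 1:8 ratio described above.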
The following histogram gives the OB-Score distribution for TargetDB sequences associated with diffraction-quality crystals (TDB_DIF), and the distribution for sequences where work has been stopped before crystals were obtained (TDB_WS). The set of sequences associated with diffraction-quality crystals (TDB_DIF) is significantly enriched in high OB-Scores compared to the sequences where work was stopped before crystals were obtained (TDB_WS). Also, the set of "work stopped" (TDB_WS) sequences is significantly enriched in low OB-Scores compared to the "diffraction-quality crystals" (TDB_DIF) sequences.
Please contact gjbarton "at" dundee "dot" ac "dot" uk
This work was funded by BBSRC