Assessing accuracy


Structural similarity is a continuum, and for some fold types opinions differ as to what constitutes "similar". For example, thioredoxin has a β-sheet with helices packing on each side, which superficially resembles a Rossmann fold domain. However, the topology of the sheet differs from that of a Rossmann fold: the connectivity is different, and the sheet contains a mixture of parallel and antiparallel hairpins rather than being all parallel. To build a detailed model of thioredoxin based on a Rossmann fold would be incorrect, but recognising that thioredoxin has a "single sheet with helix on each side" is still useful. For some folds, e.g. the β-trefoils, there is no such ambiguity.

We discuss the accuracy of our method using two grades of success, "strict" and "loose", which are outlined in Table 5. Strict similarities are those where the topology of the structure in the database is a nearly exact match to that found in the query (e.g. plastocyanin and azurin). Loose similarities are those where the topologies are broadly similar, with additional secondary structures in one fold relative to the other, and with some differences in topological ordering or orientation of equivalent secondary structure elements (e.g. plastocyanin and an Ig fold). Strict similarities tend to correspond with those specified by SCOP [Murzin et al., 1995], whereas the loose similarities tend to correspond roughly with those identified by CATH [Orengo et al., 1993] and by the assessors of the protein structure prediction challenge [Lemer et al., 1996].

For comparison, we also scanned the same eleven queries against the database of domains using the fold recognition program THREADER [Jones et al., 1992] with default parameters.

In addition to recognition of the correct fold, it is important to consider how well the query is aligned onto the database structure. Two measures of alignment accuracy are given: (a) the fraction of correct residue equivalences found by each method (% Res-Res), and (b) the fraction of correctly overlapping secondary structure elements found (% Sec-Sec). Secondary structures were considered correctly matched if at least two residues from structurally equivalent secondary structures overlapped in the alignment generated by each method. % Res-Res is a strict definition, and broadly measures how accurate a 3D model would be if based on the alignment found. % Sec-Sec is a looser definition that allows for slippage of secondary structures, and thus indicates the accuracy of the predicted topology. The second measure is arguably the more reliable guide, since for many pairs of similar protein structures, sequence alignments based on 3D structure are ambiguous.

Problems arise when assessing the symmetrical barrel structures. Shifting the alignment of secondary structure elements by one unit can lead to zero accuracy by these measures, even though the resulting structure is largely correct. We thus report average accuracies with and without the barrels.

To assess the overall alignment accuracy of each method, only those strict similarities that were not detectable by a sensitive sequence comparison algorithm [Barton, 1993] were considered. The similarities excluded were those with the globins 1ECA, 1HBG and 1MYGA when scanning with sea hare myoglobin, and that with 1PAZ when scanning with plastocyanin. For all other examples, accuracies were included in the calculation of the average, regardless of whether the similarity was found at or near the top of the ranked lists. A total of 36 strict similarities were used in the calculation.
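
The two alignment measures described above can be illustrated with a minimal sketch. This is not the code used in this work. The function names, the representation of an alignment as a set of (query residue, database residue) index pairs, and the choice of the structure-based reference alignment as the denominator for % Res-Res are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Assumed representation: an alignment is a set of
# (query residue index, database residue index) pairs.
Alignment = Set[Tuple[int, int]]
# Assumed representation: a secondary structure element (SSE) is an
# inclusive (start, end) residue range.
Range = Tuple[int, int]


def percent_res_res(predicted: Alignment, reference: Alignment) -> float:
    """Percentage of reference residue equivalences reproduced exactly."""
    if not reference:
        return 0.0
    return 100.0 * len(predicted & reference) / len(reference)


def percent_sec_sec(
    predicted: Alignment,
    equivalent_sses: List[Tuple[Range, Range]],
    min_overlap: int = 2,
) -> float:
    """Percentage of structurally equivalent SSE pairs that share at least
    `min_overlap` aligned residue pairs in the predicted alignment."""
    if not equivalent_sses:
        return 0.0
    matched = 0
    for (q_start, q_end), (d_start, d_end) in equivalent_sses:
        overlap = sum(
            1
            for q, d in predicted
            if q_start <= q <= q_end and d_start <= d <= d_end
        )
        if overlap >= min_overlap:
            matched += 1
    return 100.0 * matched / len(equivalent_sses)


# Hypothetical example: the reference aligns residue i with residue i,
# and the predicted alignment has slipped by two residues.
reference = {(i, i) for i in range(1, 11)}
shifted = {(i, i + 2) for i in range(1, 9)}
# Two pairs of structurally equivalent SSEs (query range, database range).
sses = [((1, 5), (1, 5)), ((7, 10), (7, 10))]

res_res = percent_res_res(shifted, reference)  # no exact equivalences remain
sec_sec = percent_sec_sec(shifted, sses)       # both SSE pairs still overlap
```

In this invented example the two-residue slippage reduces % Res-Res to zero while % Sec-Sec stays at 100, which is the behaviour that makes the second measure more tolerant of small shifts.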


gjb@bioch.ox.ac.uk