Reference Domain Definitions

A set of 275 non-redundant protein structures was derived from the Brookhaven database. Non-redundancy is based on sequence rather than structure, so some structures from the same family appear in the set. The structures were examined by Dr R. B. Russell (pers. comm.) and subjectively split into domains using knowledge of protein folds and on the basis that domains are globular units which are distinct from the rest of the structure. For proteins in this set that contained more than one domain, the literature was searched for domain definitions in the original publications describing the crystal structure. This set is referred to as the reference set and is shown in Table 1. Table 1 also shows which definitions were derived from the literature (identified by a marker after the name).

It was not possible to produce DSSP [Kabsch & Sander, 1983] files for 40 of the structures. CONTACTS could not be run on a further 4 structures, as it requires all atoms to be present in the file. DOMAK, in its current form, has not been designed to deal with structures in which domains are made up of more than one chain, so kallikrein A was also excluded from the set, although it would be conceptually simple to extend DOMAK to handle this case. The final set analysed therefore contained 230 protein structures. DOMAK required 16.5 hours of CPU time to complete the analysis of this set, an average of 4.3 minutes per protein. Calculation of contacts requires less than 2 minutes for the largest proteins (glycogen phosphorylase, 823 residues: 101 seconds) and about a second for the smaller ones (metallothionein isoform II, 62 residues: 1 second).
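The bookkeeping behind the final set size and the average run time can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the paragraph above; the variable names are illustrative and are not part of DOMAK, DSSP or CONTACTS.

```python
# Counts and timings quoted in the text; names are illustrative only.
initial_structures = 275   # non-redundant structures in the reference set
no_dssp_file = 40          # DSSP files could not be produced
contacts_failed = 4        # CONTACTS requires all atoms to be present
multi_chain_domain = 1     # kallikrein A (domains span more than one chain)

# Structures remaining after the three exclusions.
final_set = (initial_structures - no_dssp_file
             - contacts_failed - multi_chain_domain)

# Average DOMAK run time per protein, in minutes.
total_cpu_hours = 16.5
mean_minutes_per_protein = total_cpu_hours * 60 / final_set

print(final_set)                           # 230
print(round(mean_minutes_per_protein, 1))  # 4.3
```

The three exclusions account exactly for the drop from 275 to 230 structures, and 990 CPU minutes spread over 230 proteins gives the quoted mean of roughly 4.3 minutes.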

as@bioch.ox.ac.uk