Defining domains



Most of the database creation and update processes (sequence comparison, clustering, structure comparison, etc.) are performed automatically. However, no automatic method can define structural domains accurately and consistently in all proteins, and multi-chain domains are especially difficult for automatic methods. Defining domains therefore had to be carried out, in part, by eye or with reference to the literature. Sequence alignment of chains within a sequence family allows some domains to be defined automatically by similarity, but, as Figure 6 illustrates, some of these automatic definitions must subsequently be checked by eye as well. To simplify this process, HTTP-based client-server software was developed to control the creation, editing and updating of domain definitions in the database.
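The idea of defining domains by similarity can be sketched as projecting the domain segments of an already annotated chain onto a second chain through a pairwise alignment. The following is a minimal illustration only, not the software described here: the function names, the 1-based sequence positions and the toy sequences are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based sequence positions (assumed)


def map_positions(template_aln: str, target_aln: str) -> Dict[int, int]:
    """Map template positions to target positions for aligned columns.

    Both arguments are rows of one pairwise alignment; '-' marks a gap.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned rows must have equal length")
    mapping: Dict[int, int] = {}
    t_pos = q_pos = 0
    for t_char, q_char in zip(template_aln, target_aln):
        if t_char != "-":
            t_pos += 1
        if q_char != "-":
            q_pos += 1
        if t_char != "-" and q_char != "-":
            mapping[t_pos] = q_pos
    return mapping


def transfer_domains(
    template_domains: Dict[str, List[Segment]],
    template_aln: str,
    target_aln: str,
) -> Dict[str, List[Segment]]:
    """Project template segments onto the target; unaligned parts are dropped."""
    mapping = map_positions(template_aln, target_aln)
    result: Dict[str, List[Segment]] = {}
    for name, segments in template_domains.items():
        projected = []
        for start, end in segments:
            hits = [mapping[p] for p in range(start, end + 1) if p in mapping]
            if hits:
                projected.append((min(hits), max(hits)))
        if projected:
            result[name] = projected
    return result


# Hypothetical toy alignment: the target lacks template residues 3 and 8.
template_domains = {"I": [(1, 4)], "II": [(5, 8)]}
print(transfer_domains(template_domains, "MKTAYIAK", "MK-AYIA-"))
```

A projection of this kind inherits the number of domains from the template, so on its own it cannot recognise a target that should have fewer domains. This is one reason why automatically derived definitions still need a visual check.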

$\includegraphics[scale=0.9]{figures/1mela_rot.eps}$

**Figure 6:** Domain definition (domain I: red; domain II: green) for chain 1mela (top) of the single-domain antibody [Desmyter *et al.*, 1996]. The definition was created automatically by sequence alignment with chain 1igch [Derrick and Wigley, 1994] from the same sequence family (bottom). As the amino acid residues of these immunoglobulin (Ig) molecules are dissimilar in one of the hypervariable regions and the template 1igch consists of two Ig domains, the sequences of 1mela and 1igch could not be aligned properly and the automatic domain definition failed. Instead of having two domains, the definition for 1mela should consist of a single domain.
$\includegraphics[scale=0.9]{figures/1igch_rot.eps}$

A web browser is a good solution for maintaining the domain definitions, for five reasons: (1) the multi-line text entry forms of the Hypertext Markup Language (HTML) allow data to be entered and modified in the database; (2) it is easy to "link" to related information that assists decisions on the correct domain definition, for example to see how corresponding domains have been defined in a sequence-similar protein; (3) consistency and error checking programs can be incorporated directly into the domain definition process to prevent incorrect data from entering the database; (4) domain definitions may be checked simultaneously by different people in different locations; (5) once domain definitions have been produced, they can be viewed and checked using a RasMol [Sayle and Milner-White, 1995] WWW/PDB interface.

During a database update, a "chain page" can be accessed for every chain in the database, either via a search engine or via another web page listing all chains without a domain definition. The chain page provides general information about the protein (e.g. compound, author, resolution) and about the chain (e.g. start and end residues, number of residues). Figure 7 shows the main part of the chain page for L-lactate dehydrogenase from Lactobacillus casei, 1llc [Bühner and Hecht, 1987]. After the general information and the domain definition, there are links to alternative domain definitions, to the sequence family, and to a page for modifying the domain definitions. These are followed by links providing access to PDB-related information such as the header of the PDB file and the likely quaternary structure calculated by the PQS server [Henrick and Thornton, 1998]. At the bottom of the page, other sources of protein domain definitions can be accessed, namely the CATH [Orengo et al., 1997], FSSP [Holm and Sander, 1998] and SCOP [Murzin et al., 1995] protein structure classifications (section of page not shown).
Existing domain definitions may be checked via a link to the RasMol interface (View domains) and then updated, or a new domain definition may be created. If no domain definition exists, one can be typed into the text entry form of the "Domain Maker" web page shown in Figure 8. If a domain definition already exists, it can either be edited directly using the "Domain Editor" shown in Figure 9 or be updated using the "Domain Updater" page illustrated in Figure 10. Changing domain definitions via the Domain Updater has the advantage that existing definitions are retained. If a domain definition is edited directly in the Domain Editor, the domain identifier is not changed and the database programs will assume that the domain definition has remained unchanged. This is useful if only a minor correction has been made to a definition. With the Domain Maker and the Domain Updater, the update scripts automatically derive a new identifier from the initials of the annotator.

Several consistency and error checks are carried out before any data are added to the database. The format of the data is checked to verify that the contents have valid syntax, that there are no duplicate identifiers, and that any identifiers specified as default or equally valid domain definitions do indeed exist. Checks are also carried out to ensure that the domain definitions themselves are valid: start residues come before end residues, there are no overlapping segments, and all residues mentioned do exist.

Since domain definitions are made chain by chain, multi-chain domain definitions are repeated for every chain they contain. Default or equally valid definitions of multi-chain domains are compared for all chains that are part of the multi-chain domain. Differences between the domain definitions are reported and the annotator is requested to fix the problem. Errors that are not fixed are detected by a checking script at the next stage of the update process.
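The comparison of multi-chain definitions across chains can be illustrated with a small sketch. It assumes, purely for the example, that each chain stores its own copy of the definition as a list of (chain, start, end) segments; the data layout, chain identifiers and message wording are hypothetical and not taken from the actual update scripts.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (chain id, start residue, end residue), assumed layout


def find_multichain_conflicts(copies: Dict[str, List[Segment]]) -> List[str]:
    """Compare the per-chain copies of one multi-chain domain definition.

    `copies` maps the chain under which a copy is stored to the segments
    listed in that copy. Returns one message per problem found.
    """
    messages: List[str] = []
    if not copies:
        return messages
    # Every chain mentioned in any copy should also store its own copy.
    mentioned = {seg[0] for segs in copies.values() for seg in segs}
    for chain in sorted(mentioned - set(copies)):
        messages.append(f"chain {chain}: no copy of the definition stored")
    # Compare every copy against the copy of the first chain (sorted order).
    reference_chain = min(copies)
    reference = frozenset(copies[reference_chain])
    for chain in sorted(copies):
        diff = reference ^ frozenset(copies[chain])
        if diff:
            messages.append(
                f"chain {chain}: differs from chain {reference_chain} in {sorted(diff)}"
            )
    return messages


# Hypothetical example: the copy stored under chain B ends at a different residue.
copies = {
    "A": [("A", 1, 50), ("B", 1, 40)],
    "B": [("A", 1, 50), ("B", 1, 45)],
}
for message in find_multichain_conflicts(copies):
    print(message)
```

Comparing the copies as sets makes the check independent of the order in which an annotator typed the segments.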
To prevent such inconsistencies between chains completely, newly entered default or equally valid multi-chain domain definitions are copied to all chains which are part of the definition and for which no domain definition exists. The top of Figure 11 shows some example error messages, while the bottom shows a page indicating that the update was successful. The consistency and error checking tools minimise the human input necessary during a database update. Automatic checking and easy visualisation of domains lower the chance of erroneous domain definitions being added to the database.
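The validity checks on a single-chain definition described above (start before end, no overlapping segments, residues exist) might look roughly like the following sketch. It is an illustration under simplifying assumptions, not the checking code used for the database: residues are plain integers without insertion codes, only segment boundaries are tested for existence, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Segment = Tuple[int, int]  # (start residue, end residue), integers assumed


def check_definition(
    domains: Dict[str, List[Segment]], chain_residues: Set[int]
) -> List[str]:
    """Return error messages for one chain's domain definition."""
    errors: List[str] = []
    valid: List[Tuple[int, int, str]] = []
    for name, segments in domains.items():
        for start, end in segments:
            if start > end:
                errors.append(f"{name}: start {start} comes after end {end}")
                continue
            for residue in (start, end):
                if residue not in chain_residues:
                    errors.append(f"{name}: residue {residue} does not exist in the chain")
            valid.append((start, end, name))
    # Sweep over segments sorted by start, remembering the segment that
    # reaches furthest; any later segment starting at or before its end overlaps.
    furthest: Optional[Tuple[int, int, str]] = None
    for start, end, name in sorted(valid):
        if furthest is not None and start <= furthest[1]:
            errors.append(
                f"segments {furthest[0]}-{furthest[1]} ({furthest[2]}) "
                f"and {start}-{end} ({name}) overlap"
            )
        if furthest is None or end > furthest[1]:
            furthest = (start, end, name)
    return errors


# Hypothetical chain with residues 1..100 and a deliberately faulty definition.
chain_residues = set(range(1, 101))
faulty = {"I": [(1, 50)], "II": [(40, 120)], "III": [(90, 80)]}
for error in check_definition(faulty, chain_residues):
    print(error)
```

Collecting all messages instead of stopping at the first one matches the idea of reporting every problem to the annotator in a single pass.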


Uwe Dengler, 2000-10-16