Most of the database creation and update processes (sequence
comparison, clustering, structure comparison etc.) are performed
automatically. However, there are no automatic methods which are
able to define structural domains accurately and consistently in
all proteins. Multi-chain domains present an especially difficult
problem for automatic methods. Thus, defining domains had to be
carried out, in part, by eye or with reference to the literature.
Sequence alignment of chains within a sequence family allows some
domains to be defined automatically by similarity. As illustrated
in Figure 6, however, some of the automatic definitions
subsequently have to be checked by eye. To simplify this process,
HTTP-based client-server software was developed to control the
creation, editing and updating of domain definitions in the
database.
Figure 6: Domain definition (domain I: red; domain II: green)
for chain 1mela (top) of the
single-domain antibody [Desmyter et al., 1996]. The definition was created
automatically by sequence alignment with chain 1igch [Derrick and Wigley, 1994]
from the same sequence family (bottom). As the amino acid residues
of these immunoglobulin (Ig) molecules are dissimilar in one of
the hypervariable regions and the template 1igch consists of two
Ig domains, the sequences of 1mela and 1igch could not be aligned
properly and the automatic domain definition failed. Instead of
having two domains, the definition for 1mela should consist of a
single domain.
A web browser provides a good solution for maintaining the domain
definitions for five reasons: (1) the multi-line text entry forms
of the Hypertext Markup Language (HTML) allow data to be entered
and modified in the database; (2) it is easy to ``link'' to related
information that assists decisions on the correct domain
definition, for example, to see how corresponding domains have
been defined in a sequence-similar protein; (3) consistency and
error checking programs can be incorporated directly into the
domain definition process to prevent incorrect data from entering
the database; (4) domain definitions may be checked simultaneously
by different people in different locations; (5) once domain
definitions have been produced, they can be viewed and checked
using a RasMol [Sayle and Milner-White, 1995] WWW/PDB interface.
Figure 7 shows the main part of the chain
page for L-lactate dehydrogenase from Lactobacillus casei
(1llc) [Bühner and Hecht, 1987]. After general information and the
domain definition, there are links to alternative domain
definitions, the sequence family and a link to modify the domain
definitions. This is followed by links providing access to
PDB-related information such as the header of the PDB file and the
likely quaternary structure calculated by the PQS server
[Henrick and Thornton, 1998]. At the bottom of the page, other
sources of protein domain definitions can be accessed, i.e. the
CATH [Orengo et al., 1997], FSSP [Holm and Sander, 1998] and SCOP
[Murzin et al., 1995] protein structure classifications
(section of page not shown).
During a database update, a ``chain page'' can be accessed for
every chain in the database, either via a search engine or via
another web page listing all chains without a domain definition.
The chain page provides general information about a protein (e.g.
compound, author, resolution) and about the chain (e.g. start and
end residues, number of residues). Existing domain definitions may
be checked via a link to the RasMol interface (View domains) and
updated, or a new domain definition may be created.
If no domain definition exists, one can be typed in the text entry
form of the ``Domain Maker'' web page shown in
Figure 8. If a domain definition already
exists, it can either be edited directly by using the ``Domain
Editor'' shown in Figure 9 or it can be
updated using the ``Domain Updater'' page illustrated in
Figure 10. Changing domain definitions via
the Domain Updater has the advantage that existing definitions are
retained. If a domain definition is edited directly in the Domain
Editor, the domain identifier is not changed and the database
programs will assume that the domain definition has remained
unchanged. This is useful if only a minor correction has been made
to a definition. In the case of the Domain Maker and the Domain
Updater, the update scripts automatically derive a new identifier
from the initials of the annotator.
Several consistency and error checks are carried out before any
data are added to the database. The format of the data is checked
to verify that the contents have valid syntax, that there are no
duplicate identifiers, and that any identifiers specified as
default or equally valid domain definitions do indeed exist. Checks
are also carried out to ensure that the domain definitions
themselves are valid: start residues come before end residues,
there are no overlapping segments, and all residues mentioned do
exist.
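The segment checks listed above can be illustrated with a minimal sketch. It is not the software described here: the data layout (a dictionary of domain names mapping to inclusive start/end pairs), the function name `check_definition` and the use of plain integer residue numbers without insertion codes are all assumptions made for illustration. Only the end points of each segment are tested for existence.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start residue, end residue), inclusive; assumed layout


def check_definition(domains: Dict[str, List[Segment]],
                     chain_residues: List[int]) -> List[str]:
    """Return error messages; an empty list means all checks passed."""
    errors = []
    known = set(chain_residues)
    labelled = []  # (start, end, domain name), collected for the overlap test
    for name, segments in domains.items():
        for start, end in segments:
            # Check 1: the start residue must not come after the end residue.
            if start > end:
                errors.append(f"{name}: start {start} comes after end {end}")
                continue
            # Check 2: both end points must be residues of the chain.
            for residue in (start, end):
                if residue not in known:
                    errors.append(f"{name}: residue {residue} not in chain")
            labelled.append((start, end, name))
    # Check 3: no two segments may overlap. After sorting by start residue,
    # a segment overlaps an earlier one if it starts at or before the
    # furthest end seen so far.
    labelled.sort()
    furthest_end, furthest_name = None, None
    for start, end, name in labelled:
        if furthest_end is not None and start <= furthest_end:
            errors.append(
                f"{name}: segment {start}-{end} overlaps a segment of {furthest_name}")
        if furthest_end is None or end > furthest_end:
            furthest_end, furthest_name = end, name
    return errors
```

Returning a list of messages rather than raising on the first problem mirrors the idea of reporting all errors to the annotator in one pass.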
Since domain definitions are made chain by chain, multi-chain
domain definitions are repeated for every chain they contain.
Default or equally valid definitions of multi-chain domains are
compared for all chains that are part of the multi-chain domain.
Differences between the domain definitions are reported and the
annotator is requested to fix the problem. Errors that are not
fixed are detected by a checking script at the next stage of the
update process. To prevent this kind of error completely, newly
entered default or equally valid multi-chain domain definitions
are copied for all chains which are part of the definition and for
which no domain definition exists.
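As a rough illustration of the multi-chain handling described above, the following sketch compares the copies of one multi-chain domain stored for each member chain and copies a new definition to member chains that have none. The data layout, the chain identifiers and the function names `compare_copies` and `propagate` are hypothetical and chosen only for this example.

```python
from copy import deepcopy
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start residue, end residue), inclusive
# One multi-chain domain: chain identifier -> segments on that chain (assumed layout).
MultiChainDomain = Dict[str, List[Segment]]


def compare_copies(copies: Dict[str, MultiChainDomain]) -> List[str]:
    """Report chains whose stored copy differs from the first chain's copy."""
    chains = sorted(copies)
    if not chains:
        return []
    reference = copies[chains[0]]
    return [f"chain {chain} disagrees with chain {chains[0]}"
            for chain in chains[1:] if copies[chain] != reference]


def propagate(domain: MultiChainDomain,
              copies: Dict[str, MultiChainDomain]) -> None:
    """Copy a new definition to every member chain that has no copy yet."""
    for chain in domain:
        if chain not in copies:
            # Independent copy, so later edits to one chain do not
            # silently change the others.
            copies[chain] = deepcopy(domain)
```

For example, propagating a definition that spans hypothetical chains "A" and "B" when only "A" has a stored copy gives "B" an identical copy, after which `compare_copies` reports no differences.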
Some example error messages are shown at the top of Figure 11,
while the bottom shows a page indicating that the update was
successful. The consistency and error checking tools minimise the
human input necessary during a database update. Auto-checking and
easy visualisation of domains lower the chance of erroneous domain
definitions being added to the database.
Uwe Dengler,
2000-10-16