The biological community is well served by primary databases. For
example, the EMBL nucleotide sequence database
[Baker et al., 2000] and GenBank [Benson et al., 2000] collate
the growing volume of new nucleic acid sequences, while the
Protein Data Bank (PDB) [Bernstein et al., 1977, Berman et al., 2000]
gathers new protein and nucleic acid three-dimensional structures
into a common format. To make these primary data useful
for many different types of analysis, significant processing is
necessary. This processing may add extra information to the
primary data in the form of annotations, and it often also defines
binary or multi-way relationships in the data.
Relationships, e.g. an alignment of two or more sequences in a
database, are higher-level descriptions of the data that may
reference two or more database entries.
A database that includes relationships or other data not found in
the primary database is referred to as a derived, or second-level,
database. Examples of derived databases in molecular biology
include the PRINTS-S motif database
[Attwood et al., 2000, Attwood and Beck, 1994], the SCOP
[Murzin et al., 1995, Lo Conte et al., 2000], CATH
[Orengo et al., 1997, Pearl et al., 2000] and Dali/FSSP
[Holm and Sander, 1998] structural classification databases, the
HSSP alignment database [Sander and Schneider, 1991, Holm and Sander, 1999], the ProDom
protein domain families [Sonnhammer and Kahn, 1994, Corpet et al., 2000],
and Pfam, which contains multiple alignments and hidden Markov model
based profiles of protein domains
[Sonnhammer et al., 1997, Bateman et al., 2000].
The creation of a derived database can require considerable effort
in developing source code to process the data and in manual
checking of assignments, relationships and annotations. If the
primary databases were static, it would be relatively simple to
create a complex derived database. However, most primary databases
in biology are growing rapidly, so a derived database must be
re-created at regular intervals to remain comprehensive and up to
date. Ideally, a derived database would be updated automatically
whenever the primary database is modified. Unfortunately, this
ideal can be difficult to achieve, since even the simplest
derived databases normally require some manual intervention.
The task is then to minimise the degree of human input and to provide
tools that ease the human part of the update process.
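
The update problem described above can be illustrated with a minimal sketch. It is not the 3Dee implementation; it only assumes that each primary entry has an identifier and some checksum, and that each derived relationship records which primary entries it references. The names `Relationship`, `diff_snapshots` and `plan_update`, and all entry identifiers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A derived record that references two or more primary entries."""
    name: str
    entry_ids: frozenset


def diff_snapshots(old, new):
    """Compare two {entry_id: checksum} snapshots of a primary database."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    modified = {e for e in set(old) & set(new) if old[e] != new[e]}
    return added, removed, modified


def plan_update(relationships, old, new):
    """List derived records to recompute and new entries still to process."""
    added, removed, modified = diff_snapshots(old, new)
    stale = removed | modified
    # A relationship is stale if any entry it references was changed or removed.
    recompute = sorted(r.name for r in relationships if r.entry_ids & stale)
    return {"recompute": recompute, "new_entries": sorted(added)}


# Hypothetical snapshots: one entry modified, one removed, one added.
old = {"e1": "v1", "e2": "v1", "e3": "v1"}
new = {"e1": "v2", "e3": "v1", "e4": "v1"}
relationships = [
    Relationship("aln_a", frozenset({"e1", "e3"})),
    Relationship("aln_b", frozenset({"e2", "e3"})),
    Relationship("aln_c", frozenset({"e3"})),
]
plan = plan_update(relationships, old, new)
print(plan)  # {'recompute': ['aln_a', 'aln_b'], 'new_entries': ['e4']}
```

A plan of this kind only narrows down the work; the records it flags may still need the manual checking mentioned above.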
This report describes how these problems are handled by the 3Dee
database, which is a derived database in two senses: it contains a
set of definitions for protein structural domains
in the PDB, and it stores relationships between the domains in
the form of a hierarchy. Adding, removing, or modifying structures in
the primary database, the PDB, requires a complex set of
operations to be performed in order to update the derived data
and relationships in 3Dee. While this technical report is specific
to the problem of creating and maintaining a database of domains and
their hierarchy, knowledge of the problems overcome is of general
utility for setting up and maintaining derived databases of all
kinds.
Uwe Dengler,
2000-10-16