The design considerations for 3Dee were that it should be a
comprehensive database of structural domains (1); allow
alternative domain definitions for the same protein (2); organise
the domains in a structural hierarchy (3); contain non-redundant
set(s) of sequences and structures (4); store multiple structure
alignments for all domain structure families (5); include derived
information such as secondary structure definitions (6); be
straightforward to update (7) and allow previous versions of the
database to be regenerated (8). The major design challenges in the
face of a rapidly growing PDB were (7) and (8).
Figure 1 gives an overview of the stages in the
creation of the data in 3Dee. Sequence similarity between chains
allows chains to be clustered into sequence families (1), but some
chains in the same sequence family may have different numbers of
domains. For example, an immunoglobulin variable domain might be
paired with a chain containing both a variable and constant
domain. Accordingly, the sequence families are divided into
similar domain organisation families (2), i.e. chains with the
same number of equivalent domains. These chains are then split
into domain families (3). Representatives from each domain
family are clustered by sequence similarity to give domain
sequence families (4) which provide a set of representative
domains that are non-redundant by sequence. Finally, domain
structure families (5) cluster these representatives according to
the similarity of their three-dimensional structure. Different
thresholds of structural similarity give rise to a hierarchy of
structurally related domains. A detailed analysis of the data in
3Dee and a description of the various levels of the database has
been given recently [Dengler et al., 2000].
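The first two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chain identifiers, the similarity pairs and the domain counts are invented, single-linkage clustering stands in for whatever clustering procedure 3Dee actually applies, and chains are subdivided by domain count alone, whereas the text requires the same number of equivalent domains.

```python
from collections import defaultdict


def cluster_by_similarity(items, similar):
    """Single-linkage clustering: items linked by any chain of
    similar pairs end up in the same cluster (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if similar(a, b):
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for item in items:
        clusters[find(item)].append(item)
    return sorted(sorted(c) for c in clusters.values())


def split_by_domain_count(family, domain_count):
    """Subdivide one sequence family by the number of domains per chain."""
    groups = defaultdict(list)
    for chain in family:
        groups[domain_count[chain]].append(chain)
    return [sorted(groups[n]) for n in sorted(groups)]


# Hypothetical toy data: four chains, two pairwise sequence similarities.
chains = ["chainA", "chainB", "chainC", "chainD"]
similar_pairs = {frozenset(("chainA", "chainB")), frozenset(("chainB", "chainC"))}
domain_count = {"chainA": 1, "chainB": 2, "chainC": 1, "chainD": 1}

# Stage (1): sequence families.
families = cluster_by_similarity(
    chains, lambda a, b: frozenset((a, b)) in similar_pairs
)
# families == [["chainA", "chainB", "chainC"], ["chainD"]]

# Stage (2): similar domain organisation families within each sequence family.
organisation = [split_by_domain_count(f, domain_count) for f in families]
```

In the toy data, chainB falls in the same sequence family as chainA and chainC but has two domains, so stage (2) separates it from them, mirroring the immunoglobulin example above.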
Figure 1:
In this flow chart, different
shapes represent different domains. Domains with the same shape
have the same structure; domains with the same shape and fill
share structural and sequence similarity. The chart illustrates how
chains grouped in sequence families (1) are split into similar
domain organisation families (2), which are separated into
domains to give domain families (3). The domains in the domain
families are clustered by sequence to produce domain
sequence families (4) and then clustered by structure to form
domain structure families (5).
The process of creating and updating the database can be
summarised by dataflow diagrams
(Figures 2-4) that show the
relationship between a source of data and its repository or user.
Conventionally, items between two parallel solid lines are data
stores or sources, typically input/output devices, while those in
circles are processes that manipulate and transform the data. Such
diagrams also illustrate how processes and data are dependent on
one another. Figure 2 shows how various data
are extracted from the PDB files. PDBC data describe the chains in
a PDB file; PDBSEQ data, the amino acid sequences of the chains;
UNIQUE data, the residue numbering; and PDBINFO data, more
general information such as the number of amino acid and nucleic acid
residues, the number of residues with missing atoms and the method of
structure determination. In Figure 3 the
relationships between the data derived from the PDB files,
sequence families and the process of defining domains are
illustrated. After domains have been defined for all chains in the
database, similar domain organisation families and domain families
are created. Figure 4 shows the flow of data
from the domain families to the domain structure families via the
domain sequence families.
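The dependencies described by the dataflow diagrams can be written down as a small graph, from which a valid processing order and the set of data to regenerate after a change both follow. The sketch below is an illustration only: the edges are one reading of the text rather than a transcription of Figures 2-4 (in particular, the inputs assigned to "domain definitions" are an assumption), and it relies on the standard-library graphlib module of Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each data set maps to the data sets it is derived from.
# The edges are assumed from the prose, not copied from the figures.
DEPENDS_ON = {
    "PDBC": {"PDB files"},
    "PDBSEQ": {"PDB files"},
    "UNIQUE": {"PDB files"},
    "PDBINFO": {"PDB files"},
    "sequence families": {"PDBSEQ"},
    "domain definitions": {"PDBC", "UNIQUE", "sequence families"},
    "similar domain organisation families": {
        "sequence families",
        "domain definitions",
    },
    "domain families": {"similar domain organisation families"},
    "domain sequence families": {"domain families"},
    "domain structure families": {"domain sequence families"},
}


def build_order(depends_on):
    """Return the data sets in an order where every input precedes its outputs."""
    return list(TopologicalSorter(depends_on).static_order())


def needs_rebuild(changed, depends_on):
    """Return, in build order, every data set downstream of `changed`."""
    order = build_order(depends_on)
    stale = {changed}
    for node in order:
        # A node is stale as soon as any of its inputs is stale.
        if depends_on.get(node, set()) & stale:
            stale.add(node)
    return [node for node in order if node in stale and node != changed]


if __name__ == "__main__":
    print(build_order(DEPENDS_ON))
    print(needs_rebuild("domain families", DEPENDS_ON))
```

Under these assumed edges, a change to the domain families invalidates only the domain sequence families and the domain structure families, whereas a change to the PDB files invalidates everything.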
Currently, the 3Dee database is structured using the Unix file
system to organise formatted text files that contain all
information. These files are accessed via file-processing
applications written in Perl and C. The requirement that 3Dee
keep previous versions of the database accessible is met by
a simple, custom revision control system. However, revision
control systems developed for source code management, e.g. CVS
[Molli, 1998], may provide a more flexible way to track new
releases of the primary data, though at the cost of doubling the
disk space requirements.
The details of creating and updating the 3Dee database are
complex. This report describes solutions to two problems: adding
new data without regenerating the complete database, and
minimising the human input that remains essential for
consistent domain definitions.
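One way to picture both goals is the following Python sketch, which places a new chain into an existing sequence family where possible and flags it for manual attention only when no family matches. It is a hypothetical illustration, not the procedure used in 3Dee: the function add_chain, the chain identifiers and the similarity test are all invented, and the case of a new chain that links two existing families is ignored.

```python
def add_chain(families, new_chain, similar):
    """Add new_chain to the first family that contains a similar member.

    Returns (family_index, needs_manual_definition). A chain that joins an
    existing family is assumed to inherit that family's domain definition,
    so only chains founding a new family require human input.
    """
    for index, family in enumerate(families):
        if any(similar(new_chain, member) for member in family):
            family.append(new_chain)
            return index, False
    families.append([new_chain])
    return len(families) - 1, True


# Hypothetical existing families and similarity relation.
families = [["chainA", "chainB"], ["chainD"]]
similar_pairs = {frozenset(("chainE", "chainD"))}


def similar(a, b):
    return frozenset((a, b)) in similar_pairs


result_e = add_chain(families, "chainE", similar)  # joins the family of chainD
result_f = add_chain(families, "chainF", similar)  # founds a new family
```

Only the comparisons against existing members are computed for each new chain, so the existing families are left untouched and nothing already derived from them has to be regenerated.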
Uwe Dengler,
2000-10-16