Overview

The design considerations for 3Dee were that it should (1) be a comprehensive database of structural domains; (2) allow alternative domain definitions for the same protein; (3) organise the domains in a structural hierarchy; (4) contain non-redundant set(s) of sequences and structures; (5) store multiple structure alignments for all domain structure families; (6) include derived information such as secondary structure definitions; (7) be straightforward to update; and (8) allow previous versions of the database to be regenerated. The major design challenges in the face of a rapidly growing PDB were (7) and (8).

Figure 1 gives an overview of the stages in the creation of the data in 3Dee; the numbers (1)-(5) in the following description refer to the stages shown in that figure, not to the design considerations above. Sequence similarity between chains allows chains to be clustered into sequence families (1), but some chains in the same sequence family may have different numbers of domains. For example, a chain consisting of a single immunoglobulin variable domain might be grouped with a chain containing both a variable and a constant domain. Accordingly, the sequence families are divided into similar domain organisation families (2), i.e. chains with the same number of equivalent domains. These chains are then split into domain families (3). Representatives from each domain family are clustered by sequence similarity to give domain sequence families (4), which provide a set of representative domains that is non-redundant in sequence. Finally, domain structure families (5) cluster these representatives according to the similarity of their three-dimensional structures. Different thresholds of structural similarity give rise to a hierarchy of structurally related domains. A detailed analysis of the data in 3Dee and a description of the various levels of the database have been given recently [Dengler et al., 2000].
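As a rough illustration of stages (2) and (3), the following Python sketch splits one sequence family into similar domain organisation families and then into domain families. It is not the 3Dee implementation (which is written in Perl and C): the chain identifiers, the domain labels and both function names are invented for this example, and domains are assumed to be equivalent simply when they occupy the same position along the chain.

```python
from collections import defaultdict

# Hypothetical input: chain identifier -> list of domain labels in chain order.
# Identifiers and labels are invented for illustration only.
sequence_family = {
    "chainA": ["V"],
    "chainB": ["V", "C"],
    "chainC": ["V", "C"],
}


def split_by_domain_organisation(family):
    """Group the chains of one sequence family by their number of domains."""
    groups = defaultdict(list)
    for chain_id, domains in sorted(family.items()):
        groups[len(domains)].append(chain_id)
    return dict(groups)


def split_into_domain_families(family, chain_ids):
    """Collect the i-th domain of every chain into the i-th domain family.

    Assumption: domains at the same position in the chain are equivalent.
    """
    n_domains = len(family[chain_ids[0]])
    return [[(chain_id, i) for chain_id in chain_ids] for i in range(n_domains)]


# Stage (2): similar domain organisation families, keyed by domain count.
organisation_families = split_by_domain_organisation(sequence_family)

# Stage (3): one domain family per domain position within each group.
domain_families = {
    n: split_into_domain_families(sequence_family, chains)
    for n, chains in organisation_families.items()
}
```

Here the single-domain chain ends up in its own domain organisation family, while the two two-domain chains contribute one domain family per domain position.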

**Figure 1:** In this flow chart, different shapes represent different domains. Domains with the same shape have the same structure; domains with the same shape and fill share structural and sequence similarity. The chart illustrates how chains grouped in *sequence families* (1) are split into *similar domain organisation families* (2), which are separated into domains to give *domain families* (3). The domains in the *domain families* are clustered by sequence to produce *domain sequence families* (4) and then clustered by structure to form *domain structure families* (5).
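The way in which different similarity thresholds produce a hierarchy of families can be sketched in Python as follows. This section does not specify the clustering method used in 3Dee, so single-linkage clustering is used here purely as a stand-in; the domain names, the similarity scores and the thresholds are all invented for the example.

```python
from itertools import combinations


def single_linkage(items, score, threshold):
    """Cluster items so that two items share a cluster whenever they are
    connected by a chain of pairs scoring at least `threshold`."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


# Invented pairwise similarity scores between four hypothetical domains.
SCORES = {
    frozenset(("d1", "d2")): 0.9,
    frozenset(("d2", "d3")): 0.6,
    frozenset(("d1", "d3")): 0.5,
    frozenset(("d3", "d4")): 0.2,
}


def similarity(a, b):
    """Look up the score of a pair; unlisted pairs score 0.0."""
    return SCORES.get(frozenset((a, b)), 0.0)


domains = ["d1", "d2", "d3", "d4"]

# Progressively looser thresholds merge clusters into larger ones.
hierarchy = {t: single_linkage(domains, similarity, t) for t in (0.8, 0.5, 0.1)}
```

Because every pair linked at a strict threshold is also linked at a looser one, the clusters found at each level are nested inside those of the next, which is what makes the result a hierarchy.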

The process of creating and updating the database can be summarised by dataflow diagrams (Figures 2-4) that show the relationship between a source of data and its repository or user. Conventionally, items between two parallel solid lines are data stores or sources, typically input/output devices, while those in circles are processes that manipulate and transform the data. Such diagrams also illustrate how processes and data depend on one another. Figure 2 shows how various data are extracted from the PDB files: PDBC data describe the chains in a PDB file; PDBSEQ data, the amino acid sequences of the chains; UNIQUE data, the residue numbering; and PDBINFO data, more general information such as the number of amino acid and nucleic acid residues, the number of residues with missing atoms and the method of structure determination. Figure 3 illustrates the relationships between the data derived from the PDB files, the sequence families and the process of defining domains. After domains have been defined for all chains in the database, similar domain organisation families and domain families are created. Figure 4 shows the flow of data from the domain families to the domain structure families via the domain sequence families.

Currently, the 3Dee database is structured using the Unix file system to organise formatted text files that contain all information. These files are accessed via file-processing applications written in Perl and C. The requirement that 3Dee keep previous versions of the database accessible is met by a simple, custom revision control system. However, revision control systems developed for source code management, e.g. CVS [Molli, 1998], may provide a more flexible way to track new releases of the primary data, though with a doubling of disk space requirements.

The details of creating and updating the 3Dee database are complex. This report describes solutions to two problems: adding new data without regenerating the complete database, and minimising the human input that is essential to provide consistent domain definitions.
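To make the extraction step of Figure 2 more concrete, the following Python sketch derives a per-chain amino acid sequence from the SEQRES records of a PDB file, which is one plausible way of obtaining PDBSEQ-like data. It is an assumption that SEQRES records are the source; the actual 3Dee programs are written in Perl and C and are not reproduced here. The function names and the example records are invented, and residues outside the 20 standard amino acids are simply written as `X`.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_line(serial, chain_id, n_residues, residues):
    """Build one SEQRES record in the fixed-column PDB layout (example data only)."""
    return f"SEQRES {serial:3d} {chain_id} {n_residues:4d}  {' '.join(residues)}"


def chain_sequences(pdb_lines):
    """Return {chain ID: one-letter sequence} collected from SEQRES records."""
    sequences = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]               # column 12 holds the chain identifier
        residues = line[19:70].split()    # columns 20-70 hold residue names
        one_letter = "".join(THREE_TO_ONE.get(name, "X") for name in residues)
        sequences[chain_id] = sequences.get(chain_id, "") + one_letter
    return sequences


# Invented records for two short hypothetical chains.
example = [
    "HEADER    HYPOTHETICAL EXAMPLE",
    seqres_line(1, "A", 5, ["MET", "LYS", "VAL", "LEU", "ALA"]),
    seqres_line(1, "B", 3, ["GLY", "SER", "MSE"]),
]
sequences = chain_sequences(example)
```

Reading fixed columns rather than splitting on whitespace keeps the parser usable for files in which the chain identifier is blank.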

Uwe Dengler, 2000-10-16