
Introduction

The biological community is well served by primary databases. For example, the EMBL nucleotide sequence database [Baker et al., 2000] and GenBank [Benson et al., 2000] collate the growing number of new nucleic acid sequences, while the Protein Data Bank (PDB) [Bernstein et al., 1977, Berman et al., 2000] gathers new protein and nucleic acid three-dimensional structures into a common format. To make these primary data useful for many different types of analysis, significant processing is necessary. This processing may add extra information to the primary data in the form of annotations, but it often also includes the definition of binary or multi-way relationships in the data. Relationships, e.g. an alignment of two or more sequences in a database, are higher-level descriptions of the data that may reference two or more database entries. A database that includes relationships or other data not found in the primary database is referred to as a derived, or second-level, database. Examples of derived databases in molecular biology include the PRINTS-S motif database [Attwood et al., 2000, Attwood and Beck, 1994], the SCOP [Murzin et al., 1995, Lo Conte et al., 2000], CATH [Orengo et al., 1997, Pearl et al., 2000] and Dali/FSSP [Holm and Sander, 1998] structural classification databases, the HSSP alignment database [Sander and Schneider, 1991, Holm and Sander, 1999], the ProDom protein domain families [Sonnhammer and Kahn, 1994, Corpet et al., 2000], and Pfam, which contains multiple alignments and hidden Markov model based profiles of protein domains [Sonnhammer et al., 1997, Bateman et al., 2000].

The creation of a derived database can require considerable effort, both in developing source code to process the data and in manual checking of assignments, relationships and annotations. If the primary databases were static, it would be relatively simple to create a complex derived database. However, most primary databases in biology are growing rapidly, so a derived database must be re-created at regular intervals to remain comprehensive and up to date. Ideally, a derived database would be updated automatically whenever the primary database was modified. Unfortunately, this ideal can be difficult to achieve, since even the simplest derived databases normally require some manual intervention. The task is then to minimise the degree of human input and to provide tools that ease the human part of the update process.

This report describes how these problems are handled by the 3Dee database, which is a derived database in two senses: it contains a set of definitions for protein structural domains in the PDB, and it stores relationships between the domains in the form of a hierarchy. Adding, removing, or modifying structures in the primary database, the PDB, requires a complex set of operations to be performed in order to update the derived data and relationships in 3Dee. While this technical report is specific to the problem of creating and maintaining a database of domains and their hierarchy, knowledge of the problems overcome is of general utility for setting up and maintaining derived databases of all kinds.


Uwe Dengler, 2000-10-16