Comparative analysis of the three-dimensional protein structures determined by X-ray crystallography or NMR furthers our understanding of the factors influencing the native protein fold and suggests rules to guide the prediction of structure and function from the amino acid sequence. Traditionally, these analyses have been performed with bespoke Fortran programs that access flat files of coordinate data. Attention has focussed recently on using commercial relational database management systems (RDBMS) to store the coordinates and derived structural information (e.g. torsion angles, accessibilities and secondary structure), whilst using SQL to query the data [1][2]. However, both the traditional Fortran approach and the RDBMS approach are limited by the inflexibility of the data storage format and of the query language. Whilst in principle any query can be answered through a conventional programming language such as Fortran or C, the effort involved in coding the question can be formidable. Furthermore, if the first question leads on to another, then a similar programming project must be undertaken to answer the follow-up query. In contrast, an RDBMS allows simple queries of the data to be answered readily, without the need for complex programming. Unfortunately, the query language SQL can only represent simple tabular data structures, and the underlying relational model, though well suited to tables of names, addresses, wages and the like, does not cope well with the inherently sequential nature of protein structural data.
Gray et al. [3] have described an object-oriented database of protein structure. Their work explores the advantages of developing such a system in the logic programming language Prolog. The relational data structures and unification-based retrieval used by Prolog provide flexible ways of accessing structural data, and Prolog's symbolic (rule-based) programming style enables many aspects of structure analysis to be represented directly.
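To make the idea of retrieval by pattern matching concrete, the following is a minimal sketch written in Python as a stand-in for Prolog. It is not the system of Gray et al., nor the representation developed in this paper. The fact layout (`strand`, `helix`, `hairpin`), the protein identifiers, the residue numbers and the helper names `Var`, `match`, `solve` and `hairpin_strands` are all assumptions made for illustration. The matching is also one-way (patterns with variables against ground facts), which is a simplification of full unification.

```python
# Facts stored as tuples, loosely mimicking Prolog facts such as
# strand(Protein, Name, Start, End). All values are invented for illustration.
FACTS = [
    ("strand", "p1", "a", 3, 9),
    ("strand", "p1", "b", 14, 20),
    ("helix", "p1", "h1", 25, 38),
    ("hairpin", "p1", "a", "b"),
    ("strand", "p2", "a", 5, 11),
]


class Var:
    """A named variable, standing in for a Prolog logic variable."""

    def __init__(self, name):
        self.name = name


def match(pattern, fact, bindings):
    """Match a pattern against a ground fact (one-way matching only).

    Returns the extended bindings as a new dict, or None on mismatch.
    """
    if len(pattern) != len(fact):
        return None
    env = dict(bindings)
    for p, f in zip(pattern, fact):
        if isinstance(p, Var):
            if p.name in env:
                if env[p.name] != f:  # variable already bound to another value
                    return None
            else:
                env[p.name] = f       # bind the variable
        elif p != f:                  # constants must agree exactly
            return None
    return env


def solve(goals, bindings=None):
    """Yield every set of bindings that satisfies all goals in order,
    using depth-first search with backtracking over FACTS."""
    bindings = {} if bindings is None else bindings
    if not goals:
        yield bindings
        return
    first, rest = goals[0], goals[1:]
    for fact in FACTS:
        env = match(first, fact, bindings)
        if env is not None:
            yield from solve(rest, env)


def hairpin_strands():
    """Hypothetical query: each hairpin together with the residue ranges
    of its two strands, joined through the shared variables P, A and B."""
    P, A, B = Var("P"), Var("A"), Var("B")
    S1, E1, S2, E2 = Var("S1"), Var("E1"), Var("S2"), Var("E2")
    query = [
        ("hairpin", P, A, B),
        ("strand", P, A, S1, E1),
        ("strand", P, B, S2, E2),
    ]
    return list(solve(query))


if __name__ == "__main__":
    for solution in hairpin_strands():
        print(solution)
```

The same variable appearing in several goals acts as a join condition, so a follow-up question is asked by writing a new list of goals rather than a new program. In Prolog itself the facts, the variables and the search are supplied by the language, which is the convenience the paragraph above refers to.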
The work described in this paper on representing protein structures in Prolog developed out of successful initial investigations into the use of logic programming techniques to represent and search for topological motifs [4]. Subsequent studies have used symbolic logical descriptions of protein structures in protein structure prediction [5], as a common platform in the PAPAIN project to develop a knowledge-based environment for the interpretation of protein sequence data [6], and in the prediction of protein topology [7]. Here, we extend the original representation of protein structure, which was centered on protein topology, to include more detailed structural information. Particular attention has been paid to the representation of hydrogen bonding patterns and to the systematic classification of protein secondary structure using Kabsch and Sander's definitions [8].