
Discussion

In this paper we have described the use of Prolog to represent and manipulate protein structure, and illustrated the use of the system to refine the Kabsch and Sander definitions of β-structure. As it stands, the system is a practical tool that offers flexible access to structural information at all levels of the protein structural hierarchy. The system is fast enough to scan subsets of the database at the residue level. For simple queries, however, the time required to load each protein into Prolog (consultation) dominates the scan time. For example, scanning 94 proteins for the amino acid sequence Gly-Gly and returning the secondary structure summary and accessibility of the matching residues took 18 minutes, of which 94% was spent on consultation. Consulting all 525 proteins in the current databank at the residue level takes approximately 90 minutes, whilst scanning with all atoms would require approximately 5 hours of consultation time on a SPARCstation 1.
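A residue-level scan of the kind described above might be sketched as follows. The predicate names, argument order and the four toy facts are assumptions made purely for illustration; they are not the representation or data used by the system described in this paper.

```prolog
% Hypothetical residue-level facts (illustrative values, not real data):
% residue(Protein, Position, AminoAcid, SecondaryStructure, Accessibility).
residue(demo, 1, ala, h, 40).
residue(demo, 2, gly, t, 65).
residue(demo, 3, gly, t, 80).
residue(demo, 4, ser, e, 12).

% gly_gly(?Protein, ?Pos, -First, -Second)
% Succeeds once for every Gly at Pos that is followed by a Gly at Pos + 1,
% returning the secondary structure and accessibility of both residues.
gly_gly(Protein, Pos, ss_acc(SS1, Acc1), ss_acc(SS2, Acc2)) :-
    residue(Protein, Pos, gly, SS1, Acc1),
    Next is Pos + 1,
    residue(Protein, Next, gly, SS2, Acc2).
```

With these toy facts loaded, the query `?- gly_gly(demo, Pos, A, B).` has a single solution, at position 2. The query itself is cheap; in the timings quoted above it is loading the facts, not evaluating such a clause, that accounts for most of the scan time.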

There are a number of possible ways in which the consultation time could be reduced. The most common, and the one usually offered by Prolog vendors, is an interface to a relational database such as Oracle. These interfaces are described as providing a loose coupling to Prolog, since they effectively replace the standard file-based methods of retrieving Prolog facts (ground clauses) into the Prolog internal database. Access to SQL is provided from within Prolog, and these hybrid database/Prolog systems have been shown to be effective when Prolog is used to simplify access to an underlying database, or where database retrievals are infrequent. A number of examples of this approach can be found in [10] and [11].

Although the loosely coupled interface to an RDBMS provides an engineered solution to the problem of managing large collections of data from a Prolog programming environment, a much better solution is to use a tightly coupled approach. Tight coupling between logic programming languages and large storage management systems exists in the class of systems called deductive databases [12] or expert databases [13]. These systems are programming and data management systems based on principles of symbolic logic, and do not require the user to access data via a standard query language such as SQL. The database and the deductive engine co-exist using common storage and execution models. This is a much more satisfactory approach and, in our view, the one best suited to applications in protein sequence and structure analysis.

Object-oriented databases (OODBs) are perhaps a better known way to combine a computational paradigm with a database. OODBs adopt the object-oriented programming style: methods are specified on objects, and messages are passed to activate those methods, which manipulate data stored as properties of the objects. In OODBs the object classes, objects and methods are maintained in a persistent storage system. A good example of the use of OODBs in the domain of protein structure is that of Gray et al. [3], who implemented their OODB in Prolog augmented with a custom-built object storage module. Although OODBs and deductive databases aim to deliver similar functionality to the user, the deductive database approach is more suitable for the development of knowledge-based systems because both the data representation and the computational paradigm are based on well-founded theories of symbolic logic. Data and rules of deduction can also be freely intermixed. No equivalent theoretical basis exists for OODBs, which preserve the traditional distinction between data and program found in imperative programming languages.

Putting the database scanning problem to one side, the examples shown in this paper illustrate that Prolog is a useful tool for the analysis of protein structure. Once the relevant Prolog clauses have been loaded, queries regarding one protein can be evaluated in a few seconds. Indeed, it is possible to browse the protein structure, examining distances, angles, hydrogen bonds, etc., with a simplicity that would be difficult to rival by conventional programming means. Unfortunately, in order to take advantage of these benefits with the current implementation, one needs to learn Prolog, and even seasoned C or Fortran programmers usually find this a barrier. The provision of a toolkit of high-level functions specifically aimed at protein structure analysis (for example, torsion angle and distance calculation, or extraction of helices) greatly reduces this barrier. Alternatively, the use of higher level languages developed with a particular problem domain in mind (e.g. Daplex [3]) can ease the transition to Prolog-like systems. Ultimately, developments in graphical interfaces that allow inexperienced users access to data structures describing complex concepts such as protein topology [14] will provide flexible access to Prolog-level queries.
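As an indication of what such toolkit predicates could look like, the following is a minimal sketch of distance and torsion angle calculation. The v(X, Y, Z) coordinate terms and all predicate names are assumptions of this sketch, not the toolkit of the system described here. The torsion angle uses the standard atan2 formulation on the three bond vectors and is returned in degrees.

```prolog
% Coordinates are v(X, Y, Z) terms (an assumption of this sketch).
vec_sub(v(X1, Y1, Z1), v(X2, Y2, Z2), v(X, Y, Z)) :-
    X is X1 - X2, Y is Y1 - Y2, Z is Z1 - Z2.

vec_dot(v(X1, Y1, Z1), v(X2, Y2, Z2), D) :-
    D is X1*X2 + Y1*Y2 + Z1*Z2.

vec_cross(v(X1, Y1, Z1), v(X2, Y2, Z2), v(X, Y, Z)) :-
    X is Y1*Z2 - Z1*Y2,
    Y is Z1*X2 - X1*Z2,
    Z is X1*Y2 - Y1*X2.

vec_norm(V, N) :-
    vec_dot(V, V, D),
    N is sqrt(D).

% distance(+A, +B, -D): Euclidean distance between two points.
distance(A, B, D) :-
    vec_sub(A, B, V),
    vec_norm(V, D).

% torsion(+P1, +P2, +P3, +P4, -Degrees): torsion angle about the P2-P3 bond.
% Raises an evaluation error if the points are collinear (both atan2
% arguments are zero), so callers should guard against degenerate input.
torsion(P1, P2, P3, P4, Degrees) :-
    vec_sub(P2, P1, B1),
    vec_sub(P3, P2, B2),
    vec_sub(P4, P3, B3),
    vec_cross(B1, B2, N1),
    vec_cross(B2, B3, N2),
    vec_norm(B2, L2),
    vec_dot(B1, N2, S),
    vec_dot(N1, N2, C),
    Degrees is atan2(L2 * S, C) * 180 / pi.
```

For example, `?- distance(v(0,0,0), v(3,4,0), D).` gives D = 5.0. A user who has only these two predicates and the coordinate facts can already ask many structural questions without writing further Prolog.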

gjb@bioch.ox.ac.uk