\documentstyle[12pt,times]{article} \pagestyle{empty} \newcommand{\al}{\mbox{$\alpha$\,--~}} \newcommand{\be}{\mbox{$\beta$\,--~}} \newcommand{\tten}{\mbox{${\rm 3_{10}}$\,--~}} \newcommand{\ca}{\mbox{${\rm Ca^{2+}}$}} \newcommand{\ea}{\mbox{\em et al. \/}} \newcommand{\pla}{\mbox{${\rm PLA_{2}}$ }} \newcommand{\Cal}{\mbox{${\rm C}_{\alpha}$}} \begin{document} % End of preamble and beginning of text. \begin{bf} A PROLOG Approach to Analysing Protein Structure\\ \end{bf} \vskip 0.5in \begin{large} Geoffrey J. Barton\/$^{1,2}$ and Christopher J. Rawlings\/$^{2}$,\\ \vskip 0.5in $^{1}$University of Oxford; Laboratory of Molecular Biophysics\\ The Rex Richards Building; South Parks Road; Oxford OX1 3QU - UK (geoff@uk.ac.ox.biop)\\ and\\ $^{2}$Biomedical Computing Unit;Imperial Cancer Research Fund\\ Lincoln's Inn Fields;London \\ \vskip 0.25in \end{large} keywords: logic programming, PROLOG, protein structure, secondary structure, \\ topology, relational database, knowledge base \begin{abstract} This paper provides a detailed description of a database of protein structure implemented in the logic programming language PROLOG. The database allows flexible access to structural information at the atom, residue, secondary structure and topology levels of the protein structural hierarchy. An extended version of the Kabsch and Sander algorithm for secondary structure definition has been implemented in PROLOG, forming an integral part of the database. For protein structure analysis, the PROLOG system shows significant advantages in flexibility over conventional programming languages such as Fortran, and Relational Database Managements Systems using SQL. \end{abstract} \section{Introduction} Comparative analysis of the protein three-dimensional structures determined by X-ray crystallography or NMR, furthers our understanding of the factors influencing the native protein fold and suggests rules to guide the prediction of structure and function from the amino acid sequence. Traditionally, these analyses have been performed with bespoke Fortran programs that access flat files of coordinate data. Attention has focussed recently on using commercial Relational Database Management Systems (RDBMS) to store the coordinate and derived structural information (e.g. torsion angles, accessibilities, secondary structure, etc.) whilst making use of the SQL language to query the data \cite{islam89,huysmans91}. However, both traditional Fortran and RDBMS are limited by the inflexibility of the data storage format and the query language. Whilst in principal all queries can be answered through a conventional programming language such as Fortran or C, the effort involved in coding the question can be formidable. Furthermore, if the first question leads on to another, then a similar programming project must be undertaken to answer the follow-up query. In contrast, the RDBMS systems provide ready access to simple queries of the data without the need for complex programming. Unfortunately, the query language SQL can only represent simple tabular data structures, and the underlying relational model, though well suited to tables of names and addresses, wages etc., does not cope well with the naturally sequential protein structural data. Gray \ea \cite{gray90} have described an object-oriented database of protein structure. Their system explores the advantages of a system developed in the logic programming language Prolog. The relational data structures and unification based retrieval used by Prolog provide flexible ways of accessing structural data, and Prolog's symbolic (rule-based) programming style enables many aspects of structure analysis to be represented directly. Our work on representing protein structures in Prolog described in this paper has developed out of the successful initial investigations into using logic programming techniques to represent and search for topological motifs \cite{rawlings85}. Subsequent studies have made use of symbolic logical descriptions of protein structures during protein structure prediction \cite{clark90}, as a common platform in the PAPAIN project to develop a knowledge-based environment for the interpretation of protein sequence data \cite{Clark290}, and for the prediction of protein topology \cite{clark91}. Here, we extend the original representation of protein structure which was centered around protein topology, to include more detailed structural information. Particular attention has been paid to the representation of hydrogen bonding patterns, and the systematic classification of protein secondary structure using Kabsch and Sander's definitions \cite{DSSP}. \subsection{The Brookhaven Protein Data Bank} All protein structural data was extracted from the Brookhaven Protein Data Bank (PDB) \cite{brookhaven1}. The PDB acts as a repository for macromolecular structure information, primarily the results of X-ray crystallographic analysis. Each molecular structure (e.g. a protein structure) is stored in an individual flat file. The file entry contains the $x, y, z$ coordinates in a Cartesian coordinate system for every atom in the structure. Additional information regarding the experimental procedure used is also included together with the source of the biological material that has been analysed. \subsection{Introduction to Prolog, System, Syntax and Hardware} All programming was performed using the commercially available Quintus Prolog system (Release 2.4.2). General details of the Prolog language may be found in the book {\em Programming in Prolog} (Clocksin, W. F. and Mellish, C. S, Springer-Verlag, FRG, 1981). Briefly, Prolog stores facts and deductive rules in a database, where the facts are equivalent to a collection of relations in a relational database. The Prolog interpreter when presented with a query (or goal as it is often called) searches the database of facts and rules to see if any of the facts and rules match the query. If there are facts that match, then the variables in the query are unified with the constant symbols in the fact(s). If there is a rule whose consequent (the head of the rule) matches the query, then the interpreter proceeds by attempting to recursively prove the subgoals of the rule. Specific examples relevant to the protein structure information stored in our database are described below. The database system was originally developed on a Sun 3/50 workstation with 4MBytes of memory. The current installation is running on a Sun SPARCstation 1 with 12MBytes of memory. \subsection{Chain, Residue and Protein Naming Conventions} A protein coordinate entry (file) as listed in the Brookhaven Protein Data Bank (PDB) is uniquely identified by a 4 or 5 character code starting with an integer, for example, 1FB4 is the code for an immunoglobulin Fab fragment. However, any given entry may contain multiple protein chains, and these are usually the basic unit of protein structure. If more than one chain exists in an entry, the PDB assign a single character code to each chain. For example, the entry 1FB4 contains two chains labelled L and H. One often wishes to reference an individual chain rather than the whole protein entry. Accordingly, when converting the PDB to Prolog clauses, a two-level naming convention was adopted. Each Entry is identified by the PDB code (e.g. 1fb4) and each chain belonging to an entry is identified by the PDB code with the chain identifier code appended (e.g. 1fb4l). If an entry has a single chain that does not have a chain identifier, then the chain is simply identified by the PDB code alone. The PDB code and chain codes are linked by the Prolog clause {\tt chain/2} as described below. Residue numbers are represented by two-element lists (e.g. {\tt [37,a], [37,b], [38,-]}) in order to accommodate the insertion characters that allow homologous structures to be numbered in a manner consistent with the `parent' structure. For example, all serine proteinase enzymes (trypsin, elastase etc. ) are numbered according to the structure of chymotrypsinogen, which was the first member of the family to be solved by X-ray crystallography. Although this simplifies cross-referencing between different proteins in the same family, the flexibility creates difficulties when searching for residues $N$ amino acids before, or after the current position. \section{Simple Prolog Clauses Describing Brookhaven Entries} A Fortran program (BRKSEQ), reads the PDB files and processes the necessary information to describe each protein entry by up to eleven different types of Prolog clauses: \begin{verbatim} header(Ident,List). compnd(Ident,List). source(Ident,List). resolution(Ident,R). chain(Ident,Chcode). nchains(Ident,N). chain_range(Chcode,[Cstart,CstartIN],[Cend,CendIN]). chain_length(Chcode,Len). residues(Chcode). no_mainchain(Chcode). no_sidechains(Chcode). \end{verbatim} not all clauses need be present for a particular entry, as shown for the Immunoglobulin structure 2fb4. \begin{verbatim} header(2fb4,[immunoglobulin,18-apr-89,2fb4]). compnd(2fb4,[immunoglobulin,fab]). source(2fb4,[human,(homo,sapiens),myeloma,patient,kol,serum]). resolution(2fb4, 1.900). nchains(2fb4, 2). chain(2fb4,2fb4l). chain_range(2fb4l,[ 1,-],[ 214,-]). chain_length(2fb4l, 216). residues(2fb4l). chain(2fb4,2fb4h). chain_range(2fb4h,[ 1,-],[ 221,-]). chain_length(2fb4h, 229). residues(2fb4h). \end{verbatim} The {\tt header}, {\tt compnd}, {\tt source} and {\tt resolution} clauses are extracted directly from the information stored at the beginning of every PDB file. The {\tt Ident} is the PDB identification code for the protein (e.g. 9lyz), {\tt List} is a Prolog list containing textual information, and {\tt R} is the resolution of the structure in \AA ngstroms. The remaining seven clauses are derived from an analysis of the PDB ATOM records. The {\tt chain} clauses link the PDB identification code to the chain code {\tt Chcode} whilst {\tt nchains} simply lists how many chains are present in the PDB entry. The {\tt nchains} clause is included for simplicity, though is strictly unnecessary since a Prolog rule could be used to count the number of {\tt chain} clauses present for each protein. For every {\tt chain} clause, there is one {\tt chain\_range} clause which specifies the starting and ending residue numbers of the chain. Similarly, there is a {\tt chain\_length} clause that states the number of residues present in the chain (this clause is essential due to the alphanumeric residue numbering scheme used by the PDB). The {\tt residues} clause identifies a chain as having amino acid residues other than UNK (or X), whilst the presence of {\tt no\_mainchain} or {\tt no\_sidechains} clauses for a chain shows that the protein entry is incomplete (some PDB entries only contain mainchain, or \Cal atoms). \subsection{Using the PDB Information} There are 7102 clauses defining the November 1990 release of the Brookhaven PDB. These provide a very simple database that may be read directly into the Prolog system and interrogated by writing simple Prolog queries. For example, the following query, also known as a {\em goal} could be typed in at the Prolog prompt ({\tt | ?-}). \begin{verbatim} | ?- resolution(PID,R),R < 2, R > 0,chain(PID,CID),residues(CID). \end{verbatim} \vskip 0.25in The query will return the protein identifier (PID), chain identifier (CID) and crystallographic resolution (R) for each chain in the databank whose entry is less than 2 \AA\ resolution and has amino acid residues deposited. The Prolog interpreter attempts to satisfy this query as follows. Firstly Prolog looks in its database for facts called {\tt resolution}. The first fact is found and the variables PID and R unified with the arguments of the clause. The value of R is then tested to see if it is less than 2. If it is, then the test is made to see if R is greater than zero. If this succeeds, then a {\tt chain} clause is looked up in the database that unifies with the current value of PID. Finally, if the {\tt chain} clause is found, a {\tt residues} clause is looked up in the database that unifies with the value of CID. The query can fail at any stage. For example, if no {\tt residues} fact is found that contains the current CID, then the goal fails. Prolog then starts a process of {\em backtracking} to search for a possible solution. The interpreter would first look for another {\tt chain} fact. If present, then this would unify CID with the value shown in the {\tt chain} fact, again the database would be checked for a corresponding {\tt residues} clause. If all elements of the query succeed, then the values of PID,CID and R are displayed. The entire query may be forced to search for alternative solutions by typing a semicolon. For example, the following are the first three solutions to the query shown. \begin{verbatim} | ?- resolution(PID,R),R < 2, R > 0,chain(PID,CID),residues(CID). PID = CID = 1alc, R = 1.7 ; %First solution found - type ';' % to force backtracking PID = 1amt, R = 1.5, CID = 1amta ; %Second solution - type ';' again for next solution PID = 1amt, R = 1.5, CID = 1amtb % ... and so on ... \end{verbatim} If we often want to select protein chains by the criteria shown in this query, it is simple to build the query into a general purpose Prolog rule. For example, the rule called {\tt select\_chains}: \begin{verbatim} Comments start with a % select_chains(Rmin,Rmax,CID):- resolution(PID,R), % look up resolution R < Rmax, % resolution below Rmax R > Rmin, % resolution above Rmin chain(PID,CID), % find a chain identifier for this PID residues(CID). % check the chain has residues \end{verbatim} We can now type: \begin{verbatim} | ?- select_chains(0,2,CID). \end{verbatim} at the prolog prompt to find out which chains satisfy our criteria. Having established this new rule, we can then use it in further queries. For example: \begin{verbatim} | ?- select_chains(2,3,CID),chain_length(CID,Len),Len >= 150. \end{verbatim} will return the chain identifier and length for chains belonging to entries that are of between 2 and 3 \AA\ resolution and where the chain length is at least 150 amino acids. \subsection{Getting All Solutions} As stated, the example queries require interactive prompting by typing a semicolon, to return more than one solution. However, there are two methods by which Prolog can return all possible solutions to a particular query. The first is to put the built-in predicate (also called a procedure) {\tt fail.} at the end of the query. This forces backtracking whenever it is encountered. In order that the results of each solution to the query are recorded, some write statements must be added. For example: \begin{verbatim} | ?- select_chains(2,3,CID),chain_length(CID,Len),Len >= 150, write(CID),nl, write(Len),nl,fail. \end{verbatim} will print the values of chain identifier and length that satisfy the query. The {\tt nl} predicate simply places a carriage-return character after each write. Part of the output of this query looks like this: \begin{verbatim} 8at1c 310 8atca 310 8atcc 310 \end{verbatim} A more useful type of output would be to write out the results in a form that can be directly re-used by Prolog as a set of clauses. For example: \begin{verbatim} | ?- select_chains(2,3,CID),chain_length(CID,Len),Len >= 150, writeq(special_chains(CID,Len)),writeq(.),nl,fail. \end{verbatim} the {\tt writeq} predicate makes sure that all syntax is correctly observed. Part of the output resulting from this query is shown below and can be read back into Prolog for further processing: \begin{verbatim} special_chains(5acn,754). special_chains(5adh,374). special_chains(5at1a,310). special_chains(5at1c,310). special_chains(5tln,316). \end{verbatim} In this example, there is no real advantage in writing the results in this form since the original query is very quickly evaluated. However, for a complex query where Prolog may take several minutes to find all solutions, it makes sense to save intermediate results in this form. This is particularly true in the analysis of protein structure, where one question frequently leads to another and the results of the first question are necessary to solve the following question. An alternative way to return all solutions to a query is to make use of the Prolog built-in procedure {\tt bagof}. This is an extremely flexible procedure that returns the results of a query in the form of a Prolog list. For example, the previous query could be expressed as: \begin{verbatim} | ?- bagof(special_chains(CID,Len), (select_chains(2,3,CID),chain_length(CID,Len)),L) \end{verbatim} and will return the variable {\tt L} as a list containing {\tt special\_chains(CID,Len)} clauses. Normally, this is the method of choice since it gathers all solutions into an easily manipulated list structure. This example required 0.08 seconds to scan a database containing 501 entries on a Sun SPARCstation 1. \section{Derived Information} So far we have described an extremely simple database of protein structure-related information and shown how Prolog can be used to query the data. This simple database forms a hub of basic facts around which the more detailed structural data of our system are organised. Although PDB coordinate files contain the $x, y, z$ coordinates of individual atoms in a protein, one frequently does not require this level of detail. It is often sufficient to consider the structure at the level of the amino acid residues. For example, one may be interested in the secondary structure of a particular residue, (\al, \be or turn), the accessibility to solvent, the $\phi/\psi$ angles and so on. The Kabsch and Sander secondary structure definition program DSSP \cite{DSSP}, reads a Brookhaven coordinate file and generates a file of definitions for the secondary structure of the protein. In addition to defining the structural state of each residue, DSSP also generates a table containing the main chain hydrogen bonds, accessibility, $\phi/\psi$ angles and \Cal coordinates together with other relevant information. Accordingly, the table provides an invaluable reference source of residue-level information for the protein. Simply converting these tables to Prolog clauses allows the power of Prolog to be used in queries of the data. A sample of the table output by DSSP is shown here: \begin{tiny} \begin{verbatim} 1 1 A V 0 0 139 0, 0.0 2,-0.6 0, 0.0 130, 0.0 0.000 360.0 360.0 360.0 130.6 6.9 17.8 4.7 2 2 A L - 0 0 19 125,-0.1 122, 0.0 126,-0.1 5, 0.0 -0.711 360.0-142.7 -80.5 121.9 10.6 17.9 4.2 3 3 A S > - 0 0 44 -2,-0.6 4,-2.0 1,-0.1 -1, 0.0 -0.301 27.8-105.5 -71.3 164.8 12.3 19.8 7.0 4 4 A P H > S+ 0 0 98 0, 0.0 4,-2.4 0, 0.0 -1,-0.1 0.880 125.4 55.4 -62.3 -35.0 15.3 22.0 6.2 5 5 A A H > S+ 0 0 59 1,-0.2 4,-2.4 2,-0.2 -3, 0.0 0.907 104.9 52.7 -65.0 -41.5 17.5 19.4 7.7 6 6 A D H > S+ 0 0 15 1,-0.2 4,-2.3 2,-0.2 -1,-0.2 0.918 109.0 50.3 -56.3 -42.8 15.9 16.9 5.2 7 7 A K H X S+ 0 0 46 -4,-2.0 4,-2.5 1,-0.2 -2,-0.2 0.881 109.2 50.2 -64.7 -36.2 16.7 19.2 2.4 8 8 A T H X S+ 0 0 90 -4,-2.4 4,-2.0 1,-0.2 -1,-0.2 0.910 111.6 49.6 -63.3 -42.7 20.4 19.6 3.5 \end{verbatim} \end{tiny} The program BRKSAND converts this table format into Prolog clauses, one for each residue. For example: \begin{tiny} \begin{verbatim} ks( 1,2hhb,2hhba,[ 1,-],V,-,[-,-,-],[ 0, 0,-], 139,[ 0, 0.0],[ 2,-0.6],[ 0, 0.0],[ 130, 0.0],[ 0.000, 360.0, 360.0],[ 360.0, 130.6],[ 6.9, 17.8, 4.7]). ks( 2,2hhb,2hhba,[ 2,-],L,-,[n,-,-],[ 0, 0,-], 19,[ 125,-0.1],[ 122, 0.0],[ 126,-0.1],[ 5, 0.0],[-0.711, 360.0,-142.7],[ -80.5, 121.9],[ 10.6, 17.9, 4.2]). ks( 3,2hhb,2hhba,[ 3,-],S,-,[n,-,-],[ 0, 0,-], 44,[ -2,-0.6],[ 4,-2.0],[ 1,-0.1],[ -1, 0.0],[-0.301, 27.8,-105.5],[ -71.3, 164.8],[ 12.3, 19.8, 7.0]). ks( 4,2hhb,2hhba,[ 4,-],P,H,[p,-,-],[ 0, 0,-], 98,[ 0, 0.0],[ 4,-2.4],[ 0, 0.0],[ -1,-0.1],[ 0.880, 125.4, 55.4],[ -62.3, -35.0],[ 15.3, 22.0, 6.2]). ks( 5,2hhb,2hhba,[ 5,-],A,H,[p,-,-],[ 0, 0,-], 59,[ 1,-0.2],[ 4,-2.4],[ 2,-0.2],[ -3, 0.0],[ 0.907, 104.9, 52.7],[ -65.0, -41.5],[ 17.5, 19.4, 7.7]). ks( 6,2hhb,2hhba,[ 6,-],D,H,[p,-,-],[ 0, 0,-], 15,[ 1,-0.2],[ 4,-2.3],[ 2,-0.2],[ -1,-0.2],[ 0.918, 109.0, 50.3],[ -56.3, -42.8],[ 15.9, 16.9, 5.2]). ks( 7,2hhb,2hhba,[ 7,-],K,H,[p,-,-],[ 0, 0,-], 46,[ -4,-2.0],[ 4,-2.5],[ 1,-0.2],[ -2,-0.2],[ 0.881, 109.2, 50.2],[ -64.7, -36.2],[ 16.7, 19.2, 2.4]). ks( 8,2hhb,2hhba,[ 8,-],T,H,[p,-,-],[ 0, 0,-], \end{verbatim} \end{tiny} The general form of a {\tt ks} fact is as follows: \begin{verbatim} ks(N,PID,CID,RNUM,AA,SS,CLAD,BPSHEET,ACC,NHO1,ONH1,NHO2,ONH2,TKA,PHIPSI,XYZ). \end{verbatim} where: \begin{verbatim} N unique atom identifier in this entry (integer) PID Protein identifer code (character string) CID Chain identifer code ( " " ) RNUM residue number List: [integer,character insert code] AA amino acid code (Uppercase character - cys in bridge is lowercase) SS secondary structure (Uppercase character) CLAD [chirality,ladder1,ladder2] where chirality = n (-ve), p (+ve) ladder1/2 = uppercase character BPSHEET [BP1,BP2,SHeet] BP1/2 give values of N for bridges, Sheet is a character that labels the sheet ACC accessibility (Angstroms**2) (int) NHO1/2, ONH1/2 [Ndiff,energy] Ndiff is the difference in N value to the Hbond partner, energy is the energy (kcal/mol) (int,real) TKA Angles: [TCO,KAPPA,ALPHA] (All real numbers) PHIPSI [PHI,PSI] phi,psi angles (both real) XYZ [X,Y,Z] coordinates of this CA atom (all reals) \end{verbatim} For details of the specific structural meaning of each of these values, see Kabsch and Sander \cite{DSSP}. The first argument in the {\tt ks/16} clause is a unique numeric residue number, where the numbers run consecutively from the beginning of the protein entry. This numbering scheme overcomes the difficulties associated with the standard PDB numbering system, by allowing simple questions of the form, `what is the residue 4 amino acids N-terminal of residue 37 to be readily answered. If the {\tt ks} clauses for a protein are loaded into the Prolog system, then queries may be made. For example, to find the names of residues with accessibilities $< 100 {\rm \AA}^{2}$ and with positive $\phi$ angles one could type: \begin{verbatim} | ?- ks(_,_,_,_,AA,_,_,_,ACC,_,_,_,_,_,[PHI,_],_),ACC < 100, PHI >0. \end{verbatim} Clearly, this is a rather cumbersome interface to the data. In order to simplify this sort of query, a set of high level procedures have been written to allow different aspects of the data to be extracted. The procedure {\tt get/4} allows a particular field to be returned: \begin{verbatim} | ?- get(N,CID,Entry,Value) \end{verbatim} {\tt N} and {\tt CID} are the unique residue number and chain identifier. {\tt Entry} is one of the strings: {\tt ks\_number, protein\_identifier, residue\_number, amino\_acid, amino, accessibility, hbondNO1, hbondON1, hbondNO2, hbondON2, hbonds, tka, phipsi, xyz, x,y,z, sec}. These values follow the arguments of the {\tt ks} procedure, {\tt hbonds} causes backtracking through all four hbond types that are stored, whilst {\tt amino\_acid} converts the code letters for residues involved in disulphide bridges into the normal amino acid codes. {\tt Value} holds the returned value of the procedure. For example, to perform the same query as above we would type: \begin{small} \begin{verbatim} | ?- get(N,CID,accessibility,ACC),ACC < 100, get(N,CID,phipsi,[PHI,_]),PHI > 0. \end{verbatim} \end{small} which reads, `Look up an accessibilty value, if this value is less than 100 then look up the value of $\phi$ for the residue, if this is greater than 0, then display the result. Frequently, one requires a {\em range} of values that start and end at particular residues in the protein chain. The {\tt getR} procedure performs this task. For example: \begin{verbatim} | ?- getR(2,10,1fb4l,amino_acid,AA) \end{verbatim} which will return the value for {\tt AA} of: \begin{verbatim} AA = [S,V,L,T,Q,P,P,S,A] \end{verbatim} The {\tt getR} procedure can be used for far more complex queries. For example, the following query will find the sequence, accessibility and summary of secondary structure for all pentapeptides that have a mean accessibilty of less than $10 {\rm \AA}^{2}$. \begin{small} \begin{verbatim} CID = 1fb4l, %just set the chain to 1fb4l for now kchain_range(S,E,CID), %find the start (S) and end (E) residue numbers SM1 is S - 1, %find S - 1 number LST is E - 4, %find the last possible START for a pentapeptide next1(SM1,START,LST), %on backtracking this procedure returns START as %successive values from SM1 to LST. END is START + 4, %specify the last residue of the pentapeptide getR(START,END,CID,accessibility,ACC), %get the accessibilities for this %pentapeptide mean(MEAN,ACC), %find the mean accessibility MEAN < 10, %test if mean is < 10 for this pentapeptide, getR(START,END,CID,amino_acid,AA), %look up the amino acid sequence getR(START,END,CID,sec,SEC), % and summary of secondary structure write_list([CID,START,END,ACC,AA,SEC,MEAN]). %write out the information we want \end{verbatim} \end{small} This query produces the following output: \begin{verbatim} 1fb4l 32 36 [9,34,0,0,1] [I,T,V,N,W] [S,-,-,E,E] 8.79999 1fb4l 33 37 [34,0,0,1,0] [T,V,N,W,Y] [-,-,E,E,E] 7.0 1fb4l 34 38 [0,0,1,0,34] [V,N,W,Y,Q] [-,E,E,E,E] 7.0 1fb4l 35 39 [0,1,0,34,2] [N,W,Y,Q,Q] [E,E,E,E,E] 7.39999 1fb4l 72 76 [2,34,0,4,1] [A,S,L,A,I] [E,E,E,E,E] 8.2 1fb4l 85 89 [10,9,0,11,0] [S,D,Y,Y,C] [S,E,E,E,E] 6.0 1fb4l 86 90 [9,0,11,0,0] [D,Y,Y,C,A] [E,E,E,E,E] 4.0 1fb4l 87 91 [0,11,0,0,0] [Y,Y,C,A,S] [E,E,E,E,E] 2.2 1fb4l 88 92 [11,0,0,0,1] [Y,C,A,S,W] [E,E,E,E,E] 2.4 1fb4l 89 93 [0,0,0,1,0] [C,A,S,W,N] [E,E,E,E,E] 0.2 1fb4l 90 94 [0,0,1,0,34] [A,S,W,N,S] [E,E,E,E,T] 7.0 1fb4l 98 102 [14,5,11,11,2] [S,Y,V,F,G] [E,E,E,E,B] 8.59999 1fb4l 134 138 [3,7,0,0,0] [A,T,L,V,C] [E,E,E,E,E] 2.0 1fb4l 135 139 [7,0,0,0,2] [T,L,V,C,L] [E,E,E,E,E] 1.8 1fb4l 136 140 [0,0,0,2,0] [L,V,C,L,I] [E,E,E,E,E] 0.4 1fb4l 137 141 [0,0,2,0,13] [V,C,L,I,S] [E,E,E,E,E] 3.0 1fb4l 177 181 [2,0,1,0,5] [A,A,S,S,Y] [E,E,E,E,E] 1.6 1fb4l 178 182 [0,1,0,5,2] [A,S,S,Y,L] [E,E,E,E,E] 1.6 \end{verbatim} and required 18 seconds to complete on a Sun SPARCstation 1. Loading the protein into the Prolog system (consultation) required an additional 14 seconds. \section{Secondary Structure Definition} The Kabsch and Sander \cite{DSSP} secondary structure definition algorithm follows a strict hierarchy of structural sub-types. However, the {\tt ks} clauses store only the summary of the structure thus loosing the underlying information that goes to make up the defined secondary structure. For example, to define an \al helix, the DSSP algorithm first finds all main chain hydrogen bonds between residues $i$ and $i + 4$. These residues are then said to form a {\em four turn}. A {\em minimal helix} is then defined in terms of two overlapping four turns, finally a helix is defined as at least two overlapping minimal helices. When the secondary structural state of a residue is simplified to a single character such as `H', the reason for that residue being defined as `H' (i.e. hydrogen bonds, turns and minimal helices) is lost. Similarly, the possibility of a residue belonging to more than one structural class, for example both \tten and \al helix, is not allowed by the {\tt ks} summary. Prolog is a rule based language and the Kabsch and Sander definitions are a set of simple hierarchical rules. Accordingly, it is a relatively straightforward matter to implement the secondary structure definition rules in Prolog. Having the structure definitions expressed as Prolog rules enables queries to be formulated at each level of the definition hierarchy. \subsection{Helix definitions} In order to illustrate the definition of secondary structure using Kabsch and Sander's rules the complete Prolog code for the definition of helix structure is shown here. The basic building block for the definition of a helix is the {\tt kturn} rule. This looks up the hydrogen bonding information stored in the {\tt ks} clauses and if an $i, i + n$ pattern with an energy of at least -0.5 KCal is seen, then a turn is defined. \begin{verbatim} /* kturn rule*/ kturn(N,N3,CID,Type):- ks(N,_,CID,_,_,_,_,_,_,_,[B1,E1],_,[B2,E2],_,_,_), turn_type(NN,Type), N3 is N + NN, B1A is N + B1, B2A is N + B2, check_bond(N3,[B1A,E1],[B2A,E2]). /* turn_type facts */ turn_type(3,three_turn). turn_type(4,four_turn). turn_type(5,five_turn). turn_type(6,six_turn). /* check_bond rules */ check_bond(N3,[N3,E1],_):- E1 =< -0.5. check_bond(N3,_,[N3,E2]):- E2 =< -0.5. \end{verbatim} The {\tt kturn} rule operates by first looking up the unique residue number, offset and energy of the two hydrogen bonds listed as O-HN hydrogen bonds for that residue. A turn type is then looked up (e.g. three\_turn) to obtain the required offset for a hydrogen bond (in this example, 3). The {\tt check\_bond} rule is then used to see if a hydrogen bond exists between the current residue and one three further down the polypeptide chain. Given the {\tt kturn}, the definition of a minimal helix is readily expressed as: \begin{verbatim} minimal_helix(N,N2,CID,Type):- kturn(N,Nend2,CID,Type), NM1 is N - 1, kturn(NM1,_,CID,Type), N2 is Nend2 - 1. \end{verbatim} Prolog rules then define the start and end points of a helix: \begin{verbatim} helix_start(N,CID,Type):- minimal_helix(N,_,CID,Type), NM1 is N - 1, \+ in_minimal_helix(NM1,CID,Type). % \+ means 'not' helix_end(N,CID,Type):- minimal_helix(_,N,CID,Type), NP1 is N + 1, \+ in_minimal_helix(NP1,CID,Type). in_minimal_helix(N,CID,Type):- nonvar(N), %must call with N instantiated minimal_helix(N1,N2,CID,Type), %i.e. N must have a value before N >= N1, % the call. N =< N2, !. %succeed only once. \end{verbatim} The rule {\tt in\_minimal\_helix} succeeds once if the residue number is found within a minimal helix. The {\tt helix\_start} rule reads `residue $N$ is the first residue in a helix if it is the first residue in a minimal helix, and residue $N - 1$ is not in a minimal helix'. Similarly for the {\tt helix\_end} rule. Three {\tt helix\_type} facts link the different types of turn with \tten, \al and five- helix names. \begin{verbatim} helix_type(three_turn,three_ten). helix_type(four_turn,alpha). helix_type(five_turn,five). \end{verbatim} Finally, we can write the {\tt helix} rule by making use of the {\tt helix\_start} and {\tt helix\_end} rules. \begin{verbatim} helix(N1,N2,CID,Htype):- helix_start(N1,CID,Type), find_helix_end(N1,CID,Type,N2), helix_type(Type,Htype). find_helix_end(N,CID,Type,N2):- helix_end(N2,CID,Type), N2 > N, !. \end{verbatim} The {\tt helix} rule reads `find the start of a helix, find the first end of helix that follows this start, then look up the helix type.' The Prolog rules for helix, make up the computer program that defines helix structures. However, a feature of Prolog is that any rule can be replaced by a collection of facts. For example, the {\tt kturn} rule can be replaced or augmented by facts that are specific to a particular protein: \begin{verbatim} kturn(416,419,1fb4h,three_turn). kturn(418,421,1fb4h,three_turn). kturn(418,422,1fb4h,four_turn). kturn(430,434,1fb4h,four_turn). kturn(430,435,1fb4h,five_turn). kturn(431,434,1fb4h,three_turn). \end{verbatim} these facts may then be used in exactly the same way as the general purpose rule with the same name and number of arguments. In this way, if a rule is particularly time consuming to evaluate, it need only be evaluated once for the protein, then stored as a set of facts. Other rules that subsequently make use of the rule need only look up the corresponding facts in the database, rather than repeat the time-consuming rule evaluation. This principal is similar to storing intermediary results in a conventional Fortran or C program. The difference in Prolog is that the routines that access the pre-calculated data are identical to those that access the routines that initially calculated the data. \subsection{Beta Structure Definitions} Kabsch and Sander define all beta structure in terms of `bridges' which are either parallel or antiparallel. Where two or more bridges of the same type are consecutive, the structure is termed a ladder. Finally, overlapping ladders are amalgamated into sheets. Additional complications arise because ladders may have discontinuities in them, and ladders may consists of just a single bridge. These aspects of protein structure make the coding of beta-structure in Prolog a little less straightforward than for helix. In finding the sheets in a protein, the following steps are performed: \begin{itemize} \item Define the parallel and antiparallel bridges. \item Define start and end points of ladders. \item Build ladders of that consist of two or more consecutive bridges. \item Identify ladders that consist of a single bridge (bridge\_ladder). \item Locate ladder pairs that are linked by a \be bulge. \item Identify \be sheets in terms of the ladders that make them up. \item Identify \be strands by examining the sheets. \end{itemize} The general Prolog rules developed for \be structure are: \begin{verbatim} bridge(I-J,CID1,CID2,Btype) ladder_start(A-B,CID1,CID2,Btype) ladder_end(A-B,CID1,CID2,Btype) ladder(BridgeList,CID1,CID2,Ltype) bridge_ladder(ladder([A-B],CID1,CID2,Type)) bulge_linked_ladders(Ladder1,Ladder2) l_sheet(PID,LadderList,Type) strand(NumberList,CID) s_sheet(PID,StrandList,Type) \end{verbatim} Where {\tt I-J} is a pair of residue identifiers from chains {\tt CID1, CID2} respectively. {\tt Btype} is the bridge type, either {\tt parallel} or {\tt antiparallel}. {\tt BridgeList} is a Prolog List containing {\tt bridge/4} facts. Similarly, LadderList is a list of {\tt ladder/4} definitions, whilst NumberList is a list of residue numbers. The {\tt bulge\_linked\_ladders/2} rule identifies two ladders that are linked by a \be bulge. Examples of these facts specific to the protein 1fb4 are shown here: \begin{small} \begin{verbatim} bridge(200-205,1fb4l,1fb4l,antiparallel) bridge(228-340,1fb4h,1fb4h,parallel) ladder([4-101,4-102,4-103],1fb4l,1fb4l,parallel) ladder([35-90,36-89,37-88,38-87,39-86],1fb4l,1fb4l,antiparallel) l_sheet(1fb4, [ladder([4-101,4-102,4-103],1fb4l,1fb4l,parallel), ladder([8-106,9-106,10-107,11-108,12-109,12-110],1fb4l,1fb4l,parallel), ladder([35-90,36-89,37-88,38-87,39-86],1fb4l,1fb4l,antiparallel), ladder([36-49,37-47,38-46],1fb4l,1fb4l,antiparallel), ladder([85-107,86-106,87-105],1fb4l,1fb4l,antiparallel), ladder([89-102,90-101,91-100,92-99,93-98],1fb4l,1fb4l,antiparallel) ],mixed) s_sheet(1fb4, [strand([4],1fb4l), strand([8,9,10,11,12],1fb4l), strand([35,36,37,38,39],1fb4l), strand([46,47,48,49],1fb4l), strand([85,86,87,88,89,90,91,92,93],1fb4l), strand([98,99,100,101,102,103],1fb4l), strand([105,106,107,108,109,110],1fb4l) ],mixed) \end{verbatim} \end{small} Two alternative representations of a \be sheet are shown. The first consists of the list of {\tt ladder} clauses that are used in the definition of the sheet. This representation immediately shows which residues are involved in the sheet and what their hydrogen bonding partners are. The sheet type is also defined as one of pure parallel, pure antiparallel, or mixed, depending upon the type of ladders in the sheet. Having established the sheet definition, it is possible to define the \be strands that make up the sheet and consequently to define the second alternative sheet fact. This shows a list of strand definitions in place of the ladder definitions. The time required to calculate all secondary structure definitions for a protein is dependent upon the number of residues present, and the total secondary structure content. Some typical examples are: 1fb4 (Immunoglobulin) a $\beta$ protein of 445 residues in total takes 50 seconds (including consultation time); 1mbn (Myoglobin) an all $\alpha$ protein of 153 residues takes 25 seconds on a SPARCstation 1. \section{Interface to TOPOL Topology Reasoning Rules} The TOPOL system for reasoning about protein topology in Prolog \cite{rawlings85} makes use of the secondary structure definitions as deposited in the Brookhaven Protein Data Bank. In order that the TOPOL rules may be applied to Kabsch and Sander derived secondary structure definitions and to allow access to the angle and accessibility information, several rules were developed. Kabsch and Sander helix definitions allow a residue to belong to more than one type of helix. Thus, it often occurs that a region of \al helix will overlap with a region of \tten helix at the C-terminal end of the \al helix. TOPOL expects residues to belong to only one secondary structure. Accordingly, where overlaps occur, the helix definitions are compressed into a single {\tt thelix} definition. For example: \begin{verbatim} thelix(186,191,1fb4l,[ helix(189,191,1fb4l,three_ten), helix(186,191,1fb4l,alpha) ]) \end{verbatim} shows the starting and ending residue numbers for the concatenated helix, the chain identifier, then a list of the helix definitions that overlap. Strands are simply identified by their start and end residue numbers, rather than a list of all residues in the strand. In addition, regions of polypeptide that are neither in strand, nor helix are defined as the structure {\tt tloop}. The TOPOL clauses {\tt follows/2}, {\tt is\_parallel\_to/2} and {\tt is\_antiparallel\_to/2} are then defined in terms of the tstrand, tloop and thelix clauses. For example: \begin{verbatim} follows(tstrand(175,184,1fb4l),tloop(172,174,1fb4l)) \end{verbatim} specifies that the given {\tt tstrand} follows the {\tt tloop} in the structure, as will be self evident from the residue numbers. The full interface to TOPOL includes calls from Prolog to Fortran routines to fit straight lines through helices and strands and to calculate overlaps, distances and angles. The details of this interface are under further development and will be described elsewhere. \section{Extensions to Kabsch and Sander Secondary Structure Definitions} \subsection{Extending \be strands} The basic Kabsch and Sander definitions for beta structure require that a residue must be involved in two hydrogen bonds (or bordered by residues involved in two hydrogen bonds) in order to be classified as in a `bridge'. Only residues in bridges can be built into ladders, and hence into sheets. However, residues frequently make a single hydrogen bond at the end of a ladder and would traditionally be considered as part of the sheet. The {\tt ladder\_extension} clause: \begin{verbatim} ladder_extension(L2,ladder(L1,CID1,CID2,Type)). \end{verbatim} succeeds if the ladder may be extended to either the `left' or `right' end. L2 is the list L1 with the additional hydrogen bonded residue pair appended to the appropriate end or ends. When Type $=$ antiparallel, the extension is straightforward, eg: 20-119,21-118 might be extended to 19-120,20-119,21-118, or similarly at the right hand end. However, when Type $=$ parallel, there is a proble