So far we have described an extremely simple database of protein structure-related information and shown how Prolog can be used to query the data. This simple database forms a hub of basic facts around which the more detailed structural data of our system are organised.
Although PDB coordinate files contain the coordinates of individual
atoms in a protein, one frequently does not require this level of detail. It
is often sufficient to consider the structure at the level of the amino acid
residues. For example, one may be interested in the secondary structure of a
particular residue, (
- ,
- or turn), the accessibility to solvent, the
angles and so on. The Kabsch and Sander secondary structure
definition program DSSP [8], reads a Brookhaven coordinate file and
generates a file of definitions for the secondary structure of the protein. In
addition to defining the structural state of each residue, DSSP also generates
a table containing the main chain hydrogen bonds, accessibility,
angles and
coordinates together with other relevant information.
Accordingly, the table provides an invaluable reference source of residue-level
information for the protein. Simply converting these tables to Prolog clauses
allows the power of Prolog to be used in queries of the data. A sample of the
table output by DSSP is shown here:
1 1 A V 0 0 139 0, 0.0 2,-0.6 0, 0.0 130, 0.0 0.000 360.0 360.0 360.0 130.6 6.9 17.8 4.7 2 2 A L - 0 0 19 125,-0.1 122, 0.0 126,-0.1 5, 0.0 -0.711 360.0-142.7 -80.5 121.9 10.6 17.9 4.2 3 3 A S > - 0 0 44 -2,-0.6 4,-2.0 1,-0.1 -1, 0.0 -0.301 27.8-105.5 -71.3 164.8 12.3 19.8 7.0 4 4 A P H > S+ 0 0 98 0, 0.0 4,-2.4 0, 0.0 -1,-0.1 0.880 125.4 55.4 -62.3 -35.0 15.3 22.0 6.2 5 5 A A H > S+ 0 0 59 1,-0.2 4,-2.4 2,-0.2 -3, 0.0 0.907 104.9 52.7 -65.0 -41.5 17.5 19.4 7.7 6 6 A D H > S+ 0 0 15 1,-0.2 4,-2.3 2,-0.2 -1,-0.2 0.918 109.0 50.3 -56.3 -42.8 15.9 16.9 5.2 7 7 A K H X S+ 0 0 46 -4,-2.0 4,-2.5 1,-0.2 -2,-0.2 0.881 109.2 50.2 -64.7 -36.2 16.7 19.2 2.4 8 8 A T H X S+ 0 0 90 -4,-2.4 4,-2.0 1,-0.2 -1,-0.2 0.910 111.6 49.6 -63.3 -42.7 20.4 19.6 3.5
The program BRKSAND converts this table format into Prolog clauses, one for each residue. For example:
ks( 1,2hhb,2hhba,[ 1,-],V,-,[-,-,-],[ 0, 0,-], 139,[ 0, 0.0],[ 2,-0.6],[ 0, 0.0],[ 130, 0.0],[ 0.000, 360.0, 360.0],[ 360.0, 130.6],[ 6.9, 17.8, 4.7]). ks( 2,2hhb,2hhba,[ 2,-],L,-,[n,-,-],[ 0, 0,-], 19,[ 125,-0.1],[ 122, 0.0],[ 126,-0.1],[ 5, 0.0],[-0.711, 360.0,-142.7],[ -80.5, 121.9],[ 10.6, 17.9, 4.2]). ks( 3,2hhb,2hhba,[ 3,-],S,-,[n,-,-],[ 0, 0,-], 44,[ -2,-0.6],[ 4,-2.0],[ 1,-0.1],[ -1, 0.0],[-0.301, 27.8,-105.5],[ -71.3, 164.8],[ 12.3, 19.8, 7.0]). ks( 4,2hhb,2hhba,[ 4,-],P,H,[p,-,-],[ 0, 0,-], 98,[ 0, 0.0],[ 4,-2.4],[ 0, 0.0],[ -1,-0.1],[ 0.880, 125.4, 55.4],[ -62.3, -35.0],[ 15.3, 22.0, 6.2]). ks( 5,2hhb,2hhba,[ 5,-],A,H,[p,-,-],[ 0, 0,-], 59,[ 1,-0.2],[ 4,-2.4],[ 2,-0.2],[ -3, 0.0],[ 0.907, 104.9, 52.7],[ -65.0, -41.5],[ 17.5, 19.4, 7.7]). ks( 6,2hhb,2hhba,[ 6,-],D,H,[p,-,-],[ 0, 0,-], 15,[ 1,-0.2],[ 4,-2.3],[ 2,-0.2],[ -1,-0.2],[ 0.918, 109.0, 50.3],[ -56.3, -42.8],[ 15.9, 16.9, 5.2]). ks( 7,2hhb,2hhba,[ 7,-],K,H,[p,-,-],[ 0, 0,-], 46,[ -4,-2.0],[ 4,-2.5],[ 1,-0.2],[ -2,-0.2],[ 0.881, 109.2, 50.2],[ -64.7, -36.2],[ 16.7, 19.2, 2.4]). ks( 8,2hhb,2hhba,[ 8,-],T,H,[p,-,-],[ 0, 0,-],
The general form of a ks fact is as follows:
ks(N,PID,CID,RNUM,AA,SS,CLAD,BPSHEET,ACC,NHO1,ONH1,NHO2,ONH2,TKA,PHIPSI,XYZ).
where:
N unique atom identifier in this entry (integer) PID Protein identifer code (character string) CID Chain identifer code ( " " ) RNUM residue number List: [integer,character insert code] AA amino acid code (Uppercase character - cys in bridge is lowercase) SS secondary structure (Uppercase character) CLAD [chirality,ladder1,ladder2] where chirality = n (-ve), p (+ve) ladder1/2 = uppercase character BPSHEET [BP1,BP2,SHeet] BP1/2 give values of N for bridges, Sheet is a character that labels the sheet ACC accessibility (Angstroms**2) (int) NHO1/2, ONH1/2 [Ndiff,energy] Ndiff is the difference in N value to the Hbond partner, energy is the energy (kcal/mol) (int,real) TKA Angles: [TCO,KAPPA,ALPHA] (All real numbers) PHIPSI [PHI,PSI] phi,psi angles (both real) XYZ [X,Y,Z] coordinates of this CA atom (all reals)
For details of the specific structural meaning of each of these values, see Kabsch and Sander [8].
The first argument in the ks/16 clause is a unique numeric residue number, where the numbers run consecutively from the beginning of the protein entry. This numbering scheme overcomes the difficulties associated with the standard PDB numbering system, by allowing simple questions of the form, `what is the residue 4 amino acids N-terminal of residue 37 to be readily answered.
If the ks clauses for a protein are loaded into the Prolog system, then
queries may be made. For example, to find the names of residues with
accessibilities and with positive
angles one could
type:
| ?- ks(_,_,_,_,AA,_,_,_,ACC,_,_,_,_,_,[PHI,_],_),ACC < 100, PHI >0.
Clearly, this is a rather cumbersome interface to the data. In order to simplify this sort of query, a set of high level procedures have been written to allow different aspects of the data to be extracted. The procedure get/4 allows a particular field to be returned:
| ?- get(N,CID,Entry,Value)
N and CID are the unique residue number and chain identifier. Entry is one of the strings: ks_number, protein_identifier, residue_number, amino_acid, amino, accessibility, hbondNO1, hbondON1, hbondNO2, hbondON2, hbonds, tka, phipsi, xyz, x,y,z, sec. These values follow the arguments of the ks procedure, hbonds causes backtracking through all four hbond types that are stored, whilst amino_acid converts the code letters for residues involved in disulphide bridges into the normal amino acid codes. Value holds the returned value of the procedure.
For example, to perform the same query as above we would type:
| ?- get(N,CID,accessibility,ACC),ACC < 100, get(N,CID,phipsi,[PHI,_]),PHI > 0.
which reads, `Look up an accessibilty value, if this value is less than 100
then look up the value of for the residue, if this is greater than 0,
then display the result.
Frequently, one requires a range of values that start and end at particular residues in the protein chain. The getR procedure performs this task. For example:
| ?- getR(2,10,1fb4l,amino_acid,AA)
which will return the value for AA of:
AA = [S,V,L,T,Q,P,P,S,A]
The getR procedure can be used for far more complex queries. For
example, the following query will find the sequence, accessibility and summary
of secondary structure for all pentapeptides that have a mean accessibilty of
less than .
CID = 1fb4l, %just set the chain to 1fb4l for now kchain_range(S,E,CID), %find the start (S) and end (E) residue numbers SM1 is S - 1, %find S - 1 number LST is E - 4, %find the last possible START for a pentapeptide next1(SM1,START,LST), %on backtracking this procedure returns START as %successive values from SM1 to LST. END is START + 4, %specify the last residue of the pentapeptide getR(START,END,CID,accessibility,ACC), %get the accessibilities for this %pentapeptide mean(MEAN,ACC), %find the mean accessibility MEAN < 10, %test if mean is < 10 for this pentapeptide, getR(START,END,CID,amino_acid,AA), %look up the amino acid sequence getR(START,END,CID,sec,SEC), % and summary of secondary structure write_list([CID,START,END,ACC,AA,SEC,MEAN]). %write out the information we want
This query produces the following output:
1fb4l 32 36 [9,34,0,0,1] [I,T,V,N,W] [S,-,-,E,E] 8.79999 1fb4l 33 37 [34,0,0,1,0] [T,V,N,W,Y] [-,-,E,E,E] 7.0 1fb4l 34 38 [0,0,1,0,34] [V,N,W,Y,Q] [-,E,E,E,E] 7.0 1fb4l 35 39 [0,1,0,34,2] [N,W,Y,Q,Q] [E,E,E,E,E] 7.39999 1fb4l 72 76 [2,34,0,4,1] [A,S,L,A,I] [E,E,E,E,E] 8.2 1fb4l 85 89 [10,9,0,11,0] [S,D,Y,Y,C] [S,E,E,E,E] 6.0 1fb4l 86 90 [9,0,11,0,0] [D,Y,Y,C,A] [E,E,E,E,E] 4.0 1fb4l 87 91 [0,11,0,0,0] [Y,Y,C,A,S] [E,E,E,E,E] 2.2 1fb4l 88 92 [11,0,0,0,1] [Y,C,A,S,W] [E,E,E,E,E] 2.4 1fb4l 89 93 [0,0,0,1,0] [C,A,S,W,N] [E,E,E,E,E] 0.2 1fb4l 90 94 [0,0,1,0,34] [A,S,W,N,S] [E,E,E,E,T] 7.0 1fb4l 98 102 [14,5,11,11,2] [S,Y,V,F,G] [E,E,E,E,B] 8.59999 1fb4l 134 138 [3,7,0,0,0] [A,T,L,V,C] [E,E,E,E,E] 2.0 1fb4l 135 139 [7,0,0,0,2] [T,L,V,C,L] [E,E,E,E,E] 1.8 1fb4l 136 140 [0,0,0,2,0] [L,V,C,L,I] [E,E,E,E,E] 0.4 1fb4l 137 141 [0,0,2,0,13] [V,C,L,I,S] [E,E,E,E,E] 3.0 1fb4l 177 181 [2,0,1,0,5] [A,A,S,S,Y] [E,E,E,E,E] 1.6 1fb4l 178 182 [0,1,0,5,2] [A,S,S,Y,L] [E,E,E,E,E] 1.6
and required 18 seconds to complete on a Sun SPARCstation 1. Loading the protein into the Prolog system (consultation) required an additional 14 seconds.