Next: Secondary Structure Definition Up: No Title Previous: Getting All Solutions

Derived Information

So far we have described an extremely simple database of protein structure-related information and shown how Prolog can be used to query the data. This simple database forms a hub of basic facts around which the more detailed structural data of our system are organised.

Although PDB coordinate files contain the coordinates of individual atoms in a protein, one frequently does not require this level of detail. It is often sufficient to consider the structure at the level of the amino acid residues. For example, one may be interested in the secondary structure of a particular residue, (- , - or turn), the accessibility to solvent, the angles and so on. The Kabsch and Sander secondary structure definition program DSSP [8], reads a Brookhaven coordinate file and generates a file of definitions for the secondary structure of the protein. In addition to defining the structural state of each residue, DSSP also generates a table containing the main chain hydrogen bonds, accessibility, angles and coordinates together with other relevant information. Accordingly, the table provides an invaluable reference source of residue-level information for the protein. Simply converting these tables to Prolog clauses allows the power of Prolog to be used in queries of the data. A sample of the table output by DSSP is shown here:



    1    1 A V              0   0  139    0, 0.0   2,-0.6   0, 0.0 130, 0.0   0.000 360.0 360.0 360.0 130.6    6.9   17.8    4.7
    2    2 A L        -     0   0   19  125,-0.1 122, 0.0 126,-0.1   5, 0.0  -0.711 360.0-142.7 -80.5 121.9   10.6   17.9    4.2
    3    3 A S     >  -     0   0   44   -2,-0.6   4,-2.0   1,-0.1  -1, 0.0  -0.301  27.8-105.5 -71.3 164.8   12.3   19.8    7.0
    4    4 A P  H  > S+     0   0   98    0, 0.0   4,-2.4   0, 0.0  -1,-0.1   0.880 125.4  55.4 -62.3 -35.0   15.3   22.0    6.2
    5    5 A A  H  > S+     0   0   59    1,-0.2   4,-2.4   2,-0.2  -3, 0.0   0.907 104.9  52.7 -65.0 -41.5   17.5   19.4    7.7
    6    6 A D  H  > S+     0   0   15    1,-0.2   4,-2.3   2,-0.2  -1,-0.2   0.918 109.0  50.3 -56.3 -42.8   15.9   16.9    5.2
    7    7 A K  H  X S+     0   0   46   -4,-2.0   4,-2.5   1,-0.2  -2,-0.2   0.881 109.2  50.2 -64.7 -36.2   16.7   19.2    2.4
    8    8 A T  H  X S+     0   0   90   -4,-2.4   4,-2.0   1,-0.2  -1,-0.2   0.910 111.6  49.6 -63.3 -42.7   20.4   19.6    3.5

The program BRKSAND converts this table format into Prolog clauses, one for each residue. For example:



ks(    1,2hhb,2hhba,[   1,-],V,-,[-,-,-],[   0,   0,-],
 139,[   0, 0.0],[   2,-0.6],[   0, 0.0],[ 130, 0.0],[ 0.000, 360.0, 360.0],[ 360.0, 130.6],[    6.9,   17.8,    4.7]).
ks(    2,2hhb,2hhba,[   2,-],L,-,[n,-,-],[   0,   0,-],
  19,[ 125,-0.1],[ 122, 0.0],[ 126,-0.1],[   5, 0.0],[-0.711, 360.0,-142.7],[ -80.5, 121.9],[   10.6,   17.9,    4.2]).
ks(    3,2hhb,2hhba,[   3,-],S,-,[n,-,-],[   0,   0,-],
  44,[  -2,-0.6],[   4,-2.0],[   1,-0.1],[  -1, 0.0],[-0.301,  27.8,-105.5],[ -71.3, 164.8],[   12.3,   19.8,    7.0]).
ks(    4,2hhb,2hhba,[   4,-],P,H,[p,-,-],[   0,   0,-],
  98,[   0, 0.0],[   4,-2.4],[   0, 0.0],[  -1,-0.1],[ 0.880, 125.4,  55.4],[ -62.3, -35.0],[   15.3,   22.0,    6.2]).
ks(    5,2hhb,2hhba,[   5,-],A,H,[p,-,-],[   0,   0,-],
  59,[   1,-0.2],[   4,-2.4],[   2,-0.2],[  -3, 0.0],[ 0.907, 104.9,  52.7],[ -65.0, -41.5],[   17.5,   19.4,    7.7]).
ks(    6,2hhb,2hhba,[   6,-],D,H,[p,-,-],[   0,   0,-],
  15,[   1,-0.2],[   4,-2.3],[   2,-0.2],[  -1,-0.2],[ 0.918, 109.0,  50.3],[ -56.3, -42.8],[   15.9,   16.9,    5.2]).
ks(    7,2hhb,2hhba,[   7,-],K,H,[p,-,-],[   0,   0,-],
  46,[  -4,-2.0],[   4,-2.5],[   1,-0.2],[  -2,-0.2],[ 0.881, 109.2,  50.2],[ -64.7, -36.2],[   16.7,   19.2,    2.4]).
ks(    8,2hhb,2hhba,[   8,-],T,H,[p,-,-],[   0,   0,-],

The general form of a ks fact is as follows:



ks(N,PID,CID,RNUM,AA,SS,CLAD,BPSHEET,ACC,NHO1,ONH1,NHO2,ONH2,TKA,PHIPSI,XYZ).

where:



N          unique atom identifier in this entry   (integer)
PID        Protein identifer code                 (character string)
CID        Chain identifer code                   (  "          "  )
RNUM       residue number  List: [integer,character insert code]
AA         amino acid code (Uppercase character - cys in bridge 
                                                       is lowercase)
SS         secondary structure (Uppercase character)
CLAD       [chirality,ladder1,ladder2]  where
                                        chirality = n (-ve), p (+ve)
                                        ladder1/2 = uppercase character
BPSHEET    [BP1,BP2,SHeet] BP1/2 give values of N for bridges, Sheet is
                                        a character that labels the sheet
ACC        accessibility (Angstroms**2) (int)
NHO1/2,
ONH1/2     [Ndiff,energy]  Ndiff is the difference in N value to the Hbond 
           partner, energy is the energy (kcal/mol)  (int,real)
TKA        Angles: [TCO,KAPPA,ALPHA]  (All real numbers)
PHIPSI     [PHI,PSI]  phi,psi angles (both real)
XYZ        [X,Y,Z]  coordinates of this CA atom  (all reals)

For details of the specific structural meaning of each of these values, see Kabsch and Sander [8].

The first argument in the ks/16 clause is a unique numeric residue number, where the numbers run consecutively from the beginning of the protein entry. This numbering scheme overcomes the difficulties associated with the standard PDB numbering system, by allowing simple questions of the form, `what is the residue 4 amino acids N-terminal of residue 37 to be readily answered.

If the ks clauses for a protein are loaded into the Prolog system, then queries may be made. For example, to find the names of residues with accessibilities and with positive angles one could type:



| ?- ks(_,_,_,_,AA,_,_,_,ACC,_,_,_,_,_,[PHI,_],_),ACC < 100, PHI >0.

Clearly, this is a rather cumbersome interface to the data. In order to simplify this sort of query, a set of high level procedures have been written to allow different aspects of the data to be extracted. The procedure get/4 allows a particular field to be returned:



| ?- get(N,CID,Entry,Value)

N and CID are the unique residue number and chain identifier. Entry is one of the strings: ks_number, protein_identifier, residue_number, amino_acid, amino, accessibility, hbondNO1, hbondON1, hbondNO2, hbondON2, hbonds, tka, phipsi, xyz, x,y,z, sec. These values follow the arguments of the ks procedure, hbonds causes backtracking through all four hbond types that are stored, whilst amino_acid converts the code letters for residues involved in disulphide bridges into the normal amino acid codes. Value holds the returned value of the procedure.

For example, to perform the same query as above we would type:



| ?- get(N,CID,accessibility,ACC),ACC < 100, get(N,CID,phipsi,[PHI,_]),PHI > 0.

which reads, `Look up an accessibilty value, if this value is less than 100 then look up the value of for the residue, if this is greater than 0, then display the result.

Frequently, one requires a range of values that start and end at particular residues in the protein chain. The getR procedure performs this task. For example:



| ?- getR(2,10,1fb4l,amino_acid,AA)

which will return the value for AA of:



AA = [S,V,L,T,Q,P,P,S,A]

The getR procedure can be used for far more complex queries. For example, the following query will find the sequence, accessibility and summary of secondary structure for all pentapeptides that have a mean accessibilty of less than .



CID = 1fb4l,                %just set the chain to 1fb4l for now
kchain_range(S,E,CID),      %find the start (S) and end (E) residue numbers
SM1 is S - 1,               %find S - 1 number
LST is E - 4,               %find the last possible START for a pentapeptide
next1(SM1,START,LST),       %on backtracking this procedure returns START as
                            %successive values from SM1 to LST.
END is START + 4,           %specify the last residue of the pentapeptide
getR(START,END,CID,accessibility,ACC),  %get the accessibilities for this
                                        %pentapeptide
mean(MEAN,ACC),             %find the mean accessibility
MEAN < 10,                  %test if mean is < 10 for this pentapeptide,
getR(START,END,CID,amino_acid,AA),      %look up the amino acid sequence
getR(START,END,CID,sec,SEC),            % and summary of secondary structure
write_list([CID,START,END,ACC,AA,SEC,MEAN]). %write out the information we want

This query produces the following output:



1fb4l 32 36 [9,34,0,0,1] [I,T,V,N,W] [S,-,-,E,E] 8.79999
1fb4l 33 37 [34,0,0,1,0] [T,V,N,W,Y] [-,-,E,E,E] 7.0
1fb4l 34 38 [0,0,1,0,34] [V,N,W,Y,Q] [-,E,E,E,E] 7.0
1fb4l 35 39 [0,1,0,34,2] [N,W,Y,Q,Q] [E,E,E,E,E] 7.39999
1fb4l 72 76 [2,34,0,4,1] [A,S,L,A,I] [E,E,E,E,E] 8.2
1fb4l 85 89 [10,9,0,11,0] [S,D,Y,Y,C] [S,E,E,E,E] 6.0
1fb4l 86 90 [9,0,11,0,0] [D,Y,Y,C,A] [E,E,E,E,E] 4.0
1fb4l 87 91 [0,11,0,0,0] [Y,Y,C,A,S] [E,E,E,E,E] 2.2
1fb4l 88 92 [11,0,0,0,1] [Y,C,A,S,W] [E,E,E,E,E] 2.4
1fb4l 89 93 [0,0,0,1,0] [C,A,S,W,N] [E,E,E,E,E] 0.2
1fb4l 90 94 [0,0,1,0,34] [A,S,W,N,S] [E,E,E,E,T] 7.0
1fb4l 98 102 [14,5,11,11,2] [S,Y,V,F,G] [E,E,E,E,B] 8.59999
1fb4l 134 138 [3,7,0,0,0] [A,T,L,V,C] [E,E,E,E,E] 2.0
1fb4l 135 139 [7,0,0,0,2] [T,L,V,C,L] [E,E,E,E,E] 1.8
1fb4l 136 140 [0,0,0,2,0] [L,V,C,L,I] [E,E,E,E,E] 0.4
1fb4l 137 141 [0,0,2,0,13] [V,C,L,I,S] [E,E,E,E,E] 3.0
1fb4l 177 181 [2,0,1,0,5] [A,A,S,S,Y] [E,E,E,E,E] 1.6
1fb4l 178 182 [0,1,0,5,2] [A,S,S,Y,L] [E,E,E,E,E] 1.6

and required 18 seconds to complete on a Sun SPARCstation 1. Loading the protein into the Prolog system (consultation) required an additional 14 seconds.

Next: Secondary Structure Definition Up: No Title Previous: Getting All Solutions

gjb@bioch.ox.ac.uk