
All Atom Representation in Prolog

Although many questions may be answered by regarding the protein structure at the residue level, some analyses require access to the individual atomic coordinates: for example, locating close approaches between residue sidechains in order to identify hydrophobic or electrostatic interactions. The analysis of all atoms creates several additional complications:

A greatly increased diversity of atom labelling must be handled.
A record must be kept of which atoms belong to which residue, distinguishing atoms that are in the protein chain from those that are not.
Non-protein atoms and groups that are often part of a Brookhaven coordinate entry must be accommodated, e.g. water molecules, haem and carbohydrate.
Combinatorial problems: e.g. searching for all close approaches is time consuming because there are roughly 10 times as many atoms as residues.
Storage and memory problems: full coordinate sets take up a lot of space.

A simple strategy for the representation of all-atom sets in Prolog was adopted, whereby each atom is represented by a Prolog fact of the form:



brk(I,RN,IN,ATYPE,CID,RTYPE,ATTYPE,XYZ)

where I is the atom number; RN is the residue number (e.g. 2); IN is the residue number insertion code (e.g. "-" for no insertion code); ATYPE is either atm or het, for protein ATOM or HETATM records respectively; CID is the chain identifier code (e.g. "1fb4l"); RTYPE is the amino acid type in three-letter code (e.g. val); ATTYPE is the atom type as a list including the atom insertion code (e.g. [cg1,-]); and XYZ is the atomic coordinates as a list. In the current implementation, the temperature factor, occupancy and footnote fields are not included.
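As an illustration of this layout, the two facts below are reconstructed from the first query solution shown later in this section (atoms 1 and 2 of 5chaa). Quoting '5chaa' and '-' is an assumption made so that the facts load in a present-day Prolog system; the original files may have been written differently.

```prolog
% brk(I, RN, IN, ATYPE, CID, RTYPE, ATTYPE, XYZ)
% Values copied from the first solution printed in this section.
% The quotes around '5chaa' and '-' are an assumption for portability.
brk(1, 1, '-', atm, '5chaa', cys, [n,  '-'], [40.935, 13.504, 1.417]).
brk(2, 1, '-', atm, '5chaa', cys, [ca, '-'], [40.345, 14.599, 2.14]).
```

With these loaded, a goal such as brk(I,_,_,atm,_,cys,[ca,_],XYZ) binds I to 2 and XYZ to the CA coordinates.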

The PDB CONECT records are converted to bond clauses where each clause has the form:



bond(I,J,Type)

signifying a bond between atoms I and J of type Type. Type may be one of the following:

covalent
hbond_da (I is donor, J is acceptor in hydrogen bond)
hbond_ad (I is acceptor, J is donor)
saltb_neg (I is negative partner in salt bridge)
saltb_pos (I is positive partner)

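Because a hydrogen bond may be stored with either partner first, a small helper can present both forms uniformly. The following is a minimal sketch only: the predicate hbond/2 and the atom numbers in the bond/3 facts are hypothetical and are not taken from any real entry.

```prolog
% Hypothetical bond/3 facts, for illustration only.
bond(6,   893, covalent).
bond(101, 205, hbond_da).   % 101 is the donor, 205 the acceptor
bond(310, 150, hbond_ad).   % 310 is the acceptor, 150 the donor

% hbond(Donor, Acceptor): hypothetical helper that hides the storage order.
hbond(Donor, Acceptor) :- bond(Donor, Acceptor, hbond_da).
hbond(Donor, Acceptor) :- bond(Acceptor, Donor, hbond_ad).
```

The goal hbond(D,A) then enumerates donor/acceptor pairs regardless of which Type was recorded.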
This format of a PDB entry may be used directly for analysis in Prolog. For example, given the rule rdist/3, which returns the linear distance between two points in space, we can readily calculate the distance between any pair of atoms simply by typing:



| ?- brk(I,RN1,IN1,ATYPE1,CID1,RTYPE1,ATTYPE1,XYZ1),
     brk(J,RN2,IN2,ATYPE2,CID2,RTYPE2,ATTYPE2,XYZ2),
     J > I,
     rdist(XYZ1,XYZ2,Distance).

which returns as the first solution:



I = RN1 = RN2 = 1,
IN1 = IN2 = -,
ATYPE1 = ATYPE2 = atm,
CID1 = CID2 = 5chaa,
RTYPE1 = RTYPE2 = cys,
ATTYPE1 = [n,-],
XYZ1 = [40.935,13.504,1.417],
J = 2,
ATTYPE2 = [ca,-],
XYZ2 = [40.345,14.599,2.14],
Distance = 1.43871
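The definition of rdist/3 is not listed in this section. A minimal sketch, assuming it is simply the Euclidean distance between two three-element coordinate lists, might read:

```prolog
% rdist(+XYZ1, +XYZ2, -Distance)
% Assumed definition: Euclidean distance between two [X,Y,Z] lists.
rdist([X1,Y1,Z1], [X2,Y2,Z2], Distance) :-
    DX is X1 - X2,
    DY is Y1 - Y2,
    DZ is Z1 - Z2,
    Distance is sqrt(DX*DX + DY*DY + DZ*DZ).
```

Applied to the two coordinate lists in the solution above, this definition gives a value that rounds to the 1.43871 reported.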

It is a simple matter to restrict the distance search to atoms of a particular type. For example, to search for close approaches between cys sulphur atoms:



| ?- brk(I,RN1,IN1,ATYPE1,CID1,cys,[sg,_],XYZ1),
     brk(J,RN2,IN2,ATYPE2,CID2,cys,[sg,_],XYZ2),
     J > I,
     rdist(XYZ1,XYZ2,Distance),
     Distance < 5.

I = 6,
RN1 = 1,
IN1 = IN2 = -,
ATYPE1 = ATYPE2 = atm,
CID1 = CID2 = 5chaa,
XYZ1 = [37.649,15.819,1.913],
J = 893,
RN2 = 122,
XYZ2 = [36.339,14.497,2.687],
Distance = 2.01565
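Rather than stepping through solutions one at a time, all such pairs can be gathered into a list with findall/3. The sketch below is illustrative: close_sg_pairs/2 is a hypothetical name, rdist/3 is an assumed Euclidean definition, and the two brk/8 facts are copied from the solution above with quoting added.

```prolog
% The two SG atoms from the solution above ('5chaa' and '-' quoted by assumption).
brk(6,   1,   '-', atm, '5chaa', cys, [sg,'-'], [37.649, 15.819, 1.913]).
brk(893, 122, '-', atm, '5chaa', cys, [sg,'-'], [36.339, 14.497, 2.687]).

% Assumed Euclidean definition of rdist/3, repeated so the block loads alone.
rdist([X1,Y1,Z1], [X2,Y2,Z2], D) :-
    D is sqrt((X1-X2)*(X1-X2) + (Y1-Y2)*(Y1-Y2) + (Z1-Z2)*(Z1-Z2)).

% close_sg_pairs(+Cutoff, -Pairs): hypothetical helper collecting every
% cys SG pair closer than Cutoff as pair(I, J, Distance), each pair once.
close_sg_pairs(Cutoff, Pairs) :-
    findall(pair(I, J, D),
            ( brk(I, _, _, _, _, cys, [sg,_], XYZ1),
              brk(J, _, _, _, _, cys, [sg,_], XYZ2),
              J > I,
              rdist(XYZ1, XYZ2, D),
              D < Cutoff ),
            Pairs).
```

Calling close_sg_pairs(5, Pairs) on these two facts yields a single pair(6,893,D) term.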

Or, to identify close approaches between water molecules and glutamate residues and write out the findings in Prolog clausal form:



| ?- brk(I,RN1,IN1,het,CID1,hoh,ATTYPE1,XYZ1),
     brk(J,RN2,IN2,atm,CID2,glu,ATTYPE2,XYZ2),
     rdist(XYZ1,XYZ2,Distance),Distance < 3,
     writeq(water_glu(water(I,[RN1,IN1],ATTYPE1,CID1),
                      glu(J,[RN2,IN2],ATTYPE2,CID2),Distance)),
     nl,fail.



water_glu(water(3603,[554,-],[o,-],5chaa),glu(123,[20,-],[oe1,-],5chaa),2.93873)
water_glu(water(3606,[557,-],[o,-],5chaa),glu(2264,[70,-],[cb,-],5chab),2.98971)
water_glu(water(3638,[589,-],[o,-],5chaa),glu(1898,[21,-],[ca,-],5chab),2.76785)
water_glu(water(3638,[589,-],[o,-],5chaa),glu(1899,[21,-],[c,-],5chab),2.90702)
water_glu(water(3638,[589,-],[o,-],5chaa),glu(1902,[21,-],[cg,-],5chab),2.49055)
water_glu(water(3644,[595,-],[o,-],5chaa),glu(492,[70,-],[cb,-],5chaa),2.85485)
water_glu(water(3648,[599,-],[o,-],5chaa),glu(551,[78,-],[cb,-],5chaa),2.77389)
water_glu(water(3663,[614,-],[o,-],5chaa),glu(2263,[70,-],[o,-],5chab),2.87087)
water_glu(water(3680,[631,-],[o,-],5chaa),glu(1895,[20,-],[oe1,-],5chab),2.3201)
water_glu(water(3723,[674,-],[o,-],5chaa),glu(120,[20,-],[cb,-],5chaa),2.8711)
water_glu(water(3723,[674,-],[o,-],5chaa),glu(125,[21,-],[n,-],5chaa),2.8872)
water_glu(water(3724,[675,-],[o,-],5chaa),glu(2121,[49,-],[oe2,-],5chab),2.20559)

Consulting (loading into the Prolog system) the 3719 brk/8 clauses for protein 5cha took 46 seconds. The query then required 75 seconds to run, giving 121 seconds in total. When the brk/8 clauses were compiled into the Prolog system, the execution time was reduced to 30 seconds. Unfortunately, compilation required 162 seconds, so the compiled route took 192 seconds in total: a net loss in overall execution time for a single query.
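The section does not say how these timings were obtained. One way to take comparable measurements, assuming a Prolog system that provides statistics/2 with the runtime key (SWI-Prolog and SICStus both do), is sketched below; timed/2 is a hypothetical helper name.

```prolog
% timed(+Goal, -Millis): hypothetical helper.
% Runs Goal once, ignoring failure, and reports the CPU time used in
% milliseconds as given by statistics(runtime, ...).
timed(Goal, Millis) :-
    statistics(runtime, [T0|_]),
    ( call(Goal) -> true ; true ),
    statistics(runtime, [T1|_]),
    Millis is T1 - T0.
```

A failure-driven loop such as the water and glutamate query can be passed as Goal; because failure is ignored, the elapsed time is still reported.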

The ease with which these simple queries can be executed in Prolog belies the complications that would be necessary to provide such flexibility in a conventional Fortran or C program. As with Prolog, the conventional program would first have to read the complete dataset into its chosen internal representation. A general-purpose command parser would then need to be written to enable the operator to tell the program which comparison was required. A general selection routine would also be required to enable the operator to choose which subset of atoms is required for the comparison. Whilst all these routines could certainly be provided in a Fortran program, Prolog provides a far more concise route to such analyses.
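To make the conciseness argument concrete, a general selection routine of the kind described can be sketched in a few lines of Prolog. Here close_approach/6 is a hypothetical name, rdist/3 is an assumed Euclidean definition, and the four sample facts are copied from the query solutions earlier in this section, with quoting added.

```prolog
% Sample facts taken from the solutions printed earlier in this section.
brk(1,   1,   '-', atm, '5chaa', cys, [n, '-'], [40.935, 13.504, 1.417]).
brk(2,   1,   '-', atm, '5chaa', cys, [ca,'-'], [40.345, 14.599, 2.14]).
brk(6,   1,   '-', atm, '5chaa', cys, [sg,'-'], [37.649, 15.819, 1.913]).
brk(893, 122, '-', atm, '5chaa', cys, [sg,'-'], [36.339, 14.497, 2.687]).

% Assumed Euclidean definition of rdist/3, repeated so the block loads alone.
rdist([X1,Y1,Z1], [X2,Y2,Z2], D) :-
    D is sqrt((X1-X2)*(X1-X2) + (Y1-Y2)*(Y1-Y2) + (Z1-Z2)*(Z1-Z2)).

% close_approach(?RType1, ?RType2, +Cutoff, -I, -J, -Distance)
% Hypothetical general selector: any two distinct atoms, drawn from
% residues of the given types, that lie closer than Cutoff.
close_approach(RType1, RType2, Cutoff, I, J, Distance) :-
    brk(I, _, _, _, _, RType1, _, XYZ1),
    brk(J, _, _, _, _, RType2, _, XYZ2),
    I =\= J,
    rdist(XYZ1, XYZ2, Distance),
    Distance < Cutoff.
```

Leaving RType1 or RType2 unbound selects atoms of any residue type, so the same clause covers the cys and the water/glutamate style of search; unification does the work that a command parser and selection routine would do in a conventional program. Each unordered pair is reported in both orders when the two types coincide.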


gjb@bioch.ox.ac.uk