A Fortran program (BRKSEQ) reads the PDB files and extracts the information needed to describe each protein entry by up to eleven different types of Prolog clauses:
    header(Ident,List).
    compnd(Ident,List).
    source(Ident,List).
    resolution(Ident,R).
    chain(Ident,Chcode).
    nchains(Ident,N).
    chain_range(Chcode,[Cstart,CstartIN],[Cend,CendIN]).
    chain_length(Chcode,Len).
    residues(Chcode).
    no_mainchain(Chcode).
    no_sidechains(Chcode).
Not all clauses need be present for a particular entry, as shown for the immunoglobulin structure 2fb4:
    header(2fb4,[immunoglobulin,18-apr-89,2fb4]).
    compnd(2fb4,[immunoglobulin,fab]).
    source(2fb4,[human,(homo,sapiens),myeloma,patient,kol,serum]).
    resolution(2fb4, 1.900).
    nchains(2fb4, 2).
    chain(2fb4,2fb4l).
    chain_range(2fb4l,[ 1,-],[ 214,-]).
    chain_length(2fb4l, 216).
    residues(2fb4l).
    chain(2fb4,2fb4h).
    chain_range(2fb4h,[ 1,-],[ 221,-]).
    chain_length(2fb4h, 229).
    residues(2fb4h).
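As a minimal sketch of how such clauses might be queried, the following assumes a standard Prolog system such as SWI-Prolog, in which identifiers that begin with a digit have to be written as quoted atoms. The quoting and the helper predicate chains_of/2 are illustrative assumptions and are not part of the BRKSEQ output.

```prolog
% A subset of the 2fb4 facts, transcribed from the listing above.
% Assumption: atoms starting with a digit are quoted so that a
% standard Prolog reader accepts them.
resolution('2fb4', 1.900).
nchains('2fb4', 2).
chain('2fb4', '2fb4l').
chain('2fb4', '2fb4h').
chain_length('2fb4l', 216).
chain_length('2fb4h', 229).

% chains_of(+Ident, -Pairs): hypothetical helper that pairs each
% chain code of an entry with its residue count, in clause order.
chains_of(Ident, Pairs) :-
    findall(Chcode-Len,
            ( chain(Ident, Chcode),
              chain_length(Chcode, Len) ),
            Pairs).
```

With these facts loaded, the query `?- chains_of('2fb4', P).` binds P to `['2fb4l'-216, '2fb4h'-229]`.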
The header, compnd, source and resolution clauses are
extracted directly from the information stored at the beginning of every PDB
file. The Ident is the PDB identification code for the protein (e.g.
9lyz), List is a Prolog list containing textual information, and R is the resolution of the structure in ångströms. The remaining seven
clauses are derived from an analysis of the PDB ATOM records. The chain
clauses link the PDB identification code to the chain code Chcode whilst
nchains simply lists how many chains are present in the PDB entry. The
nchains clause is included for simplicity, though it is strictly unnecessary
since a Prolog rule could be used to count the number of chain clauses
present for each protein. For every chain clause, there is one chain_range clause that specifies the starting and ending residue numbers of
the chain. Similarly, there is a chain_length clause that states the
number of residues present in the chain. This clause is essential because the
alphanumeric residue numbering scheme used by the PDB means that the length cannot be computed from the range alone: chain 2fb4l, for example, spans residues 1 to 214 but contains 216 residues. The residues
clause identifies a chain as having amino acid residues other than UNK (or
X), whilst the presence of a no_mainchain or no_sidechains clause
for a chain shows that the protein entry is incomplete (some PDB entries
contain only mainchain atoms, or only Cα atoms).
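The counting rule mentioned above, together with a completeness check based on the no_mainchain and no_sidechains clauses, could be sketched as follows. This is only an illustration under stated assumptions: the predicate names count_chains/2 and complete_chain/1 are hypothetical, the atoms are quoted for a standard Prolog reader, and the dynamic declaration is needed only because 2fb4 has no clauses of those two types.

```prolog
% Facts transcribed from the 2fb4 example (atoms quoted by assumption).
chain('2fb4', '2fb4l').
chain('2fb4', '2fb4h').
nchains('2fb4', 2).

% 2fb4 has no no_mainchain/1 or no_sidechains/1 facts; declaring the
% predicates dynamic makes calls to them fail quietly instead of
% raising an existence error.
:- dynamic no_mainchain/1, no_sidechains/1.

% count_chains(+Ident, -N): N is the number of chain/2 clauses
% recorded for the entry Ident.
count_chains(Ident, N) :-
    findall(Chcode, chain(Ident, Chcode), Chcodes),
    length(Chcodes, N).

% complete_chain(?Chcode): hypothetical check that a known chain is
% flagged neither as lacking mainchain atoms nor as lacking sidechains.
complete_chain(Chcode) :-
    chain(_, Chcode),
    \+ no_mainchain(Chcode),
    \+ no_sidechains(Chcode).
```

For the facts above, `?- count_chains('2fb4', N).` gives `N = 2`, in agreement with the stored nchains clause, and `?- complete_chain('2fb4l').` succeeds.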